Details

    Description

      The writing of modified InnoDB data pages to data files should be overhauled. See some of the comments in MDEV-15058.

      Attachments

        Issue Links

          Activity

            Page flushing performance should be improved by removing the unsorted buf_pool->flush_list, and always sorting the writes in the same way as buf_pool->flush_rbt does. In this way, there could be more progress from the page writes. This should also allow crash recovery to proceed in the background (MDEV-14481) while the server is already accepting connections.

            marko Marko Mäkelä added a comment

            We should look at implementing direct I/O. Where applicable, perhaps we could also replace `fsync()` with asynchronous `IOCB_CMD_FDSYNC` operations.

            marko Marko Mäkelä added a comment

            Craig Ringer on the PostgreSQL mailing list shared a link to his fsync() test program, which suggests fault injection by dmsetup.

            I think that we should perform this kind of fault injection testing for InnoDB.

            marko Marko Mäkelä added a comment
            The FOSDEM 2019 talk on PostgreSQL vs. fsync is worth watching.

            marko Marko Mäkelä added a comment

            In the function buf_flush_write_block_low(), there is a call log_write_up_to(bpage->newest_modification, true), which seems out of place. Yes, we must ensure that we will not write out any pages before writing the corresponding redo log (write-ahead logging). But maybe we could just move to the next block in the flush_list whose newest modification is old enough to be written out. Only once we have run out of all such pages would it seem to make sense to wait for a flush of the redo log. In this way, perhaps the page flushing could be managed by fewer threads.

            marko Marko Mäkelä added a comment

            As noted in MDEV-19356, it might be worthwhile for fil_system.LRU to prefer closing those files for which there are no pending changes in the buffer pool.

            marko Marko Mäkelä added a comment

            As pointed out in MySQL Bug #94912, writes to O_DIRECT files do not imply fdatasync(). We must keep this in mind.

            Perhaps all persistent InnoDB files (data and redo log files) should be written to in the same way.

            marko Marko Mäkelä added a comment
            anjumnaveed81 Anjum Naveed added a comment -

            I would love to contribute to this. Please let me know how I can be of help.


            anjumnaveed81, we are in the early planning stage, and the plan is to completely overhaul the InnoDB buffer pool I/O interface in 10.5.
            If you have some design ideas, feel free to write them here.

            marko Marko Mäkelä added a comment
            zhaiwx1987 zhai weixiang added a comment -

            It would be great if we could flush a page without acquiring an S lock on the block. For example, we could copy out the page before the I/O.


            zhaiwx1987, such ‘shadow page’ flushing is indeed being used by other databases. MariaDB already uses separate write buffers for page_compressed and encrypted tables, but probably unnecessarily keeps holding the page latch. It is definitely worth trying, and could make log checkpoint faster (even after we have addressed MDEV-14462).

            wlad, as noted in MDEV-16264, I think that SRV_MAX_N_IO_THREADS and any related code and variables should be removed.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            The goal of this task is to ensure that a single buffer pool instance will not perform worse in a write-heavy benchmark than multiple buffer pools. This is a potential concern with MDEV-15058, and should already be helped by MDEV-15053.

            When running the benchmark, you should disable the InnoDB doublewrite buffer, because that is an obvious bottleneck.


            I think that as part of this task, we should try to simplify buf_flush_write_block_low(). Can we avoid acquiring log_sys.mutex? Instead of calling log_write_up_to() in that function, perhaps the page flush can simply skip those pages whose buf_page_t::newest_modification (which by the way should actually be removed, and replaced with direct access to FIL_PAGE_LSN) is newer than what has been persisted in redo log (log_sys.write_lsn)? Then, between flush batches, trigger the log flush (if it cannot be done by the dedicated log writer task; see MDEV-14462) to ensure progress. I think that we should change the type of log_sys.write_lsn and log_sys.lsn to std::atomic<lsn_t>.

            Furthermore, I think that we should try to remove buf_pool_t::flush_rbt (whose purpose is to ensure during redo log apply that buf_pool_t::flush_list is kept in a more reasonable order) and try to always keep the flush_list sorted in a way that tries to ensure optimal progress. Maybe use std::priority_queue?

            We should remove the BUF_FLUSH_SINGLE_PAGE, but I think that we may have to keep both LRU and flush_list flushing. I may be mistaken, but based on MDEV-14550 I have the impression that in the worst case, the LRU and flush_list batches are taking too much time, blocking the progress of the other type of flushing batch. We should try to minimize mutex contention while keeping the I/O system saturated with useful page flushes (with large values of innodb_max_dirty_pages_pct, try to avoid repeated writes of a frequently changed page).

            marko Marko Mäkelä added a comment

            As part of this task, I think that the following threads that MDEV-16264 failed to convert to tasks should be removed or refactored:

            • fil_crypt_thread (its sole purpose is to make pages dirty, so that they will be re-encrypted on page flush)
            • recv_writer_thread and buf_flush_page_cleaner_coordinator have somewhat overlapping functionality. Not having a separate ‘crash recovery’ mode would seem to be a requirement of MDEV-14481.
            marko Marko Mäkelä added a comment

            Mark Callaghan’s blog (re)post Historical - InnoDB IO Performance provides some historical overview that could be relevant for this task.

            marko Marko Mäkelä added a comment

            It appears that XtraDB used to disable the single-page flushing when using a larger buffer pool, via the Boolean-disguised-as-enum parameter innodb_empty_free_list_algorithm=BACKOFF. The logic would sometimes fail, as reported in MDEV-16339.

            I think that we should consider removing the single-page flushing nevertheless. Perhaps there could be active signaling between the page cleaner and the buf_LRU_get_free_block(), instead of passive sleeping?

            marko Marko Mäkelä added a comment

            Now that most of the work has been completed in MDEV-23399 and MDEV-23855, I think that the main work to be done will be in MDEV-12227, MDEV-23756 and possibly other tickets that have been linked as related.

            Because this is mostly an umbrella task, having an estimate is not too meaningful.

            marko Marko Mäkelä added a comment

            I adjusted the priority to "Major", even though it is probably "Minor" for the rest of the things left to do here.

            wlad Vladislav Vaintroub added a comment

            I think that most of this has already been done. One remaining tweak is MDEV-26827. We failed to observe any improvement with it; in fact, a small regression was observed. So, it will need some additional work.

            marko Marko Mäkelä added a comment

            After MDEV-26055 and MDEV-26827 have been completed, there is not much that can be improved in this area. I can only think of MDEV-11378, submitting scatter-gather write requests instead of multiple single-page write requests.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes: 1
              Watchers: 12

