Details

    Description

      The writing of modified InnoDB data pages to data files should be overhauled. See some of the comments in MDEV-15058.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.4 [ 22408 ]
            marko Marko Mäkelä made changes -

            Page flushing performance should be improved by removing the unsorted buf_pool->flush_list, and always sorting the writes in the same way as buf_pool->flush_rbt does. In this way, there could be more progress from the page writes. This should also allow crash recovery to proceed in the background (MDEV-14481) while the server is already accepting connections.

            marko Marko Mäkelä added a comment - Page flushing performance should be improved by removing the unsorted buf_pool->flush_list , and always sorting the writes in the same way as buf_pool->flush_rbt does. In this way, there could be more progress from the page writes. This should also allow crash recovery to proceed in the background ( MDEV-14481 ) while the server is already accepting connections.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            We should look at implementing direct I/O. Where applicable, perhaps we could also replace `fsync()` with asynchronous `IOCB_CMD_FDSYNC` operations.

            marko Marko Mäkelä added a comment - We should look at implementing direct I/O. Where applicable, perhaps we could also replace `fsync()` with asynchronous `IOCB_CMD_FDSYNC` operations.
            marko Marko Mäkelä made changes -

            Craig Ringer on the PostgreSQL mailing list shared a link to his fsync() test program, which suggests fault injection by dmsetup.

            I think that we should perform this kind of fault injection testing for InnoDB.

            marko Marko Mäkelä added a comment - Craig Ringer on the PostgreSQL mailing list shared a link to his fsync() test program , which suggests fault injection by dmsetup . I think that we should perform this kind of fault injection testing for InnoDB.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä added a comment - The FOSDEM 2019 talk on PostgreSQL vs. fsync is worth watching.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            NRE Projects RM_105_CANDIDATE
            marko Marko Mäkelä made changes -

            In the function buf_flush_write_block_low(), there is a call log_write_up_to(bpage->newest_modification, true), which seems out of the place. Yes, we must ensure that we will not write out any pages before writing the corresponding redo log (write-ahead logging). But, maybe we could just move to the next block in the flush_list whose newest modification is old enough to be written out. Only if we have run out of all such pages, it would seem to make sense to wait for a flush of the redo log. In this way, perhaps the page flushing could be managed by fewer threads.

            marko Marko Mäkelä added a comment - In the function buf_flush_write_block_low() , there is a call log_write_up_to(bpage->newest_modification, true) , which seems out of the place. Yes, we must ensure that we will not write out any pages before writing the corresponding redo log (write-ahead logging). But, maybe we could just move to the next block in the flush_list whose newest modification is old enough to be written out. Only if we have run out of all such pages, it would seem to make sense to wait for a flush of the redo log. In this way, perhaps the page flushing could be managed by fewer threads.

            As noted in MDEV-19356, it might be worthwhile for fil_system.LRU to prefer closing those files for which there are no pending changes in the buffer pool.

            marko Marko Mäkelä added a comment - As noted in MDEV-19356 , it might be worthwhile for fil_system.LRU to prefer closing those files for which there are no pending changes in the buffer pool.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            As pointed out in MySQL Bug #94912, writes to O_DIRECT files do not imply fdatasync(). We must keep this in mind.

            Perhaps all persistent InnoDB files (data and redo log files) should be written to in the same way.

            marko Marko Mäkelä added a comment - As pointed out in MySQL Bug #94912, writes to O_DIRECT files do not imply fdatasync() . We must keep this in mind. Perhaps all persistent InnoDB files (data and redo log files) should be written to in the same way.
            marko Marko Mäkelä made changes -
            anjumnaveed81 Anjum Naveed added a comment -

            I will love to contribute with this. Please let me know how can I be of help.

            anjumnaveed81 Anjum Naveed added a comment - I will love to contribute with this. Please let me know how can I be of help.
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            anjumnaveed81, we are in the early planning stage, and the plan is to completely overhaul the InnoDB buffer pool I/O interface in 10.5.
            If you have some design ideas, feel free to write them here.

            marko Marko Mäkelä added a comment - anjumnaveed81 , we are in the early planning stage, and the plan is to completely overhaul the InnoDB buffer pool I/O interface in 10.5. If you have some design ideas, feel free to write them here.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            thiru Thirunarayanan Balathandayuthapani made changes -
            marko Marko Mäkelä made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            zhaiwx1987 zhai weixiang added a comment -

            It would be great if we can flush page without acquiring S lock on block. For example we can copy out the page before IO

            zhaiwx1987 zhai weixiang added a comment - It would be great if we can flush page without acquiring S lock on block. For example we can copy out the page before IO

            zhaiwx1987, such ‘shadow page’ flushing is indeed being used by other databases. MariaDB already uses separate write buffers for page_compressed and encrypted tables, but probably unnecessarily keeps holding the page latch. It is definitely worth trying, and could make log checkpoint faster (even after we have addressed MDEV-14462).

            wlad, as noted in MDEV-16264, I think that SRV_MAX_N_IO_THREADS and any related code and variables should be removed.

            marko Marko Mäkelä added a comment - zhaiwx1987 , such ‘shadow page’ flushing is indeed being used by other databases. MariaDB already uses separate write buffers for page_compressed and encrypted tables, but probably unnecessarily keeps holding the page latch. It is definitely worth trying, and could make log checkpoint faster (even after we have addressed MDEV-14462 ). wlad , as noted in MDEV-16264 , I think that SRV_MAX_N_IO_THREADS and any related code and variables should be removed.
            marko Marko Mäkelä made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Vladislav Vaintroub [ wlad ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä added a comment - - edited

            The goal of this task is to ensure that a single buffer pool instance will not perform worse in a write-heavy benchmark than multiple buffer pools. This is a potential concern with MDEV-15058, and will be hopefully helped by MDEV-15053 already.

            When running the benchmark, you should disable the InnoDB doublewrite buffer, because that is an obvious bottleneck.

            marko Marko Mäkelä added a comment - - edited The goal of this task is to ensure that a single buffer pool instance will not perform worse in a write-heavy benchmark than multiple buffer pools. This is a potential concern with MDEV-15058 , and will be hopefully helped by MDEV-15053 already. When running the benchmark, you should disable the InnoDB doublewrite buffer, because that is an obvious bottleneck.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            I think that as part of this task, we should try to simplify buf_flush_write_block_low(). Can we avoid acquiring log_sys.mutex? Instead of calling log_write_up_to() in that function, perhaps the page flush can simply skip those pages whose buf_page_t::newest_modification (which by the way should actually be removed, and replaced with direct access to FIL_PAGE_LSN) is newer than what has been persisted in redo log (log_sys.write_lsn)? Then, between flush batches, trigger the log flush (if it cannot be done by the dedicated log writer task; see MDEV-14462) to ensure progress. I think that we should change the type of log_sys.write_lsn and log_sys.lsn to std::atomic<lsn_t>.

            Furthermore, I think that we should try to remove buf_pool_t::flush_rbt (whose purpose is to ensure during redo log apply that buf_pool_t::flush_list is kept in a more reasonable order) and try to always keep the flush_list sorted in a way that tries to ensure optimal progress. Maybe use std::priority_queue?

            We should remove the BUF_FLUSH_SINGLE_PAGE, but I think that we may have to keep both LRU and flush_list flushing. I may be mistaken, but based on MDEV-14550 I have the impression that in the worst case, the LRU and flush_list batches are taking too much time, blocking the progress of the other type of flushing batch. We should try to minimize mutex contention while keeping the I/O system saturated with useful page flushes (with large values of innodb_max_dirty_pages_pct, try to avoid repeated writes of a frequently changed page).

            marko Marko Mäkelä added a comment - I think that as part of this task, we should try to simplify buf_flush_write_block_low() . Can we avoid acquiring log_sys.mutex ? Instead of calling log_write_up_to() in that function, perhaps the page flush can simply skip those pages whose buf_page_t::newest_modification (which by the way should actually be removed, and replaced with direct access to FIL_PAGE_LSN ) is newer than what has been persisted in redo log ( log_sys.write_lsn )? Then, between flush batches, trigger the log flush (if it cannot be done by the dedicated log writer task; see MDEV-14462 ) to ensure progress. I think that we should change the type of log_sys.write_lsn and log_sys.lsn to std::atomic<lsn_t> . Furthermore, I think that we should try to remove buf_pool_t::flush_rbt (whose purpose is to ensure during redo log apply that buf_pool_t::flush_list is kept in a more reasonable order) and try to always keep the flush_list sorted in a way that tries to ensure optimal progress. Maybe use std::priority_queue ? We should remove the BUF_FLUSH_SINGLE_PAGE , but I think that we may have to keep both LRU and flush_list flushing. I may be mistaken, but based on MDEV-14550 I have the impression that in the worst case, the LRU and flush_list batches are taking too much time, blocking the progress of the other type of flushing batch. We should try to minimize mutex contention while keeping the I/O system saturated with useful page flushes (with large values of innodb_max_dirty_pages_pct , try to avoid repeated writes of a frequently changed page).
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            As part of this task, I think that the following threads that MDEV-16264 failed to convert to tasks should be removed or refactored:

            • fil_crypt_thread (its sole purpose is to make pages dirty, so that they will be re-encrypted on page flush)
            • recv_writer_thread and buf_flush_page_cleaner_coordinator have somewhat overlapping functionality. Not having a separate ‘crash recovery’ mode would seem to be a requirement of MDEV-14481.
            marko Marko Mäkelä added a comment - As part of this task, I think that the following threads that MDEV-16264 failed to convert to tasks should be removed or refactored: fil_crypt_thread (its sole purpose is to make pages dirty, so that they will be re-encrypted on page flush) recv_writer_thread and buf_flush_page_cleaner_coordinator have somewhat overlapping functionality. Not having a separate ‘crash recovery’ mode would seem to be a requirement of MDEV-14481 .
            marko Marko Mäkelä made changes -

            Mark Callaghan’s blog (re)post Historical - InnoDB IO Performance provides some historical overview that could be relevant for this task.

            marko Marko Mäkelä added a comment - Mark Callaghan’s blog (re)post Historical - InnoDB IO Performance provides some historical overview that could be relevant for this task.
            julien.fritsch Julien Fritsch made changes -
            Due Date 2020-04-03
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            It appears that XtraDB used to disable the single-page flushing when using a larger buffer pool, via the Boolean-disguised-as-enum parameter innodb_empty_free_list_algorithm=BACKOFF. The logic would sometimes fail, as reported in MDEV-16339.

            I think that we should consider removing the single-page flushing nevertheless. Perhaps there could be active signaling between the page cleaner and the buf_LRU_get_free_block(), instead of passive sleeping?

            marko Marko Mäkelä added a comment - It appears that XtraDB used to disable the single-page flushing when using a larger buffer pool, via the Boolean-disguised-as-enum parameter innodb_empty_free_list_algorithm=BACKOFF . The logic would sometimes fail, as reported in MDEV-16339 . I think that we should consider removing the single-page flushing nevertheless. Perhaps there could be active signaling between the page cleaner and the buf_LRU_get_free_block() , instead of passive sleeping?
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.5 [ 23123 ]
            julien.fritsch Julien Fritsch made changes -
            marko Marko Mäkelä made changes -
            julien.fritsch Julien Fritsch made changes -
            Due Date 2020-04-03
            julien.fritsch Julien Fritsch made changes -
            Assignee Vladislav Vaintroub [ wlad ] Marko Mäkelä [ marko ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            Now that most of the work has been completed in MDEV-23399 and MDEV-23855, I think that the main work to be done will be in MDEV-12227, MDEV-23756 and possibly other tickets that have been linked as related.

            Because this is mostly an umbrella task, having an estimate is not too meaningful.

            marko Marko Mäkelä added a comment - Now that most of the work has been completed in MDEV-23399 and MDEV-23855 , I think that the main work to be done will be in MDEV-12227 , MDEV-23756 and possibly other tickets that have been linked as related. Because this is mostly an umbrella task, having an estimate is not too meaningful.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Vladislav Vaintroub [ wlad ]
            wlad Vladislav Vaintroub made changes -
            Priority Critical [ 2 ] Major [ 3 ]

            I adjusted the priority to "Major", eventhough it is probably "minor" , for the rest of the things left there.

            wlad Vladislav Vaintroub added a comment - I adjusted the priority to "Major", eventhough it is probably "minor" , for the rest of the things left there.
            marko Marko Mäkelä made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.6 [ 24028 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Assignee Vladislav Vaintroub [ wlad ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.7 [ 24805 ]

            I think that most of this has already been done. One remaining tweak is MDEV-26827. We failed to observe any improvement with it; in fact, a small regression was observed. So, it will need some additional work.

            marko Marko Mäkelä added a comment - I think that most of this has already been done. One remaining tweak is MDEV-26827 . We failed to observe any improvement with it; in fact, a small regression was observed. So, it will need some additional work.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 87924 ] MariaDB v4 [ 130854 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.8 [ 26121 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.9 [ 26905 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 10.10 [ 27530 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 10.11 [ 27614 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]

            After MDEV-26055 and MDEV-26827 have been completed, there is not much that can be improved in this area. I can only think of MDEV-11378, submitting scatter-gather write requests instead of multiple single-page write requests.

            marko Marko Mäkelä added a comment - After MDEV-26055 and MDEV-26827 have been completed, there is not much that can be improved in this area. I can only think of MDEV-11378 , submitting scatter-gather write requests instead of multiple single-page write requests.
            marko Marko Mäkelä made changes -
            Fix Version/s 10.10.4 [ 28522 ]
            Fix Version/s 10.9.6 [ 28520 ]
            Fix Version/s 10.8.8 [ 28518 ]
            Fix Version/s 10.6.13 [ 28514 ]
            Fix Version/s 11.0.2 [ 28706 ]
            Fix Version/s 10.11.3 [ 28524 ]
            Fix Version/s 11.1.1 [ 28704 ]
            Fix Version/s 11.3 [ 28565 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.