MariaDB Server / MDEV-19514

Defer change buffer merge until pages are requested

Details

    Description

      For MDEV-14481, we must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread.

      This would also simplify MDEV-16526, because the change buffer I/O would not have to be treated as a special case.

      The change buffer format will not be changed as part of this task. That could remain part of MDEV-11634.

      As part of this, the counter innodb_ibuf_merge_usec will be removed from information_schema.innodb_metrics.
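The idea of deferring the merge can be illustrated with a minimal toy model in Python. This is not InnoDB code: the class `ToyChangeBuffer` and its methods are hypothetical names, pages are plain lists, and disk I/O is simulated with a dictionary. The sketch only shows that changes buffered for a page that is not cached are applied when that page is next requested, rather than by a background task or at read completion.

```python
from collections import defaultdict


class ToyChangeBuffer:
    """Toy model (hypothetical, not InnoDB code) of a deferred change buffer merge."""

    def __init__(self, disk_pages):
        # Simulated data file: page id -> list of records.
        self.disk = {page_id: list(rows) for page_id, rows in disk_pages.items()}
        self.pool = {}                     # pages currently cached in the "buffer pool"
        self.buffered = defaultdict(list)  # page id -> inserts waiting to be merged
        self.merges = 0                    # number of merges performed so far

    def insert(self, page_id, row):
        """Apply the insert directly if the page is cached, otherwise buffer it."""
        if page_id in self.pool:
            self.pool[page_id].append(row)
        else:
            self.buffered[page_id].append(row)

    def get_page(self, page_id):
        """Return the page; merge any buffered changes only now, on request."""
        if page_id not in self.pool:
            self.pool[page_id] = list(self.disk[page_id])  # simulated page read
        pending = self.buffered.pop(page_id, None)
        if pending:
            self.pool[page_id].extend(pending)
            self.merges += 1
        return self.pool[page_id]


cb = ToyChangeBuffer({1: ["a"], 2: ["x"]})
cb.insert(1, "b")        # page 1 is not cached, so the insert is buffered
print(cb.merges)         # no merge has happened yet
print(cb.get_page(1))    # the merge happens here, when the page is requested
```

In this model nothing touches page 1 until a caller asks for it, which is the property the task description is after.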

      Attachments

        1. MDEV-19514.ods
          57 kB
        2. MDEV-19514-2.ods
          74 kB
        3. MDEV-19514-3.ods
          63 kB

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Confirmed [ 10101 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Status Confirmed [ 10101 ] In Progress [ 3 ]

            thiru Thirunarayanan Balathandayuthapani added a comment -

            The MDEV-19514 description mentions that we can remove buf_pool->watch and BUF_BLOCK_POOL_WATCH, but I would like to disagree. The buffer pool watch was introduced to avoid a race between a user page read and purge buffering.

            • When a purge thread tries to purge records from a non-unique secondary index leaf page, it tries to
              access the page in the buffer pool with the mode BUF_GET_IF_IN_POOL_OR_WATCH. If the page is not present
              in the buffer pool, purge assigns the page id to one of the watch pages that were created during buffer
              pool initialization.
            • After setting the watch, purge tries to buffer the purge operation. After acquiring the change buffer
              page, the purge thread checks whether the page has been read in the meantime. If it has not, the purge
              thread goes ahead with purge buffering, because a normal page read has to wait for the change buffer
              page in buf_page_io_complete() to merge the buffered changes anyway.
            • Setting the watch also increases the buffer-fix count, which makes sure that the page is not evicted
              from the LRU list.

            So the watch solves the race between page read and purge buffering. As discussed with marko, we don't
            need to remove the watch.
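The sequence described in this comment can be sketched as a toy model. This is an illustration under simplifying assumptions, not InnoDB code: `ToyPool`, `purge_delete` and all method names are hypothetical, there is no real locking, the buffer-fix count is not modeled, and the concurrent user read is simulated by a flag.

```python
class ToyPool:
    """Toy buffer pool with a watch set (hypothetical, not InnoDB code)."""

    def __init__(self):
        self.pages = set()      # page ids present in the buffer pool
        self.watched = set()    # page ids on which purge has set a watch
        self.watch_hit = set()  # watched page ids that were read in the meantime

    def get_if_in_pool_or_watch(self, page_id):
        """Return True if the page is cached; otherwise set a watch and return False."""
        if page_id in self.pages:
            return True
        self.watched.add(page_id)
        return False

    def read_page(self, page_id):
        """Simulate a user thread reading the page into the pool."""
        self.pages.add(page_id)
        if page_id in self.watched:
            self.watch_hit.add(page_id)

    def watch_occurred(self, page_id):
        return page_id in self.watch_hit

    def unset_watch(self, page_id):
        self.watched.discard(page_id)
        self.watch_hit.discard(page_id)


def purge_delete(pool, change_buffer, page_id, record, concurrent_read=False):
    """Try to purge a record; buffer the operation only if no read raced with us."""
    if pool.get_if_in_pool_or_watch(page_id):
        return "applied directly"
    try:
        if concurrent_read:
            pool.read_page(page_id)  # simulated race with a user page read
        # At this point the real code would hold the change buffer page.
        if pool.watch_occurred(page_id):
            return "retry on page"   # the page arrived, so do not buffer
        change_buffer.setdefault(page_id, []).append(("delete", record))
        return "buffered"
    finally:
        pool.unset_watch(page_id)


pool, change_buffer = ToyPool(), {}
print(purge_delete(pool, change_buffer, 7, "r1"))
print(purge_delete(pool, change_buffer, 8, "r2", concurrent_read=True))
```

The point of the sketch is the check after the watch is set: without it, the purge operation could be buffered for a page that a user thread has just brought into the buffer pool.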
            marko Marko Mäkelä added a comment - - edited

            I agree that we probably must keep the buffer pool watch mechanism.

            As part of this work, the InnoDB master thread will no longer perform change buffer merge in the background. Merges will occur when secondary index leaf pages need to be accessed due to executing SQL, purging transaction history, updating index cardinality statistics, or during shutdown with innodb_fast_shutdown=0.

            As part of this work, I think that we should adjust or remove the ability for innodb_force_recovery to prevent change buffer merge. After we remove the merges that would occur at I/O completion, we should have the following situation:

            • innodb_force_recovery=2, which prevents background operations, would disable purge and update of persistent statistics, which could cause reads of secondary index leaf pages, and thus change buffer merge. (Encryption key rotation will no longer cause change buffer merges.) Any remaining merge activity should occur directly due to SQL activity. Tools like mysqldump should not access secondary indexes.
            • innodb_force_recovery=4 becomes redundant and should be treated like innodb_force_recovery=3, which prevents transaction rollback, in addition to disabling the background tasks and ignoring corrupted pages or inaccessible data files.
            • innodb_force_recovery=5 as well as innodb_force_recovery=4 will lose the ability to introduce further corruption. (Currently, they can corrupt secondary index leaf pages.)

            Note: With innodb_force_recovery=5 you could still get an inconsistent logical dump of the data (it is essentially READ UNCOMMITTED), but while reading the database, it would not corrupt the database further.
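The proposed levels can be restated as a small lookup. This is only a summary of the bullets above in Python, under the assumption that each level also includes the restrictions of the lower levels. The function name and the label strings are hypothetical, and levels that this comment does not discuss (such as 6) are deliberately not modeled.

```python
def proposed_restrictions(level):
    """Restate the proposal for innodb_force_recovery levels 1..5 (hypothetical helper)."""
    if not 1 <= level <= 5:
        raise ValueError("only levels 1..5 are covered by this sketch")
    # Per the proposal, level 4 becomes redundant and is treated like level 3.
    effective = 3 if level == 4 else level
    restrictions = set()
    if effective >= 1:
        restrictions.add("ignore corrupted pages or inaccessible data files")
    if effective >= 2:
        restrictions.add("no background tasks (purge, persistent statistics)")
    if effective >= 3:
        restrictions.add("no transaction rollback")
    if effective >= 5:
        restrictions.add("reads are essentially READ UNCOMMITTED")
    return restrictions


for lvl in range(1, 6):
    print(lvl, sorted(proposed_restrictions(lvl)))
```

Note that none of the levels lists "prevent change buffer merge" any more: under the proposal, merges would only happen as a direct consequence of a page access.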

            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Marko Mäkelä [ marko ]
            Status In Progress [ 3 ] In Review [ 10002 ]

            marko Marko Mäkelä added a comment -

            Please fix the hang in the test main.tc_heuristic_recover. I can do the review after that.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Marko Mäkelä [ marko ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            marko Marko Mäkelä made changes -
            Description

            Original Value:

            For MDEV-14481, we must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread.

            This would also simplify MDEV-16526, because the change buffer I/O would not have to be treated as a special case.

            This should also allow us to remove the buf_pool->watch and BUF_BLOCK_POOL_WATCH.

            The change buffer format will not be changed as part of this task. That could remain part of MDEV-11634.

            New Value:

            For MDEV-14481, we must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread.

            This would also simplify MDEV-16526, because the change buffer I/O would not have to be treated as a special case.

            The change buffer format will not be changed as part of this task. That could remain part of MDEV-11634.

            As part of this, the counter innodb_ibuf_merge_usec will be removed from information_schema.innodb_metrics.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            thiru Thirunarayanan Balathandayuthapani made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Marko Mäkelä [ marko ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            marko Marko Mäkelä added a comment -

            I pushed some suggested follow-up changes to the branch. I think that it is good to go after some testing, for both stability and performance.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-19514.ods [ 49026 ]
            axel Axel Schwenke added a comment -

            I did a comparative benchmark of the latest commit in bb-10.5-mdev-19514 vs. the last merge of that branch with 10.5 (labeled "baseline"). As discussed with marko I used OLTP tables with a total size slightly bigger than the buffer pool. I modified the SELECTs to use the secondary index - hence reads will potentially merge buffered changes. Writes are either UPDATEs of the indexed column or DELETE or INSERT.
            In a nutshell: performance is very much the same. If anything, the final commit has a slight advantage over the baseline. Out of curiosity I also did a run of the baseline with change buffering disabled. It clearly shows that the change buffer has a positive impact on performance.
            Details are in attachment MDEV-19514.ods


            marko Marko Mäkelä added a comment -

            axel, thank you! I see a trend of a slight improvement with the change buffering enabled. That could be because the change buffer merges no longer occur in the background, preserving I/O and CPU capacity for serving the immediate tasks.
            marko Marko Mäkelä made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Matthias Leich [ mleich ]
            marko Marko Mäkelä added a comment - - edited

            I have one more change for consideration: removing the ability of the purge of history to submit work to the change buffer. This also removes the buffer pool watch mechanism. This was motivated by my observation in MDEV-11634 that the change buffering is never used on transaction rollback. I believe that we can improve the performance of purge in a more controlled fashion by MDEV-16260.

            I ported the change to 10.2 as a fix of MDEV-19344.

            axel Axel Schwenke made changes -
            Attachment MDEV-19514-2.ods [ 49084 ]
            Attachment MDEV-19514-3.ods [ 49085 ]
            axel Axel Schwenke added a comment -

            I did two more rounds of benchmarking.

            1. Commit 6203deb02fd ("Stop buffering delete (purge) operations") vs. the previous state of bb-10.5-MDEV-19514. Results are in the attached file MDEV-19514-2.ods. On average, the change has a positive impact on performance.

            2. Commit 6203deb02fd with different settings of innodb_change_buffering. Results are in the attached file MDEV-19514-3.ods. It turns out that the default of "all" gives the best performance. Specifically, with "inserts" (INSERT-only buffering) the performance suffers.

            axel Axel Schwenke made changes -
            Attachment MDEV-19514-2.ods [ 49084 ]
            axel Axel Schwenke made changes -
            Attachment MDEV-19514-2.ods [ 49143 ]
            axel Axel Schwenke added a comment -

            I updated MDEV-19514-2.ods with two more sheets. While I ran the first benchmarks with a buffer pool size of 32G I now also did runs with 20G and 40G buffer pool respectively. I did this after seeing the results of the same change for 10.2 in MDEV-19344 in order to verify if 10.5 really behaves that much better.

            It turns out that with 20G buffer pool, 10.5 suffers the same performance drop as 10.2 when purge operations are not buffered.


            marko Marko Mäkelä added a comment -

            Based on the benchmark results, we will keep the purge buffering. If there had been no regression, we would have done it in MDEV-19344.

            mleich Matthias Leich added a comment -

            RQG testing on the tree 10.5-MDEV-19514, including the patch for MDEV-20805: the tree showed only open known bugs.
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2019-10-11 14:57:48.0 2019-10-11 14:57:48.628
            marko Marko Mäkelä made changes -
            Fix Version/s 10.5.0 [ 23709 ]
            Fix Version/s 10.5 [ 23123 ]
            Assignee Matthias Leich [ mleich ] Marko Mäkelä [ marko ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]

            marko Marko Mäkelä added a comment -

            MariaDB 10.5.7 will include a follow-up fix: The page read completion callback function used to invoke a function that could request the change buffer bitmap page from the buffer pool. Allocating pages from the buffer pool on read completion is a bad idea and could potentially lead to hangs.
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 96805 ] MariaDB v4 [ 133957 ]

            mdcallag Mark Callaghan added a comment -

            Sometimes the change buffer helps a lot. One result is from the insert benchmark and the insert rate is more than 3X larger when the change buffer is enabled.

            marko Marko Mäkelä added a comment -

            mdcallag, thank you for your comment. I see that you ran your benchmark on MySQL 8.0.32 and not a version of MariaDB that would include MDEV-24621. That would help when loading data into an initially empty table.

            mdcallag Mark Callaghan added a comment -

            It isn't possible to show the perf impact of having the change buffer enabled and then disabled for a workload by using MariaDB 11, assuming that MariaDB 11 no longer supports the change buffer.
            marko Marko Mäkelä added a comment - - edited

            mdcallag, there are 7 major versions of MariaDB Server (10.5 through 10.11) where not much has been changed with regard to the change buffer. It was disabled by default in MDEV-27734 (10.5), deprecated in MDEV-27735 (10.9), and removed 3 major releases later (MDEV-29694). Apart from MDEV-30009 and other data corruption bugs covered in my FOSDEM 2023 talk, we have at least MDEV-30134 that I will have to analyze and fix.


            mdcallag Mark Callaghan added a comment -

            My workload (3 secondary indexes, uniform random access, database about 8X larger than memory) is closer to a worst case for showing how bad things can get without the change buffer. Your workload (1 secondary index, database not much larger than memory) is closer to a best case. If your workload cached all, or most, secondary index leaf pages then you will obviously not see a benefit from the change buffer.
            mihaQ MikaH added a comment -

            Thank you, Mark Callaghan, for sharing your experiences related to InnoDB change buffering. I am playing with dataset sizes 4-8x larger than the available RAM. We will stay on software level 10.5.6 until MariaDB and Codership are able to publish newer software (including wsrep) that can beat the performance and stability of 10.5.6, and we have verified it in our own performance and stability tests. I will share our results, but it takes time.


            marko Marko Mäkelä added a comment -

            mihaQ, I assume that you experienced performance regressions related to some page flushing changes in MariaDB 10.5.7. They should be mostly addressed in later releases of the 10.5 series. In the 10.6 series, as you can read in MDEV-30628, there is a regression that I am currently working on. Preliminary results related to MDEV-26055 and MDEV-26827 are very promising. Once that is tackled, I will move on to MDEV-29401.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä