[MDEV-19514] Defer change buffer merge until pages are requested Created: 2019-05-17  Updated: 2023-10-19  Resolved: 2019-10-11

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.5.0

Type: Task Priority: Critical
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance

Attachments: File MDEV-19514-2.ods     File MDEV-19514-3.ods     File MDEV-19514.ods    
Issue Links:
Blocks
blocks MDEV-11634 Improve the InnoDB change buffer Closed
blocks MDEV-14481 Execute InnoDB crash recovery in the ... Closed
blocks MDEV-16526 Overhaul the InnoDB page flushing Closed
blocks MDEV-18724 Replace buf_block_t::mutex with more ... Closed
is blocked by MDEV-20543 Latching order violation during B-tre... Closed
is blocked by MDEV-20805 ibuf_add_free_page() is not initializ... Closed
PartOf
is part of MDEV-12700 Allow innodb_read_only startup withou... Closed
Problem/Incident
causes MDEV-22090 Change buffer is not freed after drop... Closed
Relates
relates to MDEV-16989 InnoDB hang on crash recovery: Waited... Closed
relates to MDEV-21030 MariaDB keepts crashing on start Closed
relates to MDEV-23380 InnoDB reads a page from disk despite... Closed
relates to MDEV-23973 Change buffer corruption when realloc... Closed
relates to MDEV-25783 CHECK TABLE harvests InnoDB: Index 'a... Closed
relates to MDEV-26464 InnoDB: Failing assertion: UT_LIST_GE... Closed
relates to MDEV-27199 Require ib_logfile0 to exist unless i... Closed
relates to MDEV-27734 Set innodb_change_buffering=none by d... Closed
relates to MDEV-18698 Show InnoDB's internal background thr... Open
relates to MDEV-27153 ibdata1 file Leaking ? (errno 135) - ... Closed
relates to MDEV-31621 Remove ibuf_read_merge_pages() call f... Closed

 Description   

For MDEV-14481, we must defer the change buffer merge to the moment when the secondary index leaf page is requested by a user thread.

This would also simplify MDEV-16526, because the change buffer I/O would not have to be treated as a special case.

The change buffer format will not be changed as part of this task. That could remain part of MDEV-11634.

As part of this, the counter innodb_ibuf_merge_usec will be removed from information_schema.innodb_metrics.



 Comments   
Comment by Thirunarayanan Balathandayuthapani [ 2019-06-03 ]

The MDEV-19514 description mentions that we can remove buf_pool->watch and BUF_POOL_BLOCK_WATCH, but I
disagree, because the buffer pool watch was introduced to avoid the race between a user page read and purge
buffering.

  • When a purge thread tries to purge records from a non-unique secondary index leaf page, it tries to
    access the page in the buffer pool with mode BUF_GET_IF_IN_POOL_OR_WATCH. If the page is not present in
    the buffer pool, purge assigns the page id to one of the watch pages that were created during buffer
    pool initialization.
  • After setting the watch, purge tries to buffer the purge operation. After acquiring the change buffer
    page, the purge thread checks whether the page has been read in the meantime. If it has not been read,
    the purge thread goes ahead with purge buffering, because a normal page read has to wait for the change
    buffer page in buf_page_io_complete() to merge the buffered changes anyway.
  • Setting the watch also increases the buffer-fix count, which ensures that the page is not evicted
    from the LRU list.
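
The three steps above can be sketched as a toy model. This is not InnoDB code: ToyBufferPool, purge_record and the race hook are hypothetical names invented for illustration, and latching is reduced to a comment. It only shows the order of operations: set the watch, check whether the page was read in the meantime, then either buffer the purge or fall back to the page itself.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBufferPool:
    """Toy stand-in for the buffer pool; every name here is hypothetical."""
    resident: set = field(default_factory=set)   # page ids currently cached
    watches: dict = field(default_factory=dict)  # page id -> fix count of its watch

    def get_if_in_pool_or_watch(self, page_id):
        """Return True if the page is cached; otherwise set a watch and return False."""
        if page_id in self.resident:
            return True
        self.watches[page_id] = self.watches.get(page_id, 0) + 1
        return False

    def read_page(self, page_id):
        """Model a user thread reading the page into the pool."""
        self.resident.add(page_id)

    def watch_occurred(self, page_id):
        """Has the watched page been read in since the watch was set?"""
        return page_id in self.resident

    def unset_watch(self, page_id):
        """Drop one fix on the watch and free it when nobody holds it."""
        self.watches[page_id] -= 1
        if self.watches[page_id] == 0:
            del self.watches[page_id]


def purge_record(pool, change_buffer, page_id, record, race=None):
    """Purge one record, buffering the operation if the leaf page is not cached."""
    if pool.get_if_in_pool_or_watch(page_id):
        return "purged in page"
    try:
        if race is not None:
            race()  # test hook: simulates a concurrent page read
        # In the real server the change buffer page would be latched here.
        if pool.watch_occurred(page_id):
            return "retry in page"
        change_buffer.setdefault(page_id, []).append(("purge", record))
        return "buffered"
    finally:
        pool.unset_watch(page_id)
```

Calling purge_record(pool, cb, 9, "r", race=lambda: pool.read_page(9)) exercises the race: the page arrives after the watch is set, so the operation is not buffered.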

Thus, the watch solves the race between page read and purge buffering. As discussed with marko, we don't
need to remove the watch.

Comment by Marko Mäkelä [ 2019-07-03 ]

I agree that we probably must keep the buffer pool watch mechanism.

As part of this work, the InnoDB master thread will no longer perform change buffer merge in the background. Merges will occur when secondary index leaf pages need to be accessed due to executing SQL, purging transaction history, updating index cardinality statistics, or during shutdown with innodb_fast_shutdown=0.
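
A minimal sketch of the deferred merge, assuming a toy index in which each leaf page is a set of keys. ToyIndex and its methods are hypothetical and do not correspond to InnoDB functions. The point is that there is no background merge step: buffered changes are applied only inside get_page(), that is, when some caller actually requests the page.

```python
class ToyIndex:
    """Toy model: leaf pages on 'disk' plus buffered changes per page."""

    def __init__(self, pages):
        self.disk = {pid: set(keys) for pid, keys in pages.items()}
        self.cache = {}          # page id -> set of keys (the "buffer pool")
        self.change_buffer = {}  # page id -> list of (op, key)
        self.merges = 0          # number of pages that needed a merge

    @staticmethod
    def _apply(page, op, key):
        if op == "insert":
            page.add(key)
        elif op == "delete":
            page.discard(key)
        else:
            raise ValueError(op)

    def change(self, page_id, op, key):
        """Apply directly if the page is cached; otherwise buffer the change."""
        if page_id in self.cache:
            self._apply(self.cache[page_id], op, key)
        else:
            self.change_buffer.setdefault(page_id, []).append((op, key))

    def get_page(self, page_id):
        """Read the page on demand; this is the only place where merges happen."""
        if page_id not in self.cache:
            page = set(self.disk[page_id])
            pending = self.change_buffer.pop(page_id, [])
            for op, key in pending:
                self._apply(page, op, key)
            if pending:
                self.merges += 1
            self.cache[page_id] = page
        return self.cache[page_id]
```

In this model, a page that is never requested keeps its buffered changes indefinitely, which is why the comment lists explicit triggers such as purge, statistics updates, and slow shutdown.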

As part of this work, I think that we should adjust or remove the ability for innodb_force_recovery to prevent change buffer merge. After we remove the merges that would occur at I/O completion, we should have the following situation:

  • innodb_force_recovery=2, which prevents background operations, would disable purge and update of persistent statistics, which could cause reads of secondary index leaf pages, and thus change buffer merge. (Encryption key rotation will no longer cause change buffer merges.) Any remaining merge activity should occur directly due to SQL activity. Tools like mysqldump should not access secondary indexes.
  • innodb_force_recovery=4 becomes redundant and should be treated like innodb_force_recovery=3, which prevents transaction rollback, in addition to disabling the background tasks and ignoring corrupted pages or inaccessible data files.
  • innodb_force_recovery=5 as well as innodb_force_recovery=4 will lose the ability to introduce further corruption. (Currently, they can corrupt secondary index leaf pages.)

Note: With innodb_force_recovery=5 you could still get an inconsistent logical dump of the data (it is essentially READ UNCOMMITTED), but while reading the database, it would not corrupt the database further.

Comment by Marko Mäkelä [ 2019-09-03 ]

Please fix the hang in the test main.tc_heuristic_recover. I can do the review after that.

Comment by Marko Mäkelä [ 2019-09-26 ]

I pushed some suggested follow-up changes to the branch. I think that it is good to go after some testing, for both stability and performance.

Comment by Axel Schwenke [ 2019-09-27 ]

I did a comparative benchmark of the latest commit in bb-10.5-mdev-19514 vs. the last merge of that branch with 10.5 (labeled "baseline"). As discussed with marko I used OLTP tables with a total size slightly bigger than the buffer pool. I modified the SELECTs to use the secondary index - hence reads will potentially merge buffered changes. Writes are either UPDATEs of the indexed column or DELETE or INSERT.
In a nutshell: performance is very much the same. If anything, there is a slight advantage of the final commit over the baseline. Out of curiosity I also did a run of the baseline with change buffering disabled. It shows clearly that the change buffer has a positive impact on performance.
Details are in attachment MDEV-19514.ods

Comment by Marko Mäkelä [ 2019-09-30 ]

axel, thank you! I see a trend of a slight improvement with the change buffering enabled. That could be because the change buffer merges no longer occur in the background, preserving I/O and CPU capacity for serving the immediate tasks.

Comment by Marko Mäkelä [ 2019-10-03 ]

I have one more change for consideration: removing the ability of the purge of history to submit work to the change buffer. This also removes the buffer pool watch mechanism. This was motivated by my observation in MDEV-11634 that the change buffering is never used on transaction rollback. I believe that we can improve the performance of purge in a more controlled fashion by MDEV-16260.

I ported the change to 10.2 as a fix of MDEV-19344.

Comment by Axel Schwenke [ 2019-10-04 ]

I did two more rounds of benchmarking.

1. commit 6203deb02fd Stop buffering delete (purge) operations vs. previous state of bb-10.5-MDEV-19514. Results are in attached file MDEV-19514-2.ods. The change has on average a positive impact on performance.

2. commit 6203deb02fd with different settings of innodb_change_buffering. Results are in attached file MDEV-19514-3.ods. It turns out that the default of "all" gives the best performance. Specifically, with "inserts" for INSERT-only buffering the performance suffers.

Comment by Axel Schwenke [ 2019-10-08 ]

I updated MDEV-19514-2.ods with two more sheets. While I ran the first benchmarks with a buffer pool size of 32G I now also did runs with 20G and 40G buffer pool respectively. I did this after seeing the results of the same change for 10.2 in MDEV-19344 in order to verify if 10.5 really behaves that much better.

It turns out that with 20G buffer pool, 10.5 suffers the same performance drop as 10.2 when purge operations are not buffered.

Comment by Marko Mäkelä [ 2019-10-11 ]

Based on the benchmark results, we will keep the purge buffering. If there had been no regression, we would have done it in MDEV-19344.

Comment by Matthias Leich [ 2019-10-11 ]

    RQG testing on the tree 10.5-MDEV-19514 including the patch for
    MDEV-20805
    The tree showed only open known bugs. 

Comment by Marko Mäkelä [ 2020-10-15 ]

MariaDB 10.5.7 will include a follow-up fix: The page read completion callback function used to invoke a function that could request the change buffer bitmap page from the buffer pool. Allocating pages from the buffer pool on read completion is a bad idea and could potentially lead to hangs.

Comment by Mark Callaghan [ 2023-02-08 ]

Sometimes the change buffer helps a lot. One result is from the insert benchmark and the insert rate is more than 3X larger when the change buffer is enabled.

Comment by Marko Mäkelä [ 2023-02-08 ]

mdcallag, thank you for your comment. I see that you ran your benchmark on MySQL 8.0.32 and not a version of MariaDB that would include MDEV-24621. That would help when loading data into an initially empty table.

Comment by Mark Callaghan [ 2023-02-08 ]

It isn't possible to show the perf impact of having the change buffer enabled and then disabled for a workload by using MariaDB 11, assuming that MariaDB 11 no longer supports the change buffer.

Comment by Marko Mäkelä [ 2023-02-09 ]

mdcallag, there are 7 major versions of MariaDB Server (10.5 through 10.11) where not much has been changed with regard to the change buffer. It was disabled by default in MDEV-27734 (10.5), deprecated in MDEV-27735 (10.9), and removed 3 major releases later (MDEV-29694). Apart from MDEV-30009 and other data corruption bugs covered in my FOSDEM 2023 talk, we have at least MDEV-30134 that I will have to analyze and fix.

Comment by Mark Callaghan [ 2023-02-09 ]

My workload (3 secondary indexes, uniform random access, database about 8X larger than memory) is closer to a worst case for showing how bad things can get without the change buffer. Your workload (1 secondary index, database not much larger than memory) is closer to a best case. If your workload cached all, or most, secondary index leaf pages then you will obviously not see a benefit from the change buffer.

Comment by MikaH [ 2023-03-08 ]

Thank you Mark Callaghan for sharing your experiences related to InnoDB change buffering. I am playing with dataset sizes 4-8x larger than available RAM. We stay on SW-level 10.5.6 until MariaDB & Codership are able to publish newer software (including wsrep) that can beat the performance and stability of 10.5.6, and we have verified it in our own performance and stability tests. I will share our results but it takes time.

Comment by Marko Mäkelä [ 2023-03-09 ]

mihaQ, I assume that you experienced performance regressions related to some page flushing changes in MariaDB 10.5.7. They should be mostly addressed in later releases of the 10.5 series. In the 10.6 series, as you can read in MDEV-30628, there is a regression that I am currently working on. Preliminary results related to MDEV-26055 and MDEV-26827 are very promising. Once that is tackled, I will move on to MDEV-29401.

Generated at Thu Feb 08 08:52:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.