[MDEV-16526] Overhaul the InnoDB page flushing Created: 2018-06-19 Updated: 2023-08-15 Resolved: 2023-08-15 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 11.1.1, 10.11.3, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.4 |
| Type: | Task | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | flush, performance | ||
| Issue Links: |
|
| Description |
|
The writing of modified InnoDB data pages to data files should be overhauled. See some of the comments in MDEV-15058. |
| Comments |
| Comment by Marko Mäkelä [ 2019-02-04 ] |
|
Page flushing performance should be improved by removing the unsorted buf_pool->flush_list, and always sorting the writes in the same way as buf_pool->flush_rbt does. In this way, there could be more progress from the page writes. This should also allow crash recovery to proceed in the background ( |
| Comment by Marko Mäkelä [ 2019-03-20 ] |
|
We should look at implementing direct I/O. Where applicable, perhaps we could also replace `fsync()` with asynchronous `IOCB_CMD_FDSYNC` operations. |
| Comment by Marko Mäkelä [ 2019-03-20 ] |
|
Craig Ringer on the PostgreSQL mailing list shared a link to his fsync() test program, which suggests fault injection by dmsetup. I think that we should perform this kind of fault injection testing for InnoDB. |
| Comment by Marko Mäkelä [ 2019-03-20 ] |
|
The FOSDEM 2019 talk on PostgreSQL vs. fsync is worth watching. |
| Comment by Marko Mäkelä [ 2019-04-12 ] |
|
In the function buf_flush_write_block_low(), there is a call to log_write_up_to(bpage->newest_modification, true), which seems out of place. Yes, we must ensure that we do not write out any page before the corresponding redo log has been written (write-ahead logging). But maybe we could just move on to the next block in the flush_list whose newest modification is old enough to be written out. Only once we have run out of such pages would it make sense to wait for a flush of the redo log. In this way, perhaps the page flushing could be managed by fewer threads. |
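
The idea above can be illustrated with a minimal sketch. It is not InnoDB code: `DirtyPage`, `FlushBatch` and `pick_flush_batch()` are hypothetical names, and the flush list is modelled as a plain vector. Pages whose newest modification is already covered by the durable redo log are selected for writing; a log flush target is reported only when no page could be selected.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical, simplified stand-in for a dirty page descriptor.
struct DirtyPage {
  std::uint32_t page_no;
  lsn_t newest_modification;  // LSN of the latest change to the page
};

struct FlushBatch {
  std::vector<std::uint32_t> writable;  // pages that are safe to write now
  lsn_t log_flush_target = 0;           // 0 means that no log flush is needed
};

// Select up to max_pages pages whose changes are already covered by the
// durable redo log (newest_modification <= flushed_lsn). Pages that are
// "too new" are skipped instead of forcing a log write for each of them.
FlushBatch pick_flush_batch(const std::vector<DirtyPage>& flush_list,
                            lsn_t flushed_lsn, std::size_t max_pages) {
  assert(max_pages > 0);
  FlushBatch batch;
  lsn_t min_blocked = 0;  // smallest LSN among the skipped pages
  for (const DirtyPage& p : flush_list) {
    if (batch.writable.size() == max_pages) break;
    if (p.newest_modification <= flushed_lsn) {
      batch.writable.push_back(p.page_no);
    } else if (min_blocked == 0 || p.newest_modification < min_blocked) {
      min_blocked = p.newest_modification;
    }
  }
  // Only when nothing at all could be written do we ask for a log flush,
  // up to the smallest LSN that would unblock at least one page.
  if (batch.writable.empty()) batch.log_flush_target = min_blocked;
  return batch;
}
```

A caller would write out `writable` and wait for the redo log only when `log_flush_target` is nonzero.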
| Comment by Marko Mäkelä [ 2019-04-29 ] |
|
As noted in |
| Comment by Marko Mäkelä [ 2019-05-02 ] |
|
As pointed out in MySQL Bug #94912, writes to O_DIRECT files do not imply fdatasync(). We must keep this in mind. Perhaps all persistent InnoDB files (data and redo log files) should be written to in the same way. |
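
To make the point concrete, here is a minimal Linux-oriented sketch of a page write that uses `O_DIRECT` and still calls `fdatasync()` explicitly. The function name `write_page_durably()`, the 4096-byte page size and the fallback to buffered I/O are assumptions made for the example, not InnoDB behaviour.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <fcntl.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0  // assumption: on platforms without O_DIRECT, use buffered I/O
#endif

// Assumed page size and buffer alignment for this sketch.
constexpr std::size_t kPageSize = 4096;

// Writes one page at offset 0 of the file and makes it durable.
// Returns true on success.
bool write_page_durably(const char* path, const void* page) {
  assert(path != nullptr && page != nullptr);
  int fd = ::open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0 && errno == EINVAL) {
    // Some file systems reject O_DIRECT; fall back to buffered I/O.
    fd = ::open(path, O_WRONLY | O_CREAT, 0644);
  }
  if (fd < 0) return false;

  // O_DIRECT requires a suitably aligned buffer, so copy the page into one.
  void* buf = nullptr;
  if (::posix_memalign(&buf, kPageSize, kPageSize) != 0) {
    ::close(fd);
    return false;
  }
  std::memcpy(buf, page, kPageSize);

  bool ok = ::pwrite(fd, buf, kPageSize, 0) == static_cast<ssize_t>(kPageSize);
  // O_DIRECT bypasses the page cache, but it does not imply fdatasync():
  // the explicit call is still needed before the write counts as durable.
  ok = ok && ::fdatasync(fd) == 0;
  std::free(buf);
  ok = (::close(fd) == 0) && ok;
  return ok;
}
```

The same helper could in principle serve both data files and redo log files, which is the kind of uniform treatment suggested above.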
| Comment by Anjum Naveed [ 2019-05-14 ] |
|
I would love to contribute to this. Please let me know how I can be of help. |
| Comment by Marko Mäkelä [ 2019-05-17 ] |
|
anjumnaveed81, we are in the early planning stage, and the plan is to completely overhaul the InnoDB buffer pool I/O interface in 10.5. |
| Comment by zhai weixiang [ 2019-10-30 ] |
|
It would be great if we could flush a page without acquiring an S lock on the block. For example, we could copy out the page before the I/O. |
| Comment by Marko Mäkelä [ 2019-11-15 ] |
|
zhaiwx1987, such ‘shadow page’ flushing is indeed being used by other databases. MariaDB already uses separate write buffers for page_compressed and encrypted tables, but probably unnecessarily keeps holding the page latch. It is definitely worth trying, and could make log checkpoint faster (even after we have addressed wlad, as noted in |
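
A minimal sketch of the "shadow page" idea, assuming C++17 and using `std::shared_mutex` as a stand-in for the page latch. `Block`, `PageImage` and `flush_via_shadow()` are hypothetical names, and the sketch ignores everything else a real flush must do (write-ahead logging checks, checksums, marking the page clean only if it was not modified again).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <shared_mutex>

constexpr std::size_t kPageSize = 4096;  // assumed page size
using PageImage = std::array<unsigned char, kPageSize>;

// Hypothetical, simplified buffer block: a latch plus the page frame.
struct Block {
  std::shared_mutex latch;
  PageImage frame{};
};

// Copy the page while holding the latch in shared mode, release the latch,
// and only then hand the private copy to the (possibly slow) write routine.
void flush_via_shadow(Block& block,
                      const std::function<void(const PageImage&)>& write_fn) {
  assert(static_cast<bool>(write_fn));
  PageImage shadow;
  {
    std::shared_lock<std::shared_mutex> s(block.latch);
    shadow = block.frame;
  }
  // The latch is no longer held here, so writers can modify the frame
  // while the I/O on the shadow copy is in progress.
  write_fn(shadow);
}
```

The trade-off is one extra page copy per write in exchange for a much shorter latch hold time.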
| Comment by Marko Mäkelä [ 2019-11-15 ] |
|
The goal of this task is to ensure that a single buffer pool instance will not perform worse in a write-heavy benchmark than multiple buffer pools. This is a potential concern with When running the benchmark, you should disable the InnoDB doublewrite buffer, because that is an obvious bottleneck. |
| Comment by Marko Mäkelä [ 2019-11-15 ] |
|
I think that as part of this task, we should try to simplify buf_flush_write_block_low(). Can we avoid acquiring log_sys.mutex? Instead of calling log_write_up_to() in that function, perhaps the page flush can simply skip those pages whose buf_page_t::newest_modification (which by the way should actually be removed, and replaced with direct access to FIL_PAGE_LSN) is newer than what has been persisted in redo log (log_sys.write_lsn)? Then, between flush batches, trigger the log flush (if it cannot be done by the dedicated log writer task; see Furthermore, I think that we should try to remove buf_pool_t::flush_rbt (whose purpose is to ensure during redo log apply that buf_pool_t::flush_list is kept in a more reasonable order) and try to always keep the flush_list sorted in a way that tries to ensure optimal progress. Maybe use std::priority_queue? We should remove the BUF_FLUSH_SINGLE_PAGE, but I think that we may have to keep both LRU and flush_list flushing. I may be mistaken, but based on |
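
As a rough illustration of the std::priority_queue suggestion, the sketch below keeps dirty pages ordered so that the page with the smallest oldest-modification LSN comes out first. The ordering key is an assumption (the comment does not say what "optimal progress" should sort by), and `DirtyPage`, `OlderFirst` and `next_batch()` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

using lsn_t = std::uint64_t;

// Hypothetical, simplified dirty page descriptor.
struct DirtyPage {
  lsn_t oldest_modification;  // LSN of the first unflushed change
  std::uint32_t page_no;
};

// std::priority_queue is a max-heap, so the comparison is inverted to get
// the smallest LSN on top; ties are broken by page number.
struct OlderFirst {
  bool operator()(const DirtyPage& a, const DirtyPage& b) const {
    if (a.oldest_modification != b.oldest_modification)
      return a.oldest_modification > b.oldest_modification;
    return a.page_no > b.page_no;
  }
};

using FlushQueue =
    std::priority_queue<DirtyPage, std::vector<DirtyPage>, OlderFirst>;

// Remove and return up to n page numbers in flush order.
std::vector<std::uint32_t> next_batch(FlushQueue& q, std::size_t n) {
  std::vector<std::uint32_t> batch;
  while (!q.empty() && batch.size() < n) {
    batch.push_back(q.top().page_no);
    q.pop();
  }
  assert(batch.size() <= n);
  return batch;
}
```

One limitation to keep in mind: std::priority_queue cannot remove an arbitrary element, so a page that gets written out by LRU flushing could not be dropped from such a queue directly. A real implementation might need lazy deletion or a different container.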
| Comment by Marko Mäkelä [ 2019-11-28 ] |
|
As part of this task, I think that the following threads that
|
| Comment by Marko Mäkelä [ 2019-12-11 ] |
|
Mark Callaghan’s blog (re)post Historical - InnoDB IO Performance provides some historical overview that could be relevant for this task. |
| Comment by Marko Mäkelä [ 2020-02-28 ] |
|
It appears that XtraDB used to disable the single-page flushing when using a larger buffer pool, via the Boolean-disguised-as-enum parameter innodb_empty_free_list_algorithm=BACKOFF. The logic would sometimes fail, as reported in I think that we should consider removing the single-page flushing nevertheless. Perhaps there could be active signaling between the page cleaner and the buf_LRU_get_free_block(), instead of passive sleeping? |
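
The "active signaling" idea could look roughly like the following sketch, which uses two condition variables instead of sleeping and polling. `FreeList` and its member functions are hypothetical and only model a counter of free blocks; they are not the interface of buf_LRU_get_free_block() or of the page cleaner.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical model of a free list shared by user threads and a page cleaner.
class FreeList {
 public:
  // Called by a thread that needs a free block. If none is available, it
  // wakes the page cleaner and waits to be signalled, up to the timeout.
  bool acquire(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(m_);
    if (free_ == 0) {
      need_flush_ = true;
      cleaner_cv_.notify_one();
      if (!free_cv_.wait_for(lk, timeout, [this] { return free_ > 0; }))
        return false;
    }
    --free_;
    return true;
  }

  // Called by the page cleaner after it has freed n blocks.
  void release(std::size_t n) {
    assert(n > 0);
    {
      std::lock_guard<std::mutex> lk(m_);
      free_ += n;
      need_flush_ = false;
    }
    free_cv_.notify_all();
  }

  // Called by the page cleaner: wait until some thread has asked for blocks.
  bool wait_for_request(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(m_);
    return cleaner_cv_.wait_for(lk, timeout, [this] { return need_flush_; });
  }

 private:
  std::mutex m_;
  std::condition_variable free_cv_;     // signalled when blocks become free
  std::condition_variable cleaner_cv_;  // signalled when blocks are needed
  std::size_t free_ = 0;
  bool need_flush_ = false;
};
```

With this shape, a waiting thread is woken as soon as the cleaner frees a block, rather than after a fixed sleep interval.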
| Comment by Marko Mäkelä [ 2020-10-28 ] |
|
Now that most of the work has been completed in Because this is mostly an umbrella task, having an estimate is not too meaningful. |
| Comment by Vladislav Vaintroub [ 2021-01-26 ] |
|
I adjusted the priority to "Major", even though it is probably "Minor" for the remaining items. |
| Comment by Marko Mäkelä [ 2021-10-26 ] |
|
I think that most of this has already been done. One remaining tweak is |
| Comment by Marko Mäkelä [ 2023-08-15 ] |
|
After |