[MDEV-16526] Overhaul the InnoDB page flushing Created: 2018-06-19  Updated: 2023-08-15  Resolved: 2023-08-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 11.1.1, 10.11.3, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Task Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: flush, performance

Issue Links:
Blocks
is blocked by MDEV-15053 Reduce buf_pool_t::mutex contention Closed
is blocked by MDEV-18115 Remove dummy tablespace for the redo log Closed
is blocked by MDEV-19514 Defer change buffer merge until pages... Closed
is blocked by MDEV-23399 10.5 performance regression with IO-b... Closed
is blocked by MDEV-26827 Make page flushing even faster Closed
PartOf
includes MDEV-12227 Defer writes to the InnoDB temporary ... Closed
includes MDEV-23754 Replace buf_pool.flush_list with a pr... Closed
includes MDEV-23756 Implement event-driven innodb_adaptiv... Open
Relates
relates to MDEV-11378 AliSQL: [Perf] Issue#23 MERGE INNODB ... Open
relates to MDEV-11384 AliSQL: [Feature] Issue#19 BUFFER POO... Closed
relates to MDEV-15058 Remove multiple InnoDB buffer pool in... Closed
relates to MDEV-16339 Upgrading to 10.1.32 shows innodb_emp... Closed
relates to MDEV-17481 mariadb service won't shutdown when i... Closed
relates to MDEV-19356 Assertion 'space->free_limit == 0 || ... Closed
relates to MDEV-21132 Remove buf_page_t::newest_modification Closed
relates to MDEV-24854 Change innodb_flush_method=O_DIRECT b... Closed
relates to MDEV-11916 Page compression - use smaller writes... Open
relates to MDEV-12226 Avoid writes of freed (garbage) pages... Closed
relates to MDEV-13670 [Note] InnoDB: page_cleaner: 1000ms i... Closed
relates to MDEV-14425 Change the InnoDB redo log format to ... Closed
relates to MDEV-14550 Error log flood : "InnoDB: page_clea... Closed
relates to MDEV-15528 Avoid writing freed InnoDB pages Closed

 Description   

The writing of modified InnoDB data pages to data files should be overhauled. See some of the comments in MDEV-15058.



 Comments   
Comment by Marko Mäkelä [ 2019-02-04 ]

Page flushing performance should be improved by removing the unsorted buf_pool->flush_list, and always sorting the writes in the same way as buf_pool->flush_rbt does. In this way, there could be more progress from the page writes. This should also allow crash recovery to proceed in the background (MDEV-14481) while the server is already accepting connections.

Comment by Marko Mäkelä [ 2019-03-20 ]

We should look at implementing direct I/O. Where applicable, perhaps we could also replace `fsync()` with asynchronous `IOCB_CMD_FDSYNC` operations.

Comment by Marko Mäkelä [ 2019-03-20 ]

Craig Ringer on the PostgreSQL mailing list shared a link to his fsync() test program, which suggests fault injection by dmsetup.

I think that we should perform this kind of fault injection testing for InnoDB.

Comment by Marko Mäkelä [ 2019-03-20 ]

The FOSDEM 2019 talk on PostgreSQL vs. fsync is worth watching.

Comment by Marko Mäkelä [ 2019-04-12 ]

In the function buf_flush_write_block_low(), there is a call log_write_up_to(bpage->newest_modification, true), which seems out of the place. Yes, we must ensure that we will not write out any pages before writing the corresponding redo log (write-ahead logging). But, maybe we could just move to the next block in the flush_list whose newest modification is old enough to be written out. Only if we have run out of all such pages, it would seem to make sense to wait for a flush of the redo log. In this way, perhaps the page flushing could be managed by fewer threads.

Comment by Marko Mäkelä [ 2019-04-29 ]

As noted in MDEV-19356, it might be worthwhile for fil_system.LRU to prefer closing those files for which there are no pending changes in the buffer pool.

Comment by Marko Mäkelä [ 2019-05-02 ]

As pointed out in MySQL Bug #94912, writes to O_DIRECT files do not imply fdatasync(). We must keep this in mind.

Perhaps all persistent InnoDB files (data and redo log files) should be written to in the same way.

Comment by Anjum Naveed [ 2019-05-14 ]

I will love to contribute with this. Please let me know how can I be of help.

Comment by Marko Mäkelä [ 2019-05-17 ]

anjumnaveed81, we are in the early planning stage, and the plan is to completely overhaul the InnoDB buffer pool I/O interface in 10.5.
If you have some design ideas, feel free to write them here.

Comment by zhai weixiang [ 2019-10-30 ]

It would be great if we can flush page without acquiring S lock on block. For example we can copy out the page before IO

Comment by Marko Mäkelä [ 2019-11-15 ]

zhaiwx1987, such ‘shadow page’ flushing is indeed being used by other databases. MariaDB already uses separate write buffers for page_compressed and encrypted tables, but probably unnecessarily keeps holding the page latch. It is definitely worth trying, and could make log checkpoint faster (even after we have addressed MDEV-14462).

wlad, as noted in MDEV-16264, I think that SRV_MAX_N_IO_THREADS and any related code and variables should be removed.

Comment by Marko Mäkelä [ 2019-11-15 ]

The goal of this task is to ensure that a single buffer pool instance will not perform worse in a write-heavy benchmark than multiple buffer pools. This is a potential concern with MDEV-15058, and will be hopefully helped by MDEV-15053 already.

When running the benchmark, you should disable the InnoDB doublewrite buffer, because that is an obvious bottleneck.

Comment by Marko Mäkelä [ 2019-11-15 ]

I think that as part of this task, we should try to simplify buf_flush_write_block_low(). Can we avoid acquiring log_sys.mutex? Instead of calling log_write_up_to() in that function, perhaps the page flush can simply skip those pages whose buf_page_t::newest_modification (which by the way should actually be removed, and replaced with direct access to FIL_PAGE_LSN) is newer than what has been persisted in redo log (log_sys.write_lsn)? Then, between flush batches, trigger the log flush (if it cannot be done by the dedicated log writer task; see MDEV-14462) to ensure progress. I think that we should change the type of log_sys.write_lsn and log_sys.lsn to std::atomic<lsn_t>.

Furthermore, I think that we should try to remove buf_pool_t::flush_rbt (whose purpose is to ensure during redo log apply that buf_pool_t::flush_list is kept in a more reasonable order) and try to always keep the flush_list sorted in a way that tries to ensure optimal progress. Maybe use std::priority_queue?

We should remove the BUF_FLUSH_SINGLE_PAGE, but I think that we may have to keep both LRU and flush_list flushing. I may be mistaken, but based on MDEV-14550 I have the impression that in the worst case, the LRU and flush_list batches are taking too much time, blocking the progress of the other type of flushing batch. We should try to minimize mutex contention while keeping the I/O system saturated with useful page flushes (with large values of innodb_max_dirty_pages_pct, try to avoid repeated writes of a frequently changed page).

Comment by Marko Mäkelä [ 2019-11-28 ]

As part of this task, I think that the following threads that MDEV-16264 failed to convert to tasks should be removed or refactored:

  • fil_crypt_thread (its sole purpose is to make pages dirty, so that they will be re-encrypted on page flush)
  • recv_writer_thread and buf_flush_page_cleaner_coordinator have somewhat overlapping functionality. Not having a separate ‘crash recovery’ mode would seem to be a requirement of MDEV-14481.
Comment by Marko Mäkelä [ 2019-12-11 ]

Mark Callaghan’s blog (re)post Historical - InnoDB IO Performance provides some historical overview that could be relevant for this task.

Comment by Marko Mäkelä [ 2020-02-28 ]

It appears that XtraDB used to disable the single-page flushing when using a larger buffer pool, via the Boolean-disguised-as-enum parameter innodb_empty_free_list_algorithm=BACKOFF. The logic would sometimes fail, as reported in MDEV-16339.

I think that we should consider removing the single-page flushing nevertheless. Perhaps there could be active signaling between the page cleaner and the buf_LRU_get_free_block(), instead of passive sleeping?

Comment by Marko Mäkelä [ 2020-10-28 ]

Now that most of the work has been completed in MDEV-23399 and MDEV-23855, I think that the main work to be done will be in MDEV-12227, MDEV-23756 and possibly other tickets that have been linked as related.

Because this is mostly an umbrella task, having an estimate is not too meaningful.

Comment by Vladislav Vaintroub [ 2021-01-26 ]

I adjusted the priority to "Major", eventhough it is probably "minor" , for the rest of the things left there.

Comment by Marko Mäkelä [ 2021-10-26 ]

I think that most of this has already been done. One remaining tweak is MDEV-26827. We failed to observe any improvement with it; in fact, a small regression was observed. So, it will need some additional work.

Comment by Marko Mäkelä [ 2023-08-15 ]

After MDEV-26055 and MDEV-26827 have been completed, there is not much that can be improved in this area. I can only think of MDEV-11378, submitting scatter-gather write requests instead of multiple single-page write requests.

Generated at Thu Feb 08 08:29:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.