[MDEV-26827] Make page flushing even faster Created: 2021-10-14  Updated: 2024-01-19  Resolved: 2023-03-16

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5, 10.6, 10.7
Fix Version/s: 10.11.3, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: performance

Issue Links:
Blocks
blocks MDEV-16526 Overhaul the InnoDB page flushing Closed
blocks MDEV-33053 InnoDB LRU flushing does not run befo... Closed
is blocked by MDEV-25113 Reduce effect of parallel background ... Closed
is blocked by MDEV-26055 Adaptive flushing is still not gettin... Closed
is blocked by MDEV-27058 Buffer page descriptors are too large Closed
Problem/Incident
causes MDEV-30900 MacOS Crash when testing Closed
causes MDEV-31084 Assertion `waiting' failed in TP_conn... Closed
causes MDEV-31114 Assertion `!tls_worker_data->is_waiti... Closed
causes MDEV-31309 The status variable Innodb_buffer_poo... Closed
causes MDEV-31350 test innodb.recovery_memory failed on... Closed
causes MDEV-32029 Assertion failures in log_sort_flush_... Closed
causes MDEV-32588 InnoDB may hang when running out of b... Closed
causes MDEV-33275 buf_flush_LRU(): mysql_mutex_assert_o... Closed
Relates
relates to MDEV-28052 test main.implicit_commit crashed on ... Needs Feedback
relates to MDEV-31048 InnoDB read_slots and write_slots are... Closed
relates to MDEV-32134 InnoDB hang in buf_flush_wait_LRU_bat... Closed
relates to MDEV-32511 Race condition between page write com... Closed
relates to MDEV-26004 Excessive wait times in buf_LRU_get_f... Closed
relates to MDEV-31350 test innodb.recovery_memory failed on... Closed
relates to MDEV-32681 Test case innodb.undo_truncate_recove... Closed

Description

MDEV-25113 removed the acquisition of buf_pool.flush_list_mutex from buf_page_write_complete(). However, acquiring buf_pool.mutex there should not really be necessary except in eviction flushing (request.is_LRU()) or when the block->lock is not available (no uncompressed copy of a ROW_FORMAT=COMPRESSED page exists in the buffer pool).

In the common case of checkpoint flushing, we can actually rely on the block->lock to protect us. The only other reason to hold buf_pool.mutex was updating the count of outstanding page writes (buf_pool.n_flush_list()). That counter is actually redundant, because we can refer to write_slots->pending_io_count().

Finally, we would remove pthread_cond_broadcast(&buf_pool.done_flush_list) from buf_page_write_complete(). Only the buf_flush_page_cleaner would call that. To wait for page writes to complete, os_aio_wait_until_no_pending_writes() may be invoked.
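
To illustrate the idea of relying on the I/O layer's own count of outstanding writes instead of a counter guarded by buf_pool.mutex, here is a minimal sketch. It is not InnoDB code: the class toy_write_slots and its members are hypothetical stand-ins, and only the names pending_io_count() and wait_until_no_pending_writes() are borrowed from the text above.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Hypothetical stand-in for an asynchronous write slot pool; not InnoDB code.
class toy_write_slots
{
  std::mutex mutex;              // private to the I/O layer, not pool-wide
  std::condition_variable idle;  // notified when the count drops to zero
  std::size_t count= 0;          // writes submitted but not yet completed
public:
  // Called when a page write is handed to the I/O layer.
  void submit()
  {
    std::lock_guard<std::mutex> g(mutex);
    count++;
  }

  // Called from the write completion callback.
  void complete()
  {
    std::lock_guard<std::mutex> g(mutex);
    assert(count > 0);
    if (!--count)
      idle.notify_all();
  }

  // Number of writes that are still in flight.
  std::size_t pending_io_count()
  {
    std::lock_guard<std::mutex> g(mutex);
    return count;
  }

  // Block the caller until every submitted write has completed.
  void wait_until_no_pending_writes()
  {
    std::unique_lock<std::mutex> l(mutex);
    idle.wait(l, [this] { return count == 0; });
  }
};
```

In this sketch a waiter only ever touches the I/O layer's private mutex, so write completion does not contend with threads that hold a pool-wide mutex for other reasons.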

Interface change:
A new status variable, Innodb_buffer_pool_pages_split, has been added.



Comments
Comment by Krunal Bauskar [ 2021-10-20 ]

I tested this on ARM and x86.
On x86, I see an improvement of 2-7%.
On ARM, I see an improvement of 2-5%.

Comment by Marko Mäkelä [ 2021-10-20 ]

krunalbauskar, thank you! Was that https://github.com/MariaDB/server/commit/c95aa553b9db56e2013f4e5db34f90eda28af995 (or with some minor fixups done today)? That branch also included MDEV-26826 and MDEV-26828.

Comment by Marko Mäkelä [ 2021-10-22 ]

In our internal testing, we are not observing any performance improvement with this change, no matter what type of configuration is used. In fact, there is a small regression. So, this one will require some more work. The other fixes (MDEV-26826, MDEV-26828, MDEV-26769) seem to help a little.

Comment by Marko Mäkelä [ 2021-11-08 ]

Much of the time, on page write completion we would also invoke buf_dblwr_t::write_completed(), which will acquire and release buf_dblwr.mutex. It does not seem easy to remove, other than by removing the need for the doublewrite buffer (see MDEV-11659). That could be the next concurrency bottleneck after buf_pool.mutex.

Comment by Marko Mäkelä [ 2021-11-18 ]

We are still observing a performance regression after rebasing this on MDEV-27058. Hence, I think that we must preserve an explicit counter of the number of outstanding page writes. We could protect that counter with buf_dblwr.mutex, which we could acquire also for page writes that skipped the doublewrite buffer. Instead of removing the condition variable buf_pool.done_flush_list, we can replace it with something that pairs with buf_dblwr.mutex.

Comment by Marko Mäkelä [ 2022-01-18 ]

The first part of this was pushed as MDEV-27416, to fix a rare hang during checkpoint. That change did not cause any performance regression in our internal tests, but it did cause MDEV-27499.

The remaining 2 commits were last rebased on MDEV-14425, and an observed regression remained.

I am keeping this ticket "in progress" as a reminder for me that I consider this an important issue that needs to be resolved. After all, performance tests of MDEV-14425 indicate that by far the most contended mutexes under write load are log_sys.mutex and buf_pool.mutex, and this change aims to reduce contention on the latter.

Comment by Marko Mäkelä [ 2022-02-21 ]

A further idea: Why should the LRU eviction flushing actually evict each page on write completion? It would seem better to keep the write completion function as simple as possible, and to evict pages in the actual user threads that need to allocate a page. Page allocation will already have to hold buf_pool.mutex anyway, even when no page is being evicted.

Furthermore, the buf_flush_page_cleaner() thread could initiate two types of page writes, like it used to do before MDEV-23855: Not only write blocks ordered by buf_pool.flush_list, but also some dirty blocks ordered by the buf_pool.LRU list. That would ensure that some of the least frequently used pages are clean, and simply allow a less frequently used block to be replaced by whatever needs to allocate the page. Pages that do not need to be written back to data files can be reused instantly.
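
The division of labour proposed in this comment could be sketched roughly as follows. This is a toy model, not InnoDB code: toy_page, toy_pool and evict_one() are hypothetical names, and the sketch ignores the latching that a real buffer pool needs to keep a page from being evicted while a write completion still refers to it.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <list>
#include <mutex>

// Hypothetical toy types for illustration; not InnoDB code.
struct toy_page
{
  uint32_t id;
  std::atomic<bool> dirty;
  toy_page(uint32_t id, bool dirty) : id(id), dirty(dirty) {}
};

class toy_pool
{
  std::mutex mutex;         // stands in for a pool-wide mutex
  std::list<toy_page> lru;  // front = most recently used
public:
  // Register a page as the most recently used one.
  toy_page &add(uint32_t id, bool dirty)
  {
    std::lock_guard<std::mutex> g(mutex);
    lru.emplace_front(id, dirty);
    return lru.front();
  }

  // Write completion: only mark the page clean.
  // No pool mutex is taken and nothing is evicted here.
  static void write_complete(toy_page &page)
  {
    assert(page.dirty.load(std::memory_order_relaxed));
    page.dirty.store(false, std::memory_order_release);
  }

  // Called by a thread that needs a page frame: evict the least
  // recently used clean page, if there is one.
  bool evict_one(uint32_t &victim)
  {
    std::lock_guard<std::mutex> g(mutex);
    for (auto it= lru.end(); it != lru.begin(); )
    {
      --it;
      if (!it->dirty.load(std::memory_order_acquire))
      {
        victim= it->id;
        lru.erase(it);
        return true;
      }
    }
    return false;  // every page is dirty; the caller must wait for a write
  }
};
```

The allocating thread has to take the pool mutex anyway, so moving the eviction there keeps the completion path down to a single atomic store in this model.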

Comment by Marko Mäkelä [ 2022-10-17 ]

In MDEV-29383 I noted the following: Performance could be improved if we did not set mtr_t::m_made_dirty already when registering MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX, but deferred it until the moment we set the MTR_MEMO_MODIFY flag on a block. In that way, even if a mini-transaction acquired a U or X latch on a page but never modified that page, mtr_t::commit() could avoid acquiring log_sys.flush_order_mutex. We only need that mutex when the mini-transaction actually needs to add a previously clean block to buf_pool.flush_list.
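
A minimal sketch of the deferred "made dirty" idea, under the assumption that a mini-transaction can be modelled as a list of latched blocks with a per-block modified flag. The types toy_block and toy_mtr are hypothetical and do not mirror the real mtr_t interface.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical toy types for illustration; not InnoDB code.
struct toy_block { bool in_flush_list= false; };

class toy_mtr
{
  struct slot { toy_block *block; bool modified; };
  std::vector<slot> memo;
  bool made_dirty= false;
public:
  // Registering an exclusive latch does not set made_dirty.
  void x_latch(toy_block *b) { memo.push_back({b, false}); }

  // Only an actual modification of a previously clean block sets it.
  void set_modified(toy_block *b)
  {
    bool found= false;
    for (slot &s : memo)
      if (s.block == b)
      {
        found= true;
        s.modified= true;
        if (!b->in_flush_list)
          made_dirty= true;
      }
    assert(found);  // the block must have been latched first
  }

  // Returns true if the flush order mutex had to be acquired.
  bool commit(std::mutex &flush_order_mutex)
  {
    const bool need_mutex= made_dirty;
    if (need_mutex)
    {
      std::lock_guard<std::mutex> g(flush_order_mutex);
      for (slot &s : memo)
        if (s.modified)
          s.block->in_flush_list= true;
    }
    memo.clear();
    made_dirty= false;
    return need_mutex;
  }
};
```

In this model, a mini-transaction that latches a block without modifying it, or that modifies a block that is already in the flush list, commits without touching the flush order mutex.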

Comment by Marko Mäkelä [ 2023-03-07 ]

After I rebased this on an attempted fix of MDEV-26055, I realized that a follow-up to MDEV-25113 is needed, now that page write completion no longer involves acquiring buf_pool.mutex. All use of buf_pool.flush_hp will have to be evaluated carefully. The ‘dirtiness’ of blocks will be protected by buf_page_t::lock, which needs to be acquired earlier during flushing.

Furthermore, I realized that buf_page_t::write_complete() may not need to perform buf_pool.LRU eviction at all; starting with a fix of MDEV-26055, that would be done in larger batches by buf_flush_page_cleaner(). This would make IORequest::is_LRU() redundant.

Generated at Thu Feb 08 09:48:15 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.