[MDEV-23399] 10.5 performance regression with IO-bound tpcc Created: 2020-08-04  Updated: 2022-04-27  Resolved: 2020-10-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.4
Fix Version/s: 10.5.7

Type: Bug Priority: Critical
Reporter: Axel Schwenke Assignee: Marko Mäkelä
Resolution: Fixed Votes: 7
Labels: performance, regression

Attachments: File my.cnf.01    
Issue Links:
Blocks
blocks MDEV-12227 Defer writes to the InnoDB temporary ... Closed
blocks MDEV-14481 Execute InnoDB crash recovery in the ... Closed
blocks MDEV-16526 Overhaul the InnoDB page flushing Closed
blocks MDEV-23720 Change innodb_log_optimize_ddl=OFF by... Closed
blocks MDEV-23756 Implement event-driven innodb_adaptiv... Open
is blocked by MDEV-23410 buf_LRU_scan_and_free_block() fails t... Closed
Problem/Incident
causes MDEV-25072 Hang + Heap Corruption: SIGABRT in __... Closed
causes MDEV-25773 sysbench-TPCC failed to prepare with ... Closed
causes MDEV-25801 DROP DATABASE very slow after innodb ... Closed
causes MDEV-27022 Buffer pool is being flushed during r... Closed
causes MDEV-28371 Assertion fold == id.fold() failed in... Closed
is caused by MDEV-15053 Reduce buf_pool_t::mutex contention Closed
is caused by MDEV-15058 Remove multiple InnoDB buffer pool in... Closed
Relates
relates to MDEV-11384 AliSQL: [Feature] Issue#19 BUFFER POO... Closed
relates to MDEV-21452 Use condition variables and normal mu... Closed
relates to MDEV-23719 Make lock_sys use page_id_t Closed
relates to MDEV-23855 InnoDB log checkpointing causes regre... Closed
relates to MDEV-24022 mariabackup.undo_space_id failed in ... Closed
relates to MDEV-24278 InnoDB page cleaner keeps waking up o... Closed
relates to MDEV-24913 Assertion !recv_no_log_write in log_w... Closed
relates to MDEV-26004 Excessive wait times in buf_LRU_get_f... Closed
relates to MDEV-27466 Deadlock when purge thread races the ... Closed
relates to MDEV-28052 test main.implicit_commit crashed on ... Needs Feedback
relates to MDEV-28415 ALTER TABLE on a large table hangs In... Closed
relates to MDEV-13670 [Note] InnoDB: page_cleaner: 1000ms i... Closed
relates to MDEV-14550 Error log flood : "InnoDB: page_clea... Closed
relates to MDEV-21330 Lock monitor doesn't print a semaphor... Closed
relates to MDEV-24024 innodb.ibuf_not_empty failed in buildbot Closed

 Description   

Triggered by this blog post from Percona.
The problem could be reproduced with sysbench-tpcc and Percona settings: a 25G buffer pool for a 100G data set (1000 warehouses), datadir located on SSD, tpcc workload with 32 benchmark threads on hardware with 16 cores/32 hyperthreads.
Throughput starts high and then decreases over a varying time period (500 .. 1200 seconds) to reach ~200 tps. Performance Schema shows a lot of time spent on buf_pool_mutex. CPU usage of the mariadbd process is rather low, around 300%.
MySQL 8.0 does not show this problem. MariaDB 10.5.4 performs better than a pre-10.5.5 snapshot.



 Comments   
Comment by Axel Schwenke [ 2020-08-04 ]

PMP stack dump and flame graph here

Comment by Marko Mäkelä [ 2020-08-04 ]

In the flame graph, buf_LRU_get_free_block() caught my attention. I attached GDB to the server at the time of degraded performance, and I found evidence for my hypothesis that the buffer pool is filled with dirty pages:

10.5 c12d24e29162f35f4fb751b761a0fbd84db19498

(gdb) p buf_pool.LRU
$1 = {count = 1610618, start = 0x7f6500160f30, end = 0x7f6120069520, node = &buf_page_t::LRU}
(gdb) p buf_pool.flush_list
$2 = {count = 1607508, start = 0x7f64e8147b70, end = 0x7f636013eaf0, node = &buf_page_t::list}
(gdb) buf_LRU_free_from_common_LRU_list
(gdb) finish
(gdb) info locals
scanned = 134468

We are conducting a linear search among the 1.6 million buffer pool blocks, looking for one of the roughly 3,000 replaceable blocks. One search that I checked finished after 134,468 iterations. While doing this, we block pretty much the entire buffer pool by holding buf_pool.mutex.
Before MDEV-15058, we could have multiple buffer pool instances, and such a scan would only block a single instance, possibly letting some InnoDB threads proceed using the other buffer pools.

I think that this can be fixed by introducing a new buf_pool.clean list that would contain the subset of buf_pool.LRU that is not part of either buf_pool.flush_list or buf_pool.unzip_LRU. Then the 134,468 iterations over the buf_pool.LRU list would be replaced by visiting the start of buf_pool.clean, reducing the O(n) complexity to O(1).
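The difference between the current scan and the proposed clean list can be sketched as follows. This is a simplified model, not InnoDB's actual data structures; std::list stands in for the intrusive UT_LIST, and the field names only mirror the real ones:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>

// A block is replaceable when it is clean (oldest_modification == 0)
// and not buffer-fixed (buf_fix_count == 0).
struct block {
  uint64_t oldest_modification; // 0 = clean
  uint32_t buf_fix_count;       // 0 = not pinned
};

// Current behaviour: O(n) linear scan of the whole LRU list while
// holding buf_pool.mutex.
block* scan_lru(std::list<block>& lru, size_t& scanned) {
  scanned = 0;
  for (block& b : lru) {
    ++scanned;
    if (b.oldest_modification == 0 && b.buf_fix_count == 0) return &b;
  }
  return nullptr;
}

// Proposed behaviour: a separate list of clean pages makes the common
// case O(1); only buffer-fixed pages still have to be skipped.
block* take_clean(std::list<block*>& clean) {
  while (!clean.empty()) {
    block* b = clean.front();
    clean.pop_front();
    if (b->buf_fix_count == 0) return b;
  }
  return nullptr;
}
```

With 1,000 dirty blocks ahead of a single clean one, scan_lru visits every block, while take_clean finds a candidate immediately.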

Comment by Marko Mäkelä [ 2020-08-04 ]

For MDEV-15058, we were actually concerned about a regression like this. We did test a scenario where the buffer pool is much smaller than the database, but we made the mistake of covering only relatively small buffer pools, using most of the RAM for /dev/shm so that we would minimize the I/O overhead and amplify the CPU impact. Because we used a tiny buffer pool, we only exercised the critical section's O(n) scan with a small value of n.

axel reproduced this problem with a few dozen threads. At a larger number of connections, I imagine it could become even more disastrous. The fact that buf_LRU_get_free_block() is holding buf_pool.mutex during the O(n) scan is not only blocking other threads that are trying to use the buffer pool, but it is also preventing other threads from helping with the situation, because each of the following requires buf_pool.mutex:

  • Flushing dirty pages
  • Making freed pages available for reuse: buf_release_freed_page()

Another run with a significantly larger number of concurrent connections showed buf_flush_single_page_from_LRU() as a CPU hotspot. I think that it, too, can be explained by contention on buf_pool.mutex between multiple threads executing buf_LRU_get_free_block(), but we cannot confirm this before a fix is available. Single-page flushing is something that we would like to remove in MDEV-16526.

I suspect that multiple buffer pool instances simply worked around this problem by allowing multiple page cleaner threads to work concurrently on other buffer pool instances while a lengthy scan was blocking one buffer pool instance.

Comment by Marko Mäkelä [ 2020-08-05 ]

I noticed a more obvious regression that I filed as MDEV-23410. It was introduced by MDEV-15053, which was completed after MDEV-15058. That could also explain why we did not observe this anomaly while testing MDEV-15058.

I started working on a fix that basically replaces the debug-only list buf_pool.zip_clean with a buf_pool.clean list, which would be a subset of the buf_pool.LRU list. In this way, the LRU replacement algorithm can scan a shorter list of clean pages instead of a list of all pages. Even if a page is clean, it cannot be evicted if it is currently buffer-fixed. So, we cannot avoid an O(n) scan completely, but we can make n smaller and in effect move a condition outside the iteration loop.

I still have to solve one problem: when a clean page needs to be inserted into the middle of the buf_pool.LRU list, we must find the corresponding insert position in the buf_pool.clean list, and this would require an O(n) scan while holding buf_pool.mutex. To make matters worse, this scan would be executed during normal operation, even when no page eviction is needed. We might perhaps order buf_pool.clean in a different way, while still offering reasonable guarantees that a more recently used clean page would not be evicted before a less recently used dirty page.

Comment by Marko Mäkelä [ 2020-08-07 ]

I think that we must consider the following changes:

  • Revive the XtraDB option innodb_empty_free_list_algorithm to disable the single-page flushing (and maybe even enable backoff by default, and prepare a path for deprecating, ignoring, and removing the option)
  • Split buf_pool.LRU into multiple partitions, and keep track of dirty pages in each partition, to allow the LRU replacement to avoid useless traversals, and to prioritize flushing.
  • Do not allow the buf_pool.flush_list processing to block LRU flushing for long periods of time.

I now think that the following options are a dead end, because they could significantly slow down the server in the case when the buffer pool is large enough to hold the entire database:

  • Introducing a buf_pool.clean subset of buf_pool.LRU that would exclude those pages that need flushing
  • Replacing buf_pool.LRU with such a subset of its former self
  • Replacing the linked list with a priority queue, to reduce the computational complexity of the page eviction from O(n)

We have frequent operations that would be affected by such changes:

  • Making a block the most recently used: O(1) with a linked list
  • Inserting a block at the correct position of the ‘clean’ list after it has been flushed: currently not needed; would be O(n) with a linked list.

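The trade-off between those two operations can be made concrete with a sketch, using std::list in place of InnoDB's intrusive lists (illustrative only; the function names are mine):

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <list>

// Making a block the most recently used: O(1) relinking via splice,
// with no allocation and no search.
void make_mru(std::list<int>& lru, std::list<int>::iterator it) {
  lru.splice(lru.begin(), lru, it);
}

// Hypothetical LSN-ordered insert into a 'clean' list: requires an
// O(n) position search, here over ascending LSN values, while
// buf_pool.mutex would be held.
void ordered_insert(std::list<int>& clean, int lsn) {
  auto pos = std::find_if(clean.begin(), clean.end(),
                          [lsn](int x) { return x > lsn; });
  clean.insert(pos, lsn);
}
```

The first operation is executed on essentially every page access; the second would be a new cost on every flush, which is why the ordered ‘clean’ list was judged a dead end.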
Comment by Marko Mäkelä [ 2020-08-19 ]

So far, I have done the following:

  • Removed single-page flushing. Instead, a flushing batch will be initiated or waited for. One debug-instrumented test is failing due to something related to the change buffering.
  • Removed a blocking wait for buf_block_t::lock SX-latch in buf_pool.flush_list flushing.
  • Use a normal mutex and condition variable (via simple mysql_mutex_t and mysql_cond_t wrappers) for synchronization, instead of using the ‘homebrew’ InnoDB mutex and event.

I am experimenting with one further idea:

  • During a page flushing batch, skip those pages whose modifications are newer than what has been durably written to the redo log.
  • At the start of a batch, ensure that enough redo log has been written so that progress can be guaranteed.

I hope that this will sufficiently address the regression, so that it would not be necessary to refine buf_pool.LRU in a GA release.

Comment by Marko Mäkelä [ 2020-08-26 ]

If we cannot find a simple change in 10.5 that can fix the performance regression (some trivial bug like MDEV-23410), we may have to rewrite the entire page flushing. I am working on it, and it might end up either in the 10.5 GA release or in the next major release (10.6). The rewrite is essentially blocking MDEV-12227, MDEV-14481, MDEV-16526.

Comment by Bernardo Perez [ 2020-09-09 ]

Hello,

I am following this report with great interest. I am not sure whether you can answer this at this stage, but do you think the fix for this issue will make it into 10.5.6?

If the answer is yes, do we have an estimate of when 10.5.6 could be ready?
If the answer is no, could you say when this will land, and in which version?

Thanks in advance.

Comment by Sergei Golubchik [ 2020-09-09 ]

The planned release road map is on the default Jira dashboard, you should see it at https://jira.mariadb.org unless you've created a custom dashboard in your account. Currently the road map says 2020-10-30 for 10.5.6.

Yes, the fix for this issue hopefully will be in.

Comment by Marko Mäkelä [ 2020-09-10 ]

My development branch simplifies the page flushing a lot by removing single-page flushing and by removing many waits from the page cleaner thread. The page cleaner thread will only perform one type of page flushing. Today I resumed development and fixed a long-standing problem: LRU page eviction would too eagerly evict more recently used pages if there were unwritten changes for less recently used pages. With today’s changes, the performance regression in one type of workload seemed to go away.

The changes will have to be tested extensively, both for performance and correctness in various types of workloads. There are a few more things that I think could be improved in this area. If performance turns out to be acceptable for the current changes, we might minimize the risk and implement the remaining ideas in the 10.6 release series.

Comment by Marko Mäkelä [ 2020-09-14 ]

I have now simplified the page cleaner thread to a single loop that performs a single type of page flushing. ‘LRU flushing batches’ are only initiated by user threads when the buffer pool is running out of free pages.

I am wondering whether there is any real use case for setting the parameter innodb_io_capacity. The page cleaner could be simplified further if we could remove the timed wait and the call to srv_check_activity(), and just keep flushing pages basically at full capacity at all times. If this is acceptable, the page cleaner could be triggered by signaling its condition variable whenever innodb_max_dirty_pages_pct is exceeded. That parameter (as well as innodb_log_file_size) could still be used to control the eagerness of page flushing.
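An event-driven page cleaner of the kind proposed here could look roughly like this. This is a sketch using standard C++ primitives rather than the actual mysql_mutex_t/mysql_cond_t wrappers; the struct, names, and threshold are illustrative assumptions, not InnoDB code:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread> // only needed by the usage example

// The cleaner sleeps on a condition variable and is woken by user
// threads when the dirty-page ratio exceeds the threshold
// (innodb_max_dirty_pages_pct in the discussion above).
struct page_cleaner {
  std::mutex m;
  std::condition_variable cond;
  double dirty_pct = 0;
  const double max_dirty_pct = 75.0;
  bool shutdown = false;

  // Called by user threads after dirtying pages.
  void note_dirty(double pct) {
    std::lock_guard<std::mutex> g(m);
    dirty_pct = pct;
    if (dirty_pct > max_dirty_pct) cond.notify_one();
  }

  // Body of the page cleaner thread: no timed wait, purely event-driven.
  void run() {
    std::unique_lock<std::mutex> lk(m);
    while (!shutdown) {
      cond.wait(lk, [this] { return shutdown || dirty_pct > max_dirty_pct; });
      if (shutdown) break;
      dirty_pct = 0; // stand-in for "flush a batch of dirty pages"
    }
  }

  void stop() {
    { std::lock_guard<std::mutex> g(m); shutdown = true; }
    cond.notify_one();
  }
};
```

Here run() would be the cleaner thread's body, and user threads would call note_dirty() after modifying pages; removing the timed wait is exactly what detaches the cleaner from innodb_io_capacity.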

Comment by Marko Mäkelä [ 2020-09-15 ]

The parameters innodb_io_capacity (and innodb_io_capacity_max) are observed by the following subsystems:

  • Loading a buffer pool dump will be constrained by the innodb_io_capacity.
  • Since MDEV-19514, a complete change buffer merge is executed only on shutdown with innodb_fast_shutdown=0, in a loop interleaved with emptying the ‘background drop table’ queue. The innodb_io_capacity limits the size of the individual batches of change buffer merge, which might be useful if there are many buffered changes for tables that are to be dropped.
  • Page flushing aims to wake up once per second and be throttled by innodb_io_capacity.

Encryption (actually page-dirtying) threads are controlled by a separate parameter innodb_encryption_rotation_iops.

If we make the page flushing purely event-driven and remove the ‘wake up at least once per second’ logic, we will have to detach the flushing from innodb_io_capacity. But I think that we must preserve the parameter innodb_io_capacity to control the loading of buffer pool dumps. We should probably deprecate and ignore innodb_io_capacity_max.

Comment by Rick Pizzi [ 2020-09-15 ]

If innodb_io_capacity will only control buffer pool dump load, please let's rename it accordingly then.

Comment by Marko Mäkelä [ 2020-09-15 ]

I agree that overloading configuration parameters for multiple unrelated purposes is a bad practice.

I missed one more configuration parameter that we have: innodb_adaptive_flushing is enabled by default and documented as follows:

Attempt flushing dirty pages to avoid IO bursts at checkpoints.

I think that with innodb_adaptive_flushing=ON, it makes sense for the page cleaner to obey the innodb_io_capacity limit. My revised proposal is as follows:

  • Other threads will signal the page cleaner when the proportion of dirty blocks exceeds innodb_max_dirty_pages_pct_lwm.
  • When innodb_adaptive_flushing=OFF, whenever the page cleaner is woken up, it will flush enough to satisfy innodb_max_dirty_pages_pct.
  • When innodb_adaptive_flushing=ON (default), the page cleaner will wake up at least once per second, and it will limit the amount of work by innodb_io_capacity.

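The revised proposal boils down to a small decision table; as a hypothetical sketch (the type and function names are mine, not InnoDB's):

```cpp
#include <cassert>

// What the page cleaner does on wakeup, depending on
// innodb_adaptive_flushing.
struct flush_plan {
  bool timed_wakeup;            // wake at least once per second?
  bool throttle_by_io_capacity; // limit work by innodb_io_capacity?
};

flush_plan plan_flushing(bool adaptive_flushing) {
  if (adaptive_flushing)
    return {true, true}; // default: periodic, throttled flushing
  // OFF: woken only by signal from other threads; flush enough to
  // satisfy innodb_max_dirty_pages_pct, without the io_capacity limit.
  return {false, false};
}
```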
Comment by Vladislav Vaintroub [ 2020-09-15 ]

There is an ambiguity in this proposal: with innodb_adaptive_flushing=ON, when innodb_max_dirty_pages_pct_lwm is exceeded, limiting the amount of work by innodb_io_capacity might not be enough.

Comment by Marko Mäkelä [ 2020-09-18 ]

I filed MDEV-23756 for the experimental idea regarding innodb_adaptive_flushing=OFF. I do not think that it is necessary for the 10.5 GA release.

Comment by Marko Mäkelä [ 2020-09-25 ]

axel pointed out that innodb_buffer_pool_pages_flushed appears to be less than the actual number of writes, that is, buf_flush_stats() is not being invoked in every code path. That should be fixed as part of this ticket.

Comment by Marko Mäkelä [ 2020-10-02 ]

wlad, please review the squashed commit.

axel and krunalbauskar, please test the performance. I think that we must deal with MDEV-23855 separately.

mleich, please run the wide battery of stress tests. In previous tests more than a week ago, some corruption or crashes on crash recovery occurred. I believe that the problem may have been fixed since then.

Comment by Bernardo Perez [ 2020-10-15 ]

Hello,

I can see that this JIRA has moved to "stalled". I was wondering if it would be possible to get an understanding of the current state of the fix, and a target minor version where this fix will arrive.

Thanks in advance.

Comment by Marko Mäkelä [ 2020-10-15 ]

This scenario (write-heavy workload that does not fit in the buffer pool) was addressed by rewriting most of the page cleaner thread and page flushing, by simplifying related data structures and reducing mutex operations. LRU flushing will now only be initiated by user threads, and the page cleaner thread will perform solely checkpoint-related flushing. There is no single-page flushing anymore, and the page cleaner will not wait for log writes or page latches.

Performance will be improved further in MDEV-23855 for write-heavy cases where all data does fit in the buffer pool. Among other things, that will remove contention on fil_system.mutex between the page cleaner and threads executing write completion callbacks. The work is mostly done.

Comment by Yap Sok Ann [ 2021-03-08 ]

... Furthermore, if the FIL_PAGE_LSN of a page is ahead of log_sys.get_flushed_lsn(), that is, what has been persistently written to the redo log, we would trigger a log flush and then resume the page flushing. This would unnecessarily limit the performance of the page cleaner thread and trigger the infamous messages "InnoDB: page_cleaner: 1000ms intended loop took 4450ms. The settings might not be optimal" that were suppressed in commit d1ab89037a518fcffbc50c24e4bd94e4ec33aed0 unless log_warnings>2.

Our revised algorithm will make log_sys.get_flushed_lsn() advance at the start of buf_flush_lists(), and then execute a 'best effort' to write out all pages. The flush batches will skip pages that were modified since the log was written, or that are currently exclusively locked.
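The skip condition of such a batch can be expressed as a simple predicate. This is a sketch with illustrative field names, not the actual InnoDB structures:

```cpp
#include <cassert>
#include <cstdint>

// A batch may write a page only if its newest change (FIL_PAGE_LSN) is
// already durable in the redo log (the write-ahead-logging rule) and
// the page is not exclusively latched by another thread.
struct page_state {
  uint64_t newest_lsn; // FIL_PAGE_LSN of the page frame
  bool x_latched;      // exclusively latched right now?
};

bool batch_may_flush(const page_state& p, uint64_t flushed_to_disk_lsn) {
  return !p.x_latched && p.newest_lsn <= flushed_to_disk_lsn;
}
```

Because the durable LSN is sampled once at the start of buf_flush_lists(), the batch never has to wait for a log write in the middle of flushing; skipped pages are simply picked up by a later batch.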

This seems like a very nice design, but I have some concern about how it was done previously, and how it is still being done in the latest MySQL/Percona:

  1. Call log_write_up_to() with the newest LSN of the modified page
  2. Write out the modified page

As the block mutex is not held, does it mean that in between step 1 and step 2, some mtr can always further modify the page with a newer LSN?

If that's the case, a crash after step 2 would mean that the data files are now ahead of the redo log. What would be the consequences of that?

Sorry if this is a noob question. I am rather interested in InnoDB page flushing performance, and after trying to understand the code a little (still stuck with PXC 5.6 here), I am really curious what the point of step 1 is if it can't guarantee anything.

Comment by Marko Mäkelä [ 2021-08-17 ]

sayap, sorry, I did not notice your comment. Generally, https://mariadb.zulipchat.com/ would be a better platform for such discussions.

In MDEV-25948 we actually backtracked a little and removed the log_flush_task that would potentially reduce the amount of calls to log_flush_up_to(). There were several improvements to page flushing performance in MariaDB 10.5.12 and 10.6.4, and our testing in MDEV-25451 is indicating rather stable throughput.

The block mutex was removed already in MDEV-15053. I suppose that you mean the page latch? I think that we always hold the page latch when writing out a modified page. Before we write it, we ensure that the FIL_PAGE_LSN is not ahead of the durable position of the write-ahead log. Page writes are generally optional (MDEV-24626 removed the last exception). Only for log checkpoints must we advance MIN(oldest_modification) by page writes.

Generated at Thu Feb 08 09:22:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.