[MDEV-23399] 10.5 performance regression with IO-bound tpcc Created: 2020-08-04 Updated: 2022-04-27 Resolved: 2020-10-15 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5.4 |
| Fix Version/s: | 10.5.7 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Axel Schwenke | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 7 |
| Labels: | performance, regression | ||
| Attachments: |
| Issue Links: |
| Description |
|
Triggered by this blog post from Percona. |
| Comments |
| Comment by Axel Schwenke [ 2020-08-04 ]
|
PMP stack dump and flame graph here
| Comment by Marko Mäkelä [ 2020-08-04 ]
|
In the flame graph, buf_LRU_get_free_block() caught my attention. I attached GDB to the server at the time of degraded performance, and I think that I found proof of my hypothesis that the problem is that the buffer pool is filled with dirty pages:

We are conducting a linear search among the 1.6 million buffer pool blocks, searching for one of the roughly 3,000 replaceable blocks. One search that I checked finished after 134,468 iterations. While we are doing this, we are blocking pretty much the entire buffer pool by holding buf_pool.mutex. I think that this can be fixed by introducing a new buf_pool.clean list that would contain the subset of buf_pool.LRU that is not part of either buf_pool.flush_list or buf_pool.unzip_LRU. Then, the 134,468 iterations over the buf_pool.LRU list would be replaced by visiting the start of buf_pool.clean, replacing O(n) complexity with O(1).
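The problem and the proposed fix can be sketched in miniature. The following is a hypothetical C++ model (Block, BufPool, and replaceable() are illustrative names, not InnoDB's actual structures) contrasting the O(n) LRU scan with an O(1) lookup in a buf_pool.clean-style list:

```cpp
#include <cassert>
#include <cstddef>
#include <list>

struct Block {
  bool dirty = false;  // unwritten changes: would be on buf_pool.flush_list
  int fix_count = 0;   // buffer-fix count; nonzero means the block is in use
  bool replaceable() const { return !dirty && fix_count == 0; }
};

struct BufPool {
  std::list<Block*> lru;    // every block, most recently used first
  std::list<Block*> clean;  // proposed subset: replaceable candidates only

  // Current behaviour: O(n) scan of the whole LRU list from the tail,
  // conceptually while holding buf_pool.mutex the whole time.
  Block* get_free_block_scan(std::size_t* iterations) const {
    *iterations = 0;
    for (auto it = lru.rbegin(); it != lru.rend(); ++it) {
      ++*iterations;
      if ((*it)->replaceable()) return *it;
    }
    return nullptr;
  }

  // Proposed behaviour: O(1) lookup at one end of the clean list.
  Block* get_free_block_clean() const {
    return clean.empty() ? nullptr : clean.back();
  }
};
```

With a pool of 1,000 blocks of which only one is replaceable, the scan visits nearly every block while the clean-list lookup inspects exactly one.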
| Comment by Marko Mäkelä [ 2020-08-04 ]
|
axel repeated this problem with a few dozen threads. At a larger number of connections, I can imagine that it would become even more disastrous. The fact that buf_LRU_get_free_block() is holding buf_pool.mutex during the O(n) scan is not only blocking other threads that are trying to use the buffer pool, but it is also preventing other threads from helping with the situation, because each of the following will require buf_pool.mutex:

Another run with a significantly larger number of concurrent connections showed buf_flush_single_page_from_LRU() as a CPU hotspot. I think that it too could be explained by contention on buf_pool.mutex between multiple threads executing buf_LRU_get_free_block(), but we cannot confirm this before a fix is available. The single-page flushing is something that we would like to remove. I suspect that multiple buffer pool instances simply worked around this problem by allowing multiple page cleaner threads to work concurrently on other buffer pool instances while a lengthy scan was blocking one buffer pool instance.
| Comment by Marko Mäkelä [ 2020-08-05 ]
|
I noticed a more obvious regression that I filed separately. I started working on a fix that basically replaces the debug-only list buf_pool.zip_clean with a buf_pool.clean list, which would be a subset of the buf_pool.LRU list. In this way, the LRU replacement algorithm can scan a shorter list of clean pages instead of a list of all pages.

Even if a page is clean, it cannot be evicted if it is currently buffer-fixed. So, we cannot avoid an O(n) scan completely, but we can make the n smaller and, in a way, move a condition outside the iteration loop.

I still have to solve one problem: when a clean page needs to be inserted into the middle of the buf_pool.LRU list, we must find the corresponding insert position in the buf_pool.clean list, and this would require an O(n) scan while holding buf_pool.mutex. To make matters worse, this scan would be executed during normal operation, even when no page eviction is needed. We might perhaps order buf_pool.clean in a different way, while still offering reasonable guarantees that a more recently used clean page would not be evicted before a less recently used dirty page.
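The insert-position problem described above can be illustrated with a toy model (hypothetical names, not InnoDB code): if buf_pool.clean must stay ordered consistently with buf_pool.LRU, then every insertion of a newly cleaned page costs a linear scan, even when no eviction is happening.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// A page's position in the LRU ordering; smaller = more recently used.
struct Page { int id; int lru_rank; };

// buf_pool.clean modelled as a list kept sorted by LRU rank. Finding the
// insert position for a newly cleaned page is O(n) in the length of the
// clean list, and this walk would happen under buf_pool.mutex during
// normal operation, not only during eviction.
std::list<Page*>::iterator clean_insert_pos(std::list<Page*>& clean,
                                            const Page* p,
                                            std::size_t* steps) {
  *steps = 0;
  auto it = clean.begin();
  while (it != clean.end() && (*it)->lru_rank < p->lru_rank) {
    ++it;
    ++*steps;
  }
  return it;
}
```

The alternative orderings mentioned in the comment would aim to make this insertion cheap (ideally at a list end) while still approximating LRU order among clean pages.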
| Comment by Marko Mäkelä [ 2020-08-07 ]
|
I think that we must consider the following changes:
I now think that the following options are a dead end, because they could significantly slow down the server in the case when the buffer pool is large enough to hold the entire database:
We have frequent operations that would be affected by such changes:
| Comment by Marko Mäkelä [ 2020-08-19 ]
|
So far, I have done the following:
I am experimenting with one further idea:
I hope that this will sufficiently address the regression, so that it would not be necessary to refine buf_pool.LRU in a GA release.
| Comment by Marko Mäkelä [ 2020-08-26 ]
|
If we cannot find a simple change in 10.5 that can fix the performance regression (some trivial bug like …)
| Comment by Bernardo Perez [ 2020-09-09 ]
|
Hello, I am following this report with great interest. I am not sure whether it is possible for you to answer this at this stage, but do you think the fix for this issue will make it into the next release, 10.5.6? If so, is there an estimate of when 10.5.6 could be ready? Thanks in advance.
| Comment by Sergei Golubchik [ 2020-09-09 ]
|
The planned release road map is on the default Jira dashboard; you should see it at https://jira.mariadb.org unless you've created a custom dashboard in your account. Currently the road map says 2020-10-30 for 10.5.6. Yes, the fix for this issue will hopefully be in it.
| Comment by Marko Mäkelä [ 2020-09-10 ]
|
My development branch simplifies the page flushing a lot by removing single-page flushing and by removing many waits from the page cleaner thread. The page cleaner thread will only perform one type of page flushing.

Today I resumed development and fixed a long-standing problem: LRU page eviction would too eagerly evict more recently used pages if there were unwritten changes for less recently used pages. With today’s changes, the performance regression in one type of workload seemed to go away. The changes will have to be tested extensively, both for performance and correctness, in various types of workloads.

There are a few more things that I think could be improved in this area. If performance turns out to be acceptable with the current changes, we might minimize the risk and implement the remaining ideas in the 10.6 release series.
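The eviction-order fix mentioned above can be sketched as follows (hypothetical names; the real InnoDB logic is more involved): when the page at the LRU tail is dirty, schedule it for writing instead of skipping past it to evict a more recently used clean page.

```cpp
#include <cassert>
#include <list>
#include <vector>

struct Page { int id; bool dirty; };

// Returns the evicted page, or nullptr if the tail page was dirty and a
// write was scheduled instead. The dirty page can be evicted on a later
// pass, so less recently used pages leave the pool first.
Page* evict_from_tail(std::list<Page*>& lru, std::vector<Page*>* write_queue) {
  if (lru.empty()) return nullptr;
  Page* victim = lru.back();
  if (victim->dirty) {
    write_queue->push_back(victim);  // initiate the flush, do not skip ahead
    return nullptr;
  }
  lru.pop_back();                    // clean: evict immediately
  return victim;
}
```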
| Comment by Marko Mäkelä [ 2020-09-14 ]
|
I have now simplified the page cleaner thread to a single loop that performs a single type of page flushing. ‘LRU flushing batches’ are only initiated by user threads when the buffer pool is running out of free blocks.

I am wondering whether there is any real use case for setting the parameter innodb_io_capacity. The page cleaner could be simplified further if we could remove the timed wait and the call to srv_check_activity(), and just keep flushing pages basically at full capacity at all times. If this is acceptable, the page cleaner could be triggered by signaling its condition variable whenever innodb_max_dirty_pages_pct is exceeded. That parameter (as well as innodb_log_file_size) could still be used to control the eagerness of page flushing.
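A minimal sketch (hypothetical names, not the actual InnoDB implementation) of the event-driven page cleaner idea above: the thread sleeps on a condition variable and is woken by user threads whenever the dirty-page percentage exceeds the threshold, instead of waking on a timer.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

struct PageCleaner {
  std::mutex m;
  std::condition_variable cv;
  double dirty_pct = 0.0;
  double max_dirty_pct = 75.0;     // stand-in for innodb_max_dirty_pages_pct
  std::atomic<int> batches{0};
  bool stop = false;

  // The page cleaner thread: no timed wait, no srv_check_activity() poll.
  void run() {
    std::unique_lock<std::mutex> lk(m);
    for (;;) {
      cv.wait(lk, [this] { return stop || dirty_pct > max_dirty_pct; });
      if (stop) return;
      dirty_pct = 0.0;             // model one flush batch cleaning the pool
      ++batches;
    }
  }

  // Called by user threads after dirtying pages.
  void note_dirty(double pct) {
    { std::lock_guard<std::mutex> lk(m); dirty_pct = pct; }
    if (pct > max_dirty_pct) cv.notify_one();
  }

  void shutdown() {
    { std::lock_guard<std::mutex> lk(m); stop = true; }
    cv.notify_one();
  }
};
```

The predicate form of wait() avoids lost wakeups: if the threshold is already exceeded when the cleaner (re-)checks, it flushes immediately without needing another signal.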
| Comment by Marko Mäkelä [ 2020-09-15 ]
|
The parameters innodb_io_capacity (and innodb_io_capacity_max) are observed by the following subsystems:
Encryption (actually, page-dirtying) threads are controlled by a separate parameter, innodb_encryption_rotation_iops. If we make the page flushing purely event-driven and remove the ‘wake up at least once per second’ logic, we will have to detach the flushing from innodb_io_capacity. But I think that we must preserve the parameter innodb_io_capacity to control the loading of buffer pool dumps. We should probably deprecate and ignore innodb_io_capacity_max.
| Comment by Rick Pizzi [ 2020-09-15 ]
|
If innodb_io_capacity will only control buffer pool dump loading, then please let's rename it accordingly.
| Comment by Marko Mäkelä [ 2020-09-15 ]
|
I agree that overloading configuration parameters for multiple unrelated purposes is a bad practice. I missed one more configuration parameter that we have: innodb_adaptive_flushing is enabled by default and documented as follows:
I think that with innodb_adaptive_flushing=ON, it makes sense for the page cleaner to obey the innodb_io_capacity limit. My revised proposal is as follows:
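Under a proposal like the one above, the adaptive page cleaner would pick a flushing rate but never exceed innodb_io_capacity. The formula below is purely illustrative (a linear ramp on the dirty-page percentage), not InnoDB's actual heuristic:

```cpp
#include <algorithm>
#include <cassert>

// All rates are in pages per second. dirty_pct / max_dirty_pct scales the
// target linearly; the configured io_capacity acts as a hard cap.
unsigned long adaptive_flush_rate(double dirty_pct, double max_dirty_pct,
                                  unsigned long io_capacity) {
  if (dirty_pct <= 0 || max_dirty_pct <= 0) return 0;
  double target = io_capacity * (dirty_pct / max_dirty_pct);
  return static_cast<unsigned long>(std::min<double>(target, io_capacity));
}
```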
| Comment by Vladislav Vaintroub [ 2020-09-15 ]
|
There is an ambiguity in this proposal: with innodb_adaptive_flushing=ON, when innodb_max_dirty_pages_pct_lwm is exceeded, limiting the amount of work by innodb_io_capacity might not be enough.
| Comment by Marko Mäkelä [ 2020-09-18 ]
|
I filed MDEV-23756 for the experimental idea regarding innodb_adaptive_flushing=OFF. I do not think that it is necessary for the 10.5 GA release.
| Comment by Marko Mäkelä [ 2020-09-25 ]
|
axel pointed out that innodb_buffer_pool_pages_flushed appears to be less than the actual number of writes, that is, buf_flush_stats() is not being invoked in every code path. That should be fixed as part of this ticket.
| Comment by Marko Mäkelä [ 2020-10-02 ]
|
wlad, please review the squashed commit. axel and krunalbauskar, please test the performance. I think that we must also deal with one remaining issue. mleich, please run the wide battery of stress tests. In previous tests, more than a week ago, some corruption or crashes on crash recovery occurred. I believe that the problem may have been fixed since then.
| Comment by Bernardo Perez [ 2020-10-15 ]
|
Hello, I can see that this JIRA issue has moved to "stalled". I was wondering whether it would be possible to get an understanding of the current state of the fix and a target minor version where this fix will arrive. Thanks in advance.
| Comment by Marko Mäkelä [ 2020-10-15 ]
|
This scenario (a write-heavy workload that does not fit in the buffer pool) was addressed by rewriting most of the page cleaner thread and page flushing, simplifying related data structures, and reducing mutex operations. LRU flushing will now only be initiated by user threads, and the page cleaner thread will perform solely checkpoint-related flushing. There is no single-page flushing anymore, and the page cleaner will not wait for log writes or page latches. Performance will be improved further in follow-up tickets.
| Comment by Yap Sok Ann [ 2021-03-08 ]
This seems like a very nice design, but I have some concerns about how it was done previously, and how it is still being done in the latest MySQL/Percona:
As the block mutex is not held, does it mean that in between step 1 and step 2, some mini-transaction (mtr) can still modify the page with a newer LSN? If that is the case, a crash after step 2 would mean that the data files are now ahead of the redo log. What would be the consequences of that? Sorry if this is a noob question. I am rather interested in InnoDB page flushing performance, and after trying to understand the code a little (still stuck with PXC 5.6 here), I am really curious what the point of step 1 is if it cannot guarantee anything.
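The question above touches the write-ahead-logging rule: a data page may only be written to the data file once the redo log is durable up to that page's newest modification LSN, so a crash can never leave the data files ahead of the log. A minimal sketch (hypothetical names, not InnoDB code) of the invariant that a page writer must enforce:

```cpp
#include <cassert>
#include <cstdint>

using lsn_t = std::uint64_t;

struct Log {
  lsn_t flushed_to = 0;                   // redo log is durable up to here
  void write_up_to(lsn_t lsn) { if (lsn > flushed_to) flushed_to = lsn; }
};

struct Page {
  lsn_t newest_modification = 0;          // LSN of the page's latest change
  bool on_disk = false;
};

// Write one data page, enforcing the WAL rule first: make the redo log
// durable at least up to the page's newest modification LSN.
void flush_page(Log& log, Page& p) {
  if (p.newest_modification > log.flushed_to)
    log.write_up_to(p.newest_modification);
  assert(p.newest_modification <= log.flushed_to);  // the WAL invariant
  p.on_disk = true;                       // the page write is now safe
}
```

In InnoDB the page's newest LSN is re-read under the page latch immediately before the write, which is why a later modification by another mini-transaction does not break the invariant: it would simply raise the LSN the log must be flushed to.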
| Comment by Marko Mäkelä [ 2021-08-17 ]
|
sayap, sorry, I did not notice your comment. Generally, https://mariadb.zulipchat.com/ would be a better platform for such discussions. The block mutex was removed already some time ago.