[MDEV-27466] Deadlock when purge thread races the flush thread for page lock under InnoDB redo log space pressure Created: 2022-01-11 Updated: 2022-03-24 Resolved: 2022-03-15
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.4.20 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Sasha Pachev | Assignee: | Marko Mäkelä |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | hang, not-10.5 |
| Attachments: | |
| Issue Links: | |
| Description |
|
This is not a full bug report yet, though I do have a reproducible scenario on a test server, and I am working on a smaller test case. To cause the problem, we do this: set innodb_log_file_size to 4M to quickly simulate a redo log space shortage. This time we run into the deadlock timeout, and from the gdb trace we can see this race:
The flush thread wants a page lock that is being held by another thread.
So apparently the purge thread is waiting for log space to become available, but that will never happen, because it is waiting while holding the lock on the very page that the flush thread needs to flush in order to make that space available. We are working on isolating this into something manageable that we can post, but since we are aware of this bug, I figured I would open an issue and hopefully get some tips on the code in question. In particular, I would like to know how to get a call in trx_undo_truncate_start() to do something meaningful - I can get a call to happen, but it returns immediately because the limit argument is 0. |
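The two-thread cycle described above can be modeled as a tiny wait-for graph (illustrative Python, not InnoDB code; the thread names are just labels for the flush and purge threads):

```python
# Hypothetical model of the reported hang: each thread waits on a resource
# that only the other thread can release, forming a cycle.
waits_for = {
    "flush_thread": "purge_thread",  # wants the page latch that purge holds
    "purge_thread": "flush_thread",  # waits for log space only flushing can free
}

def find_cycle(graph, start):
    """Follow wait-for edges from `start`; return the cycle if a node repeats."""
    seen, node = [], start
    while node in graph and node not in seen:
        seen.append(node)
        node = graph[node]
    return seen + [node] if node in seen else None

print(find_cycle(waits_for, "flush_thread"))
# → ['flush_thread', 'purge_thread', 'flush_thread']
```

Neither thread can make progress, which matches the observed deadlock timeout rather than a crash.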
| Comments |
| Comment by Marko Mäkelä [ 2022-01-11 ] |
|
Thank you for the report! I think that this problem has been fixed in MariaDB Server 10.5, where there is only a single buf_flush_page_cleaner() thread, and it executes only checkpoint flushing. Furthermore, only that thread updates the checkpoint, after each page flushing batch. In earlier versions, there are multiple page cleaner threads, and they do not update checkpoints themselves. A problem with the old style of flushing is that it blocks while waiting for a page latch; I believe that this is also the cause of the related error messages. You could also be affected by a related issue. In 10.5 or later, such checkpoint hangs should not be possible. I recently ran some tests in this area. It would be very interesting to know if you can repeat such a hang on 10.5 or 10.6. |
| Comment by Sasha Pachev [ 2022-01-11 ] |
|
Marko - I was able to reproduce the problem on 10.4.20 with innodb_flush_neighbors set to OFF, producing the same trace. I did a double-take when I saw the call to buf_flush_neighbors() in the trace, but upon checking the code, I saw that buf_flush_neighbors() still runs even if innodb_flush_neighbors is off; it just does less work. While I have a good amount of confidence that the fixes in 10.5 and 10.6 will likely avoid this problem, it is a priority for us to make 10.4 stable against this type of race with the smallest and safest code change possible. So I will continue to work on a simple reproducible test. Do you have any suggestions on how to make trx_undo_truncate_start() get called with a non-zero value for the limit argument? I am also able to try custom patches, as well as examine the variables in gdb during the deadlock if you need me to. |
| Comment by Marko Mäkelä [ 2022-01-12 ] |
|
spachev, could you provide a definite answer on whether the hang is reproducible on the latest 10.5 or 10.6? To answer your question: I think that you will want to set the following, in order to make undo tablespace truncation occur more frequently.
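The settings block did not survive this export. A plausible reconstruction, based on the surrounding discussion (the parameter names exist in 10.4, but the exact values are an assumption, not the original comment):

```ini
# Assumed reconstruction; only the parameter names are known 10.4 options.
[mysqld]
innodb_undo_tablespaces=4                 # innodb_undo_log_truncate needs more than 2
innodb_undo_log_truncate=ON
innodb_purge_rseg_truncate_frequency=1    # truncate rollback segments as often as possible
innodb_max_undo_log_size=10M              # low threshold so truncation triggers early
```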
That is what I am using in tests that wait for everything to be purged. There is also a parameter innodb_purge_batch_size that could play a role here. You might also want to check the tests innodb.undo_truncate and innodb.undo_truncate_recover. The entry point for innodb_undo_log_truncate=ON (which is effective with innodb_undo_tablespaces>2) is trx_purge_truncate_history(). It invokes a dedicated function mtr_t::commit_shrink() for shrinking the undo tablespace. The algorithm is:
This allows us to rebuild undo tablespaces by using just the redo log. Last time I checked, MySQL starting with 5.7 uses a separate log file for truncating undo tablespaces, and therefore it has to execute a log checkpoint (write out all dirty pages from the buffer pool) on the main redo log. One thing that got better in 10.4 compared to 10.2 and 10.3 is that we have an MLOG_MEMSET record, which helps reduce the amount of redo log that needs to be written for reinitializing the undo tablespace. I did not expect setting innodb_flush_neighbors=OFF to make this hang go away; I only thought that it could reduce the probability. A possible fix could be something like this:
Note: I did not check if this compiles or behaves correctly. Even if it passed correctness tests, it might still result in a performance regression. With the multiple buffer pool instances and page cleaner threads that we have before 10.5, things can behave in a chaotic fashion, and it is difficult to predict the performance impact. |
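The diff itself was not preserved in this export, but the class of fix under discussion - having the flusher attempt a page latch without blocking, and skip the page on failure - can be illustrated abstractly (Python sketch; none of these names are InnoDB identifiers):

```python
import threading

# Illustrative only: a flusher that skips pages whose latch it cannot
# acquire immediately, instead of blocking (which could deadlock).
def flush_list(pages):
    """Try-latch each (page_id, latch) pair; flush what we can, skip the rest."""
    flushed, skipped = [], []
    for page_id, latch in pages:
        if latch.acquire(blocking=False):  # try-latch instead of a blocking wait
            try:
                flushed.append(page_id)    # stand-in for writing the page out
            finally:
                latch.release()
        else:
            skipped.append(page_id)        # leave it for a later batch

    return flushed, skipped

# Simulate one page whose latch is held by another thread (e.g. purge):
held = threading.Lock()
held.acquire()
pages = [(1, threading.Lock()), (2, held), (3, threading.Lock())]
print(flush_list(pages))
# → ([1, 3], [2])
```

The skipped page remains dirty, so a later batch (or the latch holder itself) must still get it written; that is the performance trade-off the note above warns about.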
| Comment by Sasha Pachev [ 2022-01-19 ] |
|
I can confirm that the issue does not happen on 10.6. I had to play with the size of the buffer pool to make it a fair comparison. With a 512 MB buffer pool, 10.4 was deadlocking, but 10.6 would give me "The total number of locks exceeds the lock table size". So I had to experimentally find a size that was big enough for 10.6 to do something different, but small enough for 10.4 to deadlock. I achieved the desired result at 805306368 bytes (768 MB), at which 10.6 was able to complete the test successfully while 10.4 deadlocked. It may be significant that the column being updated was added by the INSTANT ALTER TABLE algorithm; at least we have not been able to reproduce the problem on other columns. |
| Comment by Sasha Pachev [ 2022-01-21 ] |
|
Quick update. With the patch, the deadlock does not happen; the query completes with an error. However, it takes a very long time. There is a period of what appears to be flush starvation: the update sits in buf_flush_wait_flushed() for a very long time, and there is no flushing activity for a while. However, the situation eventually resolves itself. I still have not fully understood what is happening, and will report back once I have, but I thought I would post what I have so the MariaDB team has something to work with. |
| Comment by Sasha Pachev [ 2022-01-21 ] |
|
Another quick update. During the flush stall phase mentioned earlier, I looked at buf_flush_lists() called from buf_flush_page_cleaner_coordinator() and saw a pattern of trickle flushing: 2-4 pages followed by 0, in an alternating pattern. A result of 0 causes a 10-second wait before the next flush, because the flush thread thinks there is nothing to do. But in the meantime we are running short on log space, so the update execution as well as the purge thread are waiting for the log space to become available. |
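As an abstract model of that observation (illustrative Python; only the 2-4/0 alternation and the 10-second idle wait come from the comment above):

```python
# Model of the trickle-flush stall: every zero-page batch makes the cleaner
# sleep for `idle_wait` seconds instead of reacting to log-space pressure.
def cleaner_idle_seconds(batch_sizes, idle_wait=10):
    """Total seconds the cleaner spends idle for the given batch results."""
    return sum(idle_wait for n in batch_sizes if n == 0)

# Alternating 2-4 pages and 0 pages, as observed during the stall:
observed = [3, 0, 2, 0, 4, 0]
print(cleaner_idle_seconds(observed))
# → 30
```

With this pattern, most wall-clock time is spent sleeping, even though the log is nearly full and waiters are queued behind it.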
| Comment by Marko Mäkelä [ 2022-01-21 ] |
|
spachev, thank you. Possibly somewhat related to this, instant ADD COLUMN could play a role here. In 10.6 and later, the buf_pool.mutex and log_sys.mutex remain the main bottlenecks in write-heavy workloads. In 10.5, lock_sys.mutex was also hot, but that one should not interfere with page flushing. We can run correctness tests on any patch that you would submit, but I am afraid that for tweaking the performance of 10.4, you are mostly on your own. |
| Comment by Sasha Pachev [ 2022-01-21 ] |
|
In this particular case, when the stall happens, buf_flush_LRU_list_batch() is not being called at all for a while. Then, after some period of time, maybe half an hour, it all of a sudden starts getting called; the LRU list size is 39K, so we are not dealing with the BUF_LRU_MIN_LEN restriction at all. The flushes happen, and the rows start getting updated. Will continue debugging... |
| Comment by Marko Mäkelä [ 2022-01-24 ] |
|
I no longer remember how the LRU eviction logic worked before my changes. I think that the idea of LRU eviction by the page cleaner threads was to "pre-evict" some pages, so that user threads would almost never have to wait for page writes in case all pages are dirty and something needs to be evicted to make room for another page. My changes adopted the Linux kernel's early idea of "try to retain buffer caches as long as possible", and in 10.5 only the user threads that are trying to allocate a block may initiate LRU eviction flushing (with no distinction between single-page flushing and "LRU batch" flushing). One thread runs the LRU flush; others that might concurrently try to allocate a page will wait for that batch to complete. On write completion, the pages will actually be freed. |
| Comment by Sasha Pachev [ 2022-01-25 ] |
|
Update. We are able to avoid the stall with the following patch:

I have not yet tested it, but I think the simpler variant that keeps only the one condition will work as well. I observed that trx_undo_truncate_start() could start when there is plenty of space in the log, yet while freeing the pages it manages to push the checkpoint age to the limit - just one call to it. I have not yet fully understood the mechanism, but I am guessing it has to do with taking some page locks, which results in blocking the flush activity (with Marko's patch from this issue) or deadlocking (without the patch). I am not suggesting that this patch is necessarily the correct solution or that it may not cause other problems; I am just posting it to demonstrate that reducing purge truncation activity when log space is short works around the problem. |
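The workaround patch itself is not reproduced in this export. The idea described here - hold off purge truncation while redo log space is tight - can be sketched abstractly (illustrative Python; the function name and the safety factor are assumptions, not the actual 10.4 code):

```python
# Hypothetical guard condition in the spirit of the workaround: skip undo
# truncation when the checkpoint age is close to the redo log capacity.
def should_truncate_undo(checkpoint_age, log_capacity, safety_factor=0.75):
    """Allow purge truncation only while the redo log has ample headroom."""
    return checkpoint_age < safety_factor * log_capacity

print(should_truncate_undo(3 << 20, 8 << 20))  # plenty of space → True
print(should_truncate_undo(7 << 20, 8 << 20))  # log nearly full → False
```

Deferring truncation this way trades slower history cleanup for never entering the page-latch/log-space cycle while the log is under pressure.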
| Comment by Marko Mäkelä [ 2022-02-11 ] |
|
spachev, thank you for the update. This reminds me of https://github.com/MariaDB/server/pull/1889. Can you submit a pull request for this? |
| Comment by Sasha Pachev [ 2022-02-17 ] |
| Comment by Marko Mäkelä [ 2022-02-17 ] |
|
Thank you! Can you please rebase that on the latest 10.4? |
| Comment by Sasha Pachev [ 2022-02-17 ] |
|
marko - done; I also added the safety explanation comment as you requested. |
| Comment by VAROQUI Stephane [ 2022-02-18 ] |
|
Can 10.3 be affected as well? I hit a similar long semaphore wait (600 s) on a read-only replica with conservative, multi-source GTID replication. Unfortunately the stack trace did not get written to disk, but from monitoring we get: |
| Comment by Sasha Pachev [ 2022-02-23 ] |
|
stephane@skysql.com - I think your case is different. From the status we can see that the difference between "Log sequence number" and "Last checkpoint at" is only a bit over 5 MB. Even if the log is very small, the smallest log capacity (2 logs, 4 MB each) is 8 MB, so even at that minimum you would still have about 3 MB of log space left. But it looks like something might have gone wrong due to a slow or stuck log flush, which would produce similar symptoms. |
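The log-space arithmetic in this reply can be made explicit (small sketch; the numbers are the status values quoted above, the helper itself is illustrative, not an InnoDB function):

```python
# Rough checkpoint-age arithmetic: space left = total redo capacity minus
# the distance between the current LSN and the last checkpoint LSN.
def log_space_left(lsn, last_checkpoint_at, n_files=2, file_size=4 << 20):
    """Bytes of redo log capacity not yet consumed since the last checkpoint."""
    capacity = n_files * file_size            # e.g. 2 logs of 4 MB = 8 MB minimum
    checkpoint_age = lsn - last_checkpoint_at
    return capacity - checkpoint_age

# "A bit over 5 MB" of checkpoint age against the 8 MB minimum capacity:
left = log_space_left(lsn=5 * 1024 * 1024 + 100_000, last_checkpoint_at=0)
print(round(left / (1 << 20)))
# → 3
```

So even under worst-case sizing, roughly 3 MB of headroom remains, which is why log-space exhaustion is unlikely to be the cause in that report.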
| Comment by Axel Schwenke [ 2022-03-14 ] |
|
Added performance test results for PR#2021. For sysbench-tpcc we even see increased performance when the data set fits in the buffer pool (tpcc1.pdf). Not shown: for standard sysbench workloads, no change in performance was observed. To avoid negative effects for customers who are still on 10.4, I'd prefer offering the patched MariaDB 10.4 as a custom build to interested parties. |