[MDEV-24188] Hang in buf_page_create() after reusing a previously freed page Created: 2020-11-10 Updated: 2021-04-19 Resolved: 2020-11-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2.35, 10.2.36, 10.3.26, 10.3.27, 10.4.16, 10.4.17, 10.5.7, 10.5.8 |
| Fix Version/s: | 10.2.37, 10.3.28, 10.4.18, 10.5.9 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Matthias Leich | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | hang, regression, rr-profile | ||
| Description |
|
|
| Comments |
| Comment by Matthias Leich [ 2020-11-10 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-11-10 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
To me, this looked like a deadlock between the page flush (which io-fixes the page) and buf_page_create(), which will remain in the following loop. Something else had already acquired block->lock in exclusive mode. I did not check what that was: I now see that the page latch would be acquired after the wait loop:
In any case, both threads would remain blocked. It is possible that 10.5.7 is not affected by this, thanks to various changes (the latest one being | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-11-12 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
As far as I can tell, the following happened:
The problematic wait started in Thread 33 around the same time the io-fix was set by Thread 1. The wait condition in buf_page_create() that we implemented in
I suspect that an even rarer variant of this hang might be possible. A mini-transaction that had previously freed a page might be reusing the page in buf_page_create() again. In this case, I did not find the block in mtr_t::m_memo. We do have the constraint that a mini-transaction must not acquire further page latches after allocating a page. That constraint could apply to freeing pages as well, but I did not check that yet. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-11-12 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
I made a mistake in my analysis and was looking at the wrong block. The correct block indeed is one that the mini-transaction had previously x-latched. The mtr->m_memo.m_first_block.m_data starts with the following:
The block of interest is 0x5f9209096860, and the 0x2 is MTR_MEMO_PAGE_X_FIX. The page had been freed in our mini-transaction. We seem to have at least B-tree level 2 here (and innodb_page_size=4k to help with that).
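To make the memo check concrete, here is a minimal sketch of how one could test whether a block is already recorded as X-latched. It assumes a flat list of (object, type) pairs. The `memo_slot` struct, the `memo_has_x_fix()` and `sample_memo()` helpers, and the first sample entry are hypothetical and do not reflect the real `mtr_t::m_memo` layout. Only the address 0x5f9209096860 and the value 0x2 for MTR_MEMO_PAGE_X_FIX come from the dump described above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified stand-in for one memo slot; NOT the real InnoDB layout.
struct memo_slot {
  std::uint64_t object; // address of the latched object, as printed by the debugger
  std::uint32_t type;   // latch type flag recorded for that object
};

// From the comment above: 0x2 is MTR_MEMO_PAGE_X_FIX.
constexpr std::uint32_t MEMO_PAGE_X_FIX = 0x2;

// Returns true if the memo records an exclusive page latch on `block`.
bool memo_has_x_fix(const std::vector<memo_slot> &memo, std::uint64_t block) {
  assert(block != 0); // a null address is never a valid lookup key
  for (const memo_slot &slot : memo)
    if (slot.object == block && slot.type == MEMO_PAGE_X_FIX)
      return true;
  return false;
}

// Hypothetical memo contents: only the second entry comes from the report;
// the first address and its flag value are made up for illustration.
std::vector<memo_slot> sample_memo() {
  return {{0x5f9209000000ULL, 0x1}, {0x5f9209096860ULL, MEMO_PAGE_X_FIX}};
}
```

Calling `memo_has_x_fix(sample_memo(), 0x5f9209096860ULL)` reports that the block of interest is held in exclusive mode by the same mini-transaction.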
We had X-latched the page much earlier. I think that we must rewrite the
A little later, that thread would block, waiting for SX-latch on the block, while our buf_page_create() is waiting for the io-fix to be released. Hence, it is a deadlock (or livelock). | ||||||||||||||||||||||||||||||||||||||||||||
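The circular wait described in this comment can be sketched as a small wait-for model. This is only an illustration under stated assumptions: the `wait_state` struct, the `has_circular_wait()` and `hang_scenario()` helpers, and the resource and thread labels are hypothetical, and none of this is InnoDB code. It encodes that buf_page_create() holds the page latch while waiting for the io-fix to be released, and that the flushing thread holds the io-fix while waiting for the latch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical model: who owns each resource, and what each blocked thread waits for.
struct wait_state {
  std::map<std::string, std::string> owner;     // resource -> owning thread
  std::map<std::string, std::string> waits_for; // thread -> resource it is blocked on
};

// Follows thread -> awaited resource -> owner until the chain ends
// or reaches a thread that was already visited.
bool has_circular_wait(const wait_state &s, const std::string &start) {
  assert(!start.empty());
  std::set<std::string> seen;
  std::string thread = start;
  while (seen.insert(thread).second) {
    auto w = s.waits_for.find(thread);
    if (w == s.waits_for.end()) return false; // this thread is not blocked
    auto o = s.owner.find(w->second);
    if (o == s.owner.end()) return false;     // the awaited resource is free
    thread = o->second;
  }
  return true; // a thread was reached twice: the waits form a cycle
}

// The situation from the comment above, reduced to two threads and two resources.
wait_state hang_scenario() {
  wait_state s;
  s.owner["page latch"] = "buf_page_create"; // X-latched much earlier
  s.owner["io-fix"] = "page flush";          // set when the write was initiated
  s.waits_for["buf_page_create"] = "io-fix";
  s.waits_for["page flush"] = "page latch";  // blocked on the SX-latch
  return s;
}
```

In this model, `has_circular_wait(hang_scenario(), "buf_page_create")` is true, and it becomes false as soon as either of the two waits is removed.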
| Comment by Marko Mäkelä [ 2020-11-12 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
The wait loop was originally added in
Edit: it looks like
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2020-11-13 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-11-13 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
I pushed this to 10.2 and merged up to 10.5 immediately. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Olaf Buitelaar [ 2020-11-16 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
I seem to suffer from the same issue;
Any chance a release might be pushed forward to address this? Also, is there a configuration option to disable the forced shutdown after 10 min? | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-11-16 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
olafbuitelaar, thank you for the report. You can use
to have the server abort in 5 minutes (300 seconds) instead of the default timeout. I do not think that it is useful to let the server continue in a livelocked state, because service to some connections will be denied, depending on which latches the hung buf_page_create() threads are holding. Eventually, all I/O threads would be blocked and nothing could be accessed. If the hung buf_page_create() is executed as part of a CREATE TABLE operation, then all other InnoDB threads will be blocked, waiting for the data dictionary latch.
Today, we double-checked that the wait loop that was originally added in
The hang was caused because in
The probability of this hang can be reduced by configuring some parameters related to page flushing, but I do not think that it can be prevented completely. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Olaf Buitelaar [ 2020-11-16 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Thank you for your reply. If I can provide more information, please let me know. I'll try to tweak the parameters related to page flushing. We use ```create table``` regularly to create temporary tables. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-11-17 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
olafbuitelaar, please be aware that due to
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-11-17 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
The scenario of this hang is that a page had been freed, a page write (or thanks to |