[MDEV-32909] Hang not caught by semaphore wait threshold Created: 2023-11-29 Updated: 2023-11-30 Resolved: 2023-11-30 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.11.6 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Xan Charbonnet | Assignee: | Marko Mäkelä |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | hang | ||
| Environment: |
Debian 11 Bullseye; MariaDB from the Maria repositories. |
||
| Issue Links: |
|
||||||||
| Description |
|
Hello,

Last night right after midnight one of my servers seemed to hang. All (or at least a great many?) queries would hang forever in the Execute phase. Of course this hung up the entire Galera cluster as well. I was awakened by an alert around 12:40am.

My innodb_fatal_semaphore_wait_threshold is set to 32 seconds, so this hang was not caught by that watchdog. At 12:49 I was able to send MariaDB a signal which caused it to crash and dump core.

So I do have a stack trace for this situation which I don't want to post publicly, but which is available upon request. I also have SHOW PROCESSLIST logs for much of the time in case that helps.

If a MariaDB expert could take a look at the situation I would appreciate it! Thanks. |
| Comments |
| Comment by Marko Mäkelä [ 2023-11-29 ] |
|
Just today, while working on additional instrumentation for You have already filed |
| Comment by Xan Charbonnet [ 2023-11-29 ] |
|
Marko, the only reason I thought this was different was because it didn't get caught by the watchdog. I do have the full stack trace which I'll email you. It sounds like this is likely to be the same thing, just with a shared latch rather than an exclusive latch. If this is the same thing, then maybe this bug should be converted to making the dict_sys.latch watchdog cover both kinds of latches? Thank you! |
| Comment by Marko Mäkelä [ 2023-11-30 ] |
|
xan@biblionix.com, thank you for the stack traces that you provided via email. I can confirm that this hang is a duplicate of
The same buf_pool.page_hash latch is being waited for when a corrupted page is being evicted from the buffer pool on read completion:
The watchdog on dict_sys.latch is not catching this hang, because we do not have any threads waiting for it, either in shared or exclusive mode. For a shared-mode wait, if there was a conflicting request (a thread is already holding an exclusive lock or waiting for one), I think that we would escalate to exclusive locking during the wait. This should employ the watchdog. The old watchdog, which covered most of InnoDB, was removed in |
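To illustrate why a watchdog tied to one latch misses this kind of hang, here is a minimal C++ sketch. It is not MariaDB code: `WatchedWait`, `begin_wait`, `end_wait`, and `watchdog_should_fire` are hypothetical names, and the time is passed in explicitly to keep the example deterministic. The point is only that the check can fire while some thread is registered as waiting on the watched latch; a thread stuck on a different latch never registers a wait here.

```cpp
#include <cassert>
#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical bookkeeping for ONE latch that a watchdog observes.
struct WatchedWait {
  bool waiting = false;          // true while some thread waits on this latch
  Clock::time_point wait_start;  // when the current wait began

  void begin_wait(Clock::time_point now) {
    if (!waiting) {
      waiting = true;
      wait_start = now;
    }
  }
  void end_wait() { waiting = false; }
};

// True if a wait on this latch has lasted at least `threshold` at time `now`.
inline bool watchdog_should_fire(const WatchedWait &w, Clock::time_point now,
                                 std::chrono::seconds threshold) {
  assert(threshold.count() > 0);
  return w.waiting && now - w.wait_start >= threshold;
}
```

With a 32-second threshold, a server that hangs for 50 minutes on some other latch would never make `watchdog_should_fire` return true for this one, because `begin_wait` was never called on it.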
| Comment by Marko Mäkelä [ 2023-11-30 ] |
|
I see that dict_sys.freeze() invokes srw_lock::rd_lock(), which in the futex-based implementation would temporarily escalate the lock in rd_wait(). These waits are not covered by the watchdog that is implemented in dict_sys_t::lock_wait(). It is not trivial to change this to use the watchdog, because we have multiple different implementations here: the futex-based one on Linux and various BSDs (

When it comes to the root cause of this corruption, I am a bit puzzled. Based on MDEV-32115 it seems that the default wsrep_sst_method=rsync should work reliably on 10.5 and later releases. Perhaps this would better be explained by MDEV-32174, which is a bug in ROLLBACK on ROW_FORMAT=COMPRESSED tables. But I do not yet see how it could make individual pages look corrupted when they are being read into the buffer pool. |
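As a rough illustration of the idea raised in this thread, that shared-mode waits could go through the same watchdog as exclusive-mode waits, here is a self-contained C++ sketch. It is an assumption-laden toy, not InnoDB's srw_lock: `WatchedLatch`, `wait_with_watchdog`, and the `WRITER` flag are hypothetical, it polls with a short sleep instead of using futexes, and it does not model the temporary escalation done in rd_wait(). It only shows one wait loop with a deadline serving both lock modes.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical latch state: 0 = free, WRITER = exclusively held,
// any other value = number of shared holders.
constexpr std::uint32_t WRITER = 1U << 31;

class WatchedLatch {
  std::atomic<std::uint32_t> state{0};

  bool try_wr_lock() {
    std::uint32_t expected = 0;
    return state.compare_exchange_strong(expected, WRITER,
                                         std::memory_order_acquire,
                                         std::memory_order_relaxed);
  }

  bool try_rd_lock() {
    std::uint32_t s = state.load(std::memory_order_relaxed);
    while (!(s & WRITER)) {
      // On failure, compare_exchange_weak reloads s, so the loop re-checks it.
      if (state.compare_exchange_weak(s, s + 1, std::memory_order_acquire,
                                      std::memory_order_relaxed))
        return true;
    }
    return false;
  }

  // One polling loop with a deadline; both lock modes go through it.
  template <typename TryLock>
  bool wait_with_watchdog(TryLock try_lock,
                          std::chrono::milliseconds threshold) {
    const auto deadline = std::chrono::steady_clock::now() + threshold;
    while (!try_lock()) {
      if (std::chrono::steady_clock::now() >= deadline)
        return false;  // a real watchdog would report a fatal hang here
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
  }

public:
  // Both return false if the wait exceeded `threshold`.
  bool rd_lock(std::chrono::milliseconds threshold) {
    return wait_with_watchdog([this] { return try_rd_lock(); }, threshold);
  }
  bool wr_lock(std::chrono::milliseconds threshold) {
    return wait_with_watchdog([this] { return try_wr_lock(); }, threshold);
  }
  void rd_unlock() {
    const std::uint32_t prev = state.fetch_sub(1, std::memory_order_release);
    assert(prev != 0 && !(prev & WRITER));
    (void) prev;
  }
  void wr_unlock() { state.store(0, std::memory_order_release); }
};
```

In this toy, a reader blocked behind an exclusive holder times out the same way a writer blocked behind readers does. Even so, such a watchdog would still only notice waits on this particular latch, not a hang on another one such as buf_pool.page_hash.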