[MDEV-24911] Missing warning before [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.mutex Created: 2021-02-18 Updated: 2023-12-02
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6 |
| Fix Version/s: | 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Roel Van de Paar | Assignee: | Marko Mäkelä |
| Resolution: | Unresolved | Votes: | 3 |
| Labels: | debug |
| Description |
This reduced but unfinished test case:

has at least once led to:

However, the bug is extremely hard to reproduce.
| Comments |
| Comment by Roel Van de Paar [ 2021-03-09 ] |

See
| Comment by Marko Mäkelä [ 2021-03-22 ] |

Please change the title to be more descriptive. By design, ib::fatal::~fatal() will abort execution. What is interesting is why the code was invoked. (It was invoked by the watchdog that checks for a long wait time on dict_sys.mutex.) That was changed in

Always, for any crash, please include any preceding error log messages. For any hang (anything that trips the watchdog is supposed to be a hang), please include a stack trace of all threads. The stack trace of the watchdog task itself is not interesting at all. What we would need is stack traces of the threads that participate in the long lock wait. Maybe something is holding dict_sys.mutex and then acquiring MDL with a too long timeout. An rr replay trace of this hang would help much more than a core dump.
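As context for the missing warnings this ticket is about, here is a minimal sketch of how such a long-wait watchdog could behave, assuming early warnings at 1/4, 1/2, and 3/4 of the fatal threshold. The class name and API are invented for illustration; this is not the actual InnoDB watchdog code.

```python
# Hypothetical sketch (not the actual InnoDB source) of a lock-wait
# watchdog: it emits early warnings at fractions of the fatal
# threshold (cf. innodb_fatal_semaphore_wait_threshold) before the
# fatal abort is triggered.

class LockWaitWatchdog:
    def __init__(self, fatal_threshold, on_warn, on_fatal):
        self.fatal_threshold = fatal_threshold  # seconds, e.g. 600
        self.on_warn = on_warn                  # called at 1/4, 1/2, 3/4
        self.on_fatal = on_fatal                # called at the threshold
        self.warned = set()                     # fractions already reported

    def check(self, wait_seconds):
        """Call periodically with the longest current semaphore wait."""
        for fraction in (0.25, 0.5, 0.75):
            if (wait_seconds >= self.fatal_threshold * fraction
                    and fraction not in self.warned):
                self.warned.add(fraction)
                self.on_warn(fraction, wait_seconds)
        if wait_seconds >= self.fatal_threshold:
            self.on_fatal(wait_seconds)
            return True  # threshold exceeded; the server would abort here
        return False
```

The point of the sketch is the ordering guarantee the reporters expect: a wait long enough to trip the fatal abort must first have passed the three warning marks, so a crash without preceding warnings indicates the warnings were skipped or lost.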
| Comment by Roel Van de Paar [ 2021-03-24 ] |

The original error log is convoluted with other errors prior to the crash due to the input SQL; however, this looks relevant (the full log is also attached):
| Comment by Marko Mäkelä [ 2021-03-31 ] |

Roel, that particular message is also useless by itself. The message suggested following the advice of https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ (and including the stack traces of the hung threads in the ticket that you would file, i.e., this one).
| Comment by Marko Mäkelä [ 2021-03-31 ] |

Because the input does include innodb_change_buffering_debug=1, the cause of the hang is very likely common with

Let us keep this ticket open until we have fixed the 10.6 watchdog so that it gives an early warning before the fatal message.
| Comment by Marko Mäkelä [ 2021-10-22 ] |

Possibly related to bugs in a development version of

I copied the above snippet from a run with a broken Ubuntu Impish kernel 5.13.0-19-generic that causes io_uring related hangs (see
| Comment by Marko Mäkelä [ 2023-11-03 ] |

In

I don’t know if this is actually fixable, or worth fixing. The server is unresponsive and partly unusable. The watchdog task will finally kick in.
| Comment by Xan Charbonnet [ 2023-11-11 ] |

For what it's worth: the early warning on the server hang is really useful for tuning the innodb_fatal_semaphore_wait_threshold setting. It looks like the intention is for a warning to be printed at 1/4, 1/2, and 3/4 of the threshold. To minimize downtime as a result of a hang, I'd like to lower my threshold as low as it can go without disrupting normal operation. If I run my system for a couple of weeks and don't see any warnings, I know it should be safe to divide my innodb_fatal_semaphore_wait_threshold by 4. Once I've done that, I can keep an eye out for warnings and see if I can divide by 4 again, or how often I see any of the warnings appear.

None of that works if the threshold warnings aren't reliable. Maybe it still isn't worth fixing, but I'd at least like to explain why it's useful.
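The tuning loop described in this comment can be sketched as simple arithmetic. The helper name is hypothetical (not part of MariaDB); 600 seconds is the documented default of innodb_fatal_semaphore_wait_threshold, and the lower bound of 1 second is an assumption for the sketch.

```python
# Sketch of the tuning procedure described above: after an observation
# window with no early warnings, divide the threshold by 4; if warnings
# appeared, keep the current value. Hypothetical helper for illustration.

def next_threshold(current, warnings_seen, minimum=1):
    """Return the threshold (in seconds) to try in the next window."""
    if warnings_seen:
        return current              # already tight enough; keep it
    return max(minimum, current // 4)

# Starting from the default of 600 seconds, with quiet windows, the
# sequence of candidate thresholds would be: 600, 150, 37, 9, 2, 1.
```

This is exactly why the reliability of the warnings matters: each division step is only justified by the *absence* of warnings during the window, which is meaningless if warnings can be silently skipped.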
| Comment by Otto Kekäläinen [ 2023-12-02 ] |

Related downstream bug report in Ubuntu: https://bugs.launchpad.net/ubuntu/+source/mariadb-10.6/+bug/2008718
| Comment by Marko Mäkelä [ 2023-12-02 ] |

otto, based on the version numbers mentioned in the Ubuntu bug report, I would think that a likely explanation is