MariaDB Server / MDEV-35424

False alarm/crash: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch

Details

    Description

      Hello,

      I upgraded from 10.11.9 to 10.11.10 yesterday afternoon. Today right around noon I got "innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch" and my mariadb service restarted.

      I'm going to be uploading a stack trace to the MariaDB FTP/SFTP/WebDAV Server once I have an MDEV number so I know how to name it. If somebody could take a look I would appreciate it. Thank you!

      Attachments

        Issue Links

          Activity

            danblack, https://github.com/MariaDB/server/pull/3826 now includes a tentative fix of MDEV-34863, which would introduce a parameter innodb_buffer_pool_size_min. A single pressure event would attempt to shrink the buffer pool by 8 MiB (or on 32-bit systems by 2 MiB) unless the minimum size would be reached.

            This might not be ideal (the user would have to manually SET GLOBAL innodb_buffer_pool_size to a larger value), but this would be a more graceful response to the memory pressure events, and one that could be disabled by specifying a larger innodb_buffer_pool_size_min. I only covered it with the debug instrumented regression test innodb.mem_pressure so far. Can you please test if it would work in a reasonable way when actual memory pressure events are triggered?
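            For illustration only (the names below are made up and this is not the code in the pull request), a single pressure event under the proposed minimum would compute its shrink target roughly like this:

            #include <cstddef>

            // Sketch, not MariaDB code: one memory pressure event shrinks by a
            // fixed step, clamped so that the proposed minimum is never crossed.
            static size_t shrink_target(size_t buf_pool_cur_size, size_t buf_pool_min_size)
            {
              // 8 MiB step, or 2 MiB on 32-bit systems
              const size_t step= sizeof(void*) >= 8 ? size_t{8} << 20 : size_t{2} << 20;
              if (buf_pool_cur_size < buf_pool_min_size + step)
                return buf_pool_min_size;
              return buf_pool_cur_size - step;
            }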

            marko Marko Mäkelä added a comment

            A machine in my 10.11.11 Galera cluster had another "innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch", "got signal 6" event yesterday and dumped its core. It does not seem to have been related to a memory pressure event: there was a memory pressure event close in time, but it occurred after the shutdown had already started. The fatal event was at 2025-03-11 09:56:37 US Central, and the memory pressure event was at 09:56:41.

            I've generated a full stacktrace (hopefully correctly) and uploaded it to the "private" folder of the MariaDB SFTP server. It's called MDEV-35424_backtraces_nj-beta_2025-03-11.txt.

            xan@biblionix.com Xan Charbonnet added a comment

            Thank you. Meanwhile, I revised my proposed fix of MDEV-34863 so that a memory pressure event would attempt to shrink the buffer pool to halfway between the current size and a user-specified minimum limit, which by default would be set so that memory pressure events would not trigger any shrinking.
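            Roughly, with the same made-up names as above, the revised target would be:

            // Sketch only: shrink halfway from the current size towards the minimum.
            // If the minimum defaults to the full buffer pool size (one way of making
            // the default a no-op), target == buf_pool_cur_size and nothing shrinks.
            size_t target= buf_pool_min_size + (buf_pool_cur_size - buf_pool_min_size) / 2;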

            The stack traces look reasonable. We’re missing the debug information for libstdc++.so, but that does not really matter. The single-frame threads showing only ?? are likely io_uring worker threads, which I believe run code belonging to the Linux kernel. We can see that the buf_flush_page_cleaner() thread is in an unbounded wait, that is, there is no buffer pool flushing going on. I also see that the shutdown is trying to reset the watchdog timer:

            	srv_shutdown_state = SRV_SHUTDOWN_CLEANUP;
             
            	if (srv_buffer_pool_dump_at_shutdown &&
            		!srv_read_only_mode && srv_fast_shutdown < 2) {
            		buf_dump_start();
            	}
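            	/* Reset (disarm) the periodic monitor/watchdog timer; the final
            	buffer pool flush only starts after the loop: label further below. */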
            	srv_monitor_timer.reset();
             
            	if (do_srv_shutdown) {
            		srv_shutdown(srv_fast_shutdown == 0);
            	}
             
             
            loop:
            

            Only after the loop: label would the shutdown start the final buffer pool flush. At first I suspected some kind of glitch around the srv_monitor_timer.reset() call, but the cause seems to lie elsewhere:

            void dict_sys_t::lock_wait(SRW_LOCK_ARGS(const char *file, unsigned line)) noexcept
            {
              ulonglong now= my_hrtime_coarse().val, old= 0;
              if (latch_ex_wait_start.compare_exchange_strong
                  (old, now, std::memory_order_relaxed, std::memory_order_relaxed))
              {
                latch.wr_lock(SRW_LOCK_ARGS(file, line));
                latch_ex_wait_start.store(0, std::memory_order_relaxed);
                return;
              }
              // …
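              // Another thread already recorded an earlier wait start; keep it.
              // Note: this path never clears latch_ex_wait_start afterwards; only
              // the winner of the compare_exchange above resets it.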
              latch.wr_lock(SRW_LOCK_ARGS(file, line));
            }
            

            This function does not always reset latch_ex_wait_start to disarm the watchdog kill. If I remember correctly, my reasoning was that there could be multiple threads waiting for an exclusive dict_sys.latch, and we would want to preserve the time when the first pending wait started. The idea is that only the thread blocked in the early-return code path would reset the field. I can easily see how this could miss some hangs, but it is not obvious to me how it could produce a false alarm like the one that you reproduced.

            To fix this, we will need a better data structure that can represent the start time of each pending wait. Maybe even a wait queue, to enforce FIFO ordering of the exclusive latch waits.
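            Not a patch, only a sketch of the kind of bookkeeping meant here (all names are hypothetical): every thread waiting for an exclusive dict_sys.latch would register its own start time in FIFO order, the watchdog would inspect only the oldest pending entry, and each waiter would remove exactly its own entry after acquiring the latch, so one thread's completed wait could neither hide nor trigger an alarm for another thread's wait.

            #include <list>
            #include <mutex>

            // Hypothetical sketch, not MariaDB code: per-waiter start times for
            // exclusive dict_sys.latch waits, kept in FIFO order.
            class latch_wait_times
            {
              std::mutex m;
              std::list<unsigned long long> waits;  // wait start times, oldest first

            public:
              typedef std::list<unsigned long long>::iterator handle;

              // Register a pending wait just before blocking in latch.wr_lock().
              handle enter(unsigned long long now)
              {
                std::lock_guard<std::mutex> g(m);
                return waits.insert(waits.end(), now);
              }

              // Remove exactly this waiter's entry right after wr_lock() returns.
              void leave(handle h)
              {
                std::lock_guard<std::mutex> g(m);
                waits.erase(h);
              }

              // For the watchdog: age of the oldest still-pending exclusive wait,
              // or 0 if nobody is waiting; this would be compared against
              // innodb_fatal_semaphore_wait_threshold before declaring a hang.
              unsigned long long oldest_wait_age(unsigned long long now)
              {
                std::lock_guard<std::mutex> g(m);
                return waits.empty() ? 0 : now - waits.front();
              }
            };

            Enforcing true FIFO hand-over of the latch itself would require more than this, but even a plain timestamp queue would let dict_sys_t::lock_wait() clear only its own entry instead of sharing a single latch_ex_wait_start field.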

            Maybe we should just allow innodb_fatal_semaphore_wait_threshold=0 to disable this watchdog altogether? After all, we want to make the dict_sys.latch less contended (MDEV-33594, MDEV-34988, MDEV-34999, MDEV-35436), and as a result of those changes this watchdog would lose its effectiveness. Most of the recent hangs that we have analyzed have occurred because of the buffer pool. Maybe we should implement another watchdog there, such as making MDEV-36226 observe innodb_fatal_semaphore_wait_threshold?

            For the record, I was not keen to implement any replacement for the infamous "A long semaphore wait" watchdog. In fact, when I removed the sync_array in MDEV-21452, I initially provided no replacement and only later implemented one, which was revised in MDEV-24258. Some colleagues in QA and support insisted that we retain something to catch hangs, and I thought that covering only the dict_sys.latch could be a good enough approximation.

            marko Marko Mäkelä added a comment

            This was reported in Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1101340

            otto Otto Kekäläinen added a comment

            For the record, MDEV-34863 disables the memory pressure event handling by default.

            marko Marko Mäkelä added a comment

            People

              debarun Debarun Banerjee
              xan@biblionix.com Xan Charbonnet
              Votes: 0
              Watchers: 5

