MariaDB Server / MDEV-35424

False alarm/crash: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch

Details

    Description

      Hello,

      I upgraded from 10.11.9 to 10.11.10 yesterday afternoon. Today right around noon I got "innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch" and my mariadb service restarted.

      I'm going to be uploading a stack trace to the MariaDB FTP/SFTP/WebDAV Server once I have an MDEV number so I know how to name it. If somebody could take a look I would appreciate it. Thank you!

      Attachments

        Issue Links

          Activity

            danblack, https://github.com/MariaDB/server/pull/3826 now includes a tentative fix of MDEV-34863, which would introduce a parameter innodb_buffer_pool_size_min. A single pressure event would attempt to shrink the buffer pool by 8 MiB (or on 32-bit systems by 2 MiB) unless the minimum size would be reached.

            This might not be ideal (the user would have to manually SET GLOBAL innodb_buffer_pool_size to a larger value), but this would be a more graceful response to the memory pressure events, and one that could be disabled by specifying a larger innodb_buffer_pool_size_min. I only covered it with the debug instrumented regression test innodb.mem_pressure so far. Can you please test if it would work in a reasonable way when actual memory pressure events are triggered?
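            For illustration only (the names below are made up and this is not the code in the pull request), a single pressure event under the proposed minimum would compute its shrink target roughly like this:

            #include <cstddef>

            // Sketch, not MariaDB code: one memory pressure event shrinks by a
            // fixed step, clamped so that the proposed minimum is never crossed.
            static size_t shrink_target(size_t buf_pool_cur_size, size_t buf_pool_min_size)
            {
              // 8 MiB step, or 2 MiB on 32-bit systems
              const size_t step= sizeof(void*) >= 8 ? size_t{8} << 20 : size_t{2} << 20;
              if (buf_pool_cur_size < buf_pool_min_size + step)
                return buf_pool_min_size;
              return buf_pool_cur_size - step;
            }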

            marko Marko Mäkelä added a comment

            A machine in my 10.11.11 Galera cluster had another "innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch", "got signal 6" event yesterday and dumped its core. It does not seem to have been related to a memory pressure event: there was a memory pressure event close in time, but it occurred after the shutdown had already started. The fatal event was at 2025-03-11 09:56:37 US Central, and the memory pressure event was at 09:56:41.

            I've generated a full stacktrace (hopefully correctly) and uploaded it to the "private" folder of the MariaDB SFTP server. It's called MDEV-35424_backtraces_nj-beta_2025-03-11.txt.

            xan@biblionix.com Xan Charbonnet added a comment

            Thank you. Meanwhile, I revised my proposed fix of MDEV-34863 so that a memory pressure event would attempt to shrink the buffer pool to halfway between the current size and a user-specified minimum limit, which by default would be set so that memory pressure events would not trigger any shrinking.
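            Roughly, with the same made-up names as above, the revised target would be:

            // Sketch only: shrink halfway from the current size towards the minimum.
            // If the minimum defaults to the full buffer pool size (one way of making
            // the default a no-op), target == buf_pool_cur_size and nothing shrinks.
            size_t target= buf_pool_min_size + (buf_pool_cur_size - buf_pool_min_size) / 2;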

            The stack traces look reasonable. We’re missing the debug information for libstdc++.so, but that does not really matter. The single-frame threads showing only ?? are likely io_uring worker threads, which I believe run code belonging to the Linux kernel. We can see that the buf_flush_page_cleaner() thread is in an unbounded wait, that is, there is no buffer pool flushing going on. I also see that the shutdown is trying to reset the watchdog timer:

            	srv_shutdown_state = SRV_SHUTDOWN_CLEANUP;
             
            	if (srv_buffer_pool_dump_at_shutdown &&
            		!srv_read_only_mode && srv_fast_shutdown < 2) {
            		buf_dump_start();
            	}
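            	/* Reset (disarm) the periodic monitor/watchdog timer; the final
            	buffer pool flush only starts after the loop: label further below. */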
            	srv_monitor_timer.reset();
             
            	if (do_srv_shutdown) {
            		srv_shutdown(srv_fast_shutdown == 0);
            	}
             
             
            loop:
            

            Only after the loop: label would the shutdown start the final buffer pool flush. At first I suspected some kind of glitch around the srv_monitor_timer.reset() call, but the cause seems to lie elsewhere:

            void dict_sys_t::lock_wait(SRW_LOCK_ARGS(const char *file, unsigned line)) noexcept
            {
              ulonglong now= my_hrtime_coarse().val, old= 0;
              if (latch_ex_wait_start.compare_exchange_strong
                  (old, now, std::memory_order_relaxed, std::memory_order_relaxed))
              {
                latch.wr_lock(SRW_LOCK_ARGS(file, line));
                latch_ex_wait_start.store(0, std::memory_order_relaxed);
                return;
              }
              // …
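              // Another thread already recorded an earlier wait start; keep it.
              // Note: this path never clears latch_ex_wait_start afterwards; only
              // the winner of the compare_exchange above resets it.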
              latch.wr_lock(SRW_LOCK_ARGS(file, line));
            }
            

            This function does not always reset latch_ex_wait_start to disarm the watchdog kill. If I remember correctly, my reasoning was that there could be multiple threads waiting for an exclusive dict_sys.latch, and we would want to preserve the time when the first pending wait started. The idea is that only the thread blocked in the early-return code path would reset the field. I can easily see how this could miss some hangs, but it is not obvious to me how it could produce a false alarm like the one that you reproduced.

            To fix this, we will need a better data structure that can represent the start time of each pending wait. Maybe even a wait queue, to enforce FIFO ordering of the exclusive latch waits.
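            Not a patch, only a sketch of the kind of bookkeeping meant here (all names are hypothetical): every thread waiting for an exclusive dict_sys.latch would register its own start time in FIFO order, the watchdog would inspect only the oldest pending entry, and each waiter would remove exactly its own entry after acquiring the latch, so one thread's completed wait could neither hide nor trigger an alarm for another thread's wait.

            #include <list>
            #include <mutex>

            // Hypothetical sketch, not MariaDB code: per-waiter start times for
            // exclusive dict_sys.latch waits, kept in FIFO order.
            class latch_wait_times
            {
              std::mutex m;
              std::list<unsigned long long> waits;  // wait start times, oldest first

            public:
              typedef std::list<unsigned long long>::iterator handle;

              // Register a pending wait just before blocking in latch.wr_lock().
              handle enter(unsigned long long now)
              {
                std::lock_guard<std::mutex> g(m);
                return waits.insert(waits.end(), now);
              }

              // Remove exactly this waiter's entry right after wr_lock() returns.
              void leave(handle h)
              {
                std::lock_guard<std::mutex> g(m);
                waits.erase(h);
              }

              // For the watchdog: age of the oldest still-pending exclusive wait,
              // or 0 if nobody is waiting; this would be compared against
              // innodb_fatal_semaphore_wait_threshold before declaring a hang.
              unsigned long long oldest_wait_age(unsigned long long now)
              {
                std::lock_guard<std::mutex> g(m);
                return waits.empty() ? 0 : now - waits.front();
              }
            };

            Enforcing true FIFO hand-over of the latch itself would require more than this, but even a plain timestamp queue would let dict_sys_t::lock_wait() clear only its own entry instead of sharing a single latch_ex_wait_start field.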

            Maybe we should just allow innodb_fatal_semaphore_wait_threshold=0 to disable this watchdog altogether? After all, we want to make the dict_sys.latch less contended (MDEV-33594, MDEV-34988, MDEV-34999, MDEV-35436), and as a result of those changes this watchdog would lose its effectiveness. Most of the recent hangs that we have analyzed have occurred because of the buffer pool. Maybe we should implement another watchdog there, such as making MDEV-36226 observe innodb_fatal_semaphore_wait_threshold?

            For the record, I was not keen to implement any replacement for the infamous "A long semaphore wait" watchdog. In fact, when I removed the sync_array in MDEV-21452, I initially provided no replacement and only later implemented one, which was revised in MDEV-24258. Some colleagues in QA and support insisted that we retain something to catch hangs, and I thought that covering only the dict_sys.latch could be a good enough approximation.

            marko Marko Mäkelä added a comment

            This was reported in Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1101340

            otto Otto Kekäläinen added a comment

            For the record, MDEV-34863 disables the memory pressure event handling by default.

            marko Marko Mäkelä added a comment

            People

              debarun Debarun Banerjee
              xan@biblionix.com Xan Charbonnet
              Votes: 0
              Watchers: 5

