Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24911

Missing warning before [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.mutex

Details

    Description

      This reduced but unfinished testcase:

      # mysqld options used (likely many or most are not required): --max_allowed_packet=33554432 --maximum-bulk_insert_buffer_size=1M --maximum-join_buffer_size=1M --maximum-max_heap_table_size=1M --maximum-max_join_size=1M --maximum-myisam_max_sort_file_size=1M --maximum-myisam_mmap_size=1M --maximum-myisam_sort_buffer_size=1M --maximum-optimizer_trace_max_mem_size=1M --maximum-preload_buffer_size=1M --maximum-query_alloc_block_size=1M --maximum-query_prealloc_size=1M --maximum-range_alloc_block_size=1M --maximum-read_buffer_size=1M --maximum-read_rnd_buffer_size=1M --maximum-sort_buffer_size=1M --maximum-tmp_table_size=1M --maximum-transaction_alloc_block_size=1M --maximum-transaction_prealloc_size=1M --log-output=none --sql_mode=ONLY_FULL_GROUP_BY --innodb_file_per_table=1 --innodb_flush_method=O_DIRECT --innodb_stats_persistent=off --loose-idle_write_transaction_timeout=0 --loose-idle_transaction_timeout=0 --loose-idle_readonly_transaction_timeout=0 --connect_timeout=60 --interactive_timeout=28800 --slave_net_timeout=60 --net_read_timeout=30 --net_write_timeout=60 --loose-table_lock_wait_timeout=50 --wait_timeout=28800 --lock-wait-timeout=86400 --innodb-lock-wait-timeout=50 --log_output=FILE --log-bin --log_bin_trust_function_creators=1 --loose-max-statement-time=30 --loose-debug_assert_on_not_freed_memory=0 --innodb-buffer-pool-size=300M
      SET GLOBAL innodb_stats_persistent=1;#NOERROR
      SET GLOBAL innodb_change_buffering_debug=1;#NOERROR
      SET @@GLOBAL.innodb_limit_optimistic_insert_debug = 2;#NOERROR
      CREATE TABLE t1(c1 DATE NULL, c2 BINARY(25) NOT NULL, c3 SMALLINT(4) NULL, c4 BINARY(15) NOT NULL PRIMARY KEY, c5 DATE NOT NULL UNIQUE KEY,c6 NUMERIC(10,8) NOT NULL DEFAULT 3.141592);#NOERROR                                               
      ALTER TABLE t1 MODIFY COLUMN c2 YEAR(2); ;
      SELECT SLEEP(3);   # Likely not required
      

      Has at least once lead to:

      10.6.0 9118fd360a3da0bba521caf2a35c424968235ac4 (Debug)

      Core was generated by `/test/MD010121-mariadb-10.6.0-linux-x86_64-dbg/bin/mysqld --no-defaults --max_a'.
      Program terminated with signal SIGABRT, Aborted.
      #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
          at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      [Current thread is 1 (Thread 0x1497a59f0700 (LWP 3581261))]
      (gdb) bt
      #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      #1  0x000055c8bec6a0d7 in my_write_core (sig=sig@entry=6) at /test/10.6_dbg/mysys/stacktrace.c:424
      #2  0x000055c8be3feab1 in handle_fatal_signal (sig=6) at /test/10.6_dbg/sql/signal_handler.cc:330
      #3  <signal handler called>
      #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #5  0x00001497da723859 in __GI_abort () at abort.c:79
      #6  0x000055c8bea4b263 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.6_dbg/storage/innobase/ut/ut0ut.cc:573                                                                                              
      #7  0x000055c8be9f4b88 in srv_monitor_task () at /test/10.6_dbg/storage/innobase/include/ut0ut.h:412
      #8  0x000055c8bebf55a0 in tpool::thread_pool_generic::timer_generic::run (this=0x55c8c22e4cc0) at /test/10.6_dbg/tpool/tpool_generic.cc:309                                                                                                   
      #9  tpool::thread_pool_generic::timer_generic::execute (arg=0x55c8c22e4cc0) at /test/10.6_dbg/tpool/tpool_generic.cc:329                                                                                                                      
      #10 0x000055c8bebf650f in tpool::task::execute (this=0x55c8c22e4d00) at /test/10.6_dbg/tpool/task.cc:52
      #11 0x000055c8bebf50b9 in tpool::thread_pool_generic::worker_main (this=0x55c8c1b7ee40, thread_var=0x55c8c1b8e5e0) at /test/10.6_dbg/tpool/tpool_generic.cc:546                                                                               
      #12 0x000055c8bebf53f2 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89                                                                                                                    
      #13 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95                                                  
      #14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244         
      #15 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251                       
      #16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
      #17 0x00001497dab15d84 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
      #18 0x00001497dac31609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #19 0x00001497da820293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      However the bug is extremely hard to reproduce.
      The original run also had this in the error log, though it is not clear if this would still show with the reduced testcase (likely, but not certain):

      10.6.0 9118fd360a3da0bba521caf2a35c424968235ac4 (Debug)

      2021-01-16 17:25:21 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.mutex. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/                                  
      210116 17:25:21 [ERROR] mysqld got signal 6 ;
      

      Attachments

        Issue Links

          Activity

            Possibly related to bugs in a development version of MDEV-26827, I remember seeing early warning messages before the final fatal error. So, the logic does work in some cases. In MDEV-24258, dict_sys.mutex was removed and the instrumentation was changed to affect dict_sys.latch:

            10.6 059a5f11711fd502982abed8781faf9f255fa975

            Version: '10.6.5-MariaDB-1:10.6.5+maria~impish-log'  socket: '/dev/shm/var/tmp/3/mysqld.1.sock'  port: 16020  mariadb.org binary distribution
            2021-10-21 17:47:31 6 [Note] Deleted Master_info file '/dev/shm/var/3/mysqld.1/data/master.info'.
            2021-10-21 17:47:31 6 [Note] Deleted Master_info file '/dev/shm/var/3/mysqld.1/data/relay-log.info'.
            2021-10-21 17:47:31 8 [Note] Start binlog_dump to slave_server(2), pos(master-bin.000001, 4), using_gtid(0), gtid('')
            2021-10-21 17:57:45 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
            

            I copied the above snippet from a run with a broken Ubuntu Impish kernel 5.13.0-19-generic that causes io_uring related hangs (see MDEV-26674). In that log, there are several such messages without any early warning.

            marko Marko Mäkelä added a comment - Possibly related to bugs in a development version of MDEV-26827 , I remember seeing early warning messages before the final fatal error. So, the logic does work in some cases. In MDEV-24258 , dict_sys.mutex was removed and the instrumentation was changed to affect dict_sys.latch : 10.6 059a5f11711fd502982abed8781faf9f255fa975 Version: '10.6.5-MariaDB-1:10.6.5+maria~impish-log' socket: '/dev/shm/var/tmp/3/mysqld.1.sock' port: 16020 mariadb.org binary distribution 2021-10-21 17:47:31 6 [Note] Deleted Master_info file '/dev/shm/var/3/mysqld.1/data/master.info'. 2021-10-21 17:47:31 6 [Note] Deleted Master_info file '/dev/shm/var/3/mysqld.1/data/relay-log.info'. 2021-10-21 17:47:31 8 [Note] Start binlog_dump to slave_server(2), pos(master-bin.000001, 4), using_gtid(0), gtid('') 2021-10-21 17:57:45 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ I copied the above snippet from a run with a broken Ubuntu Impish kernel 5.13.0-19-generic that causes io_uring related hangs (see MDEV-26674 ). In that log, there are several such messages without any early warning.

            In MDEV-32371, there was no early warning for a server hang either.

            I don’t know if this is actually fixable, or worth fixing. The server is unresponsive and partly unusable. The watchdog task will finally kick in.

            marko Marko Mäkelä added a comment - In MDEV-32371 , there was no early warning for a server hang either. I don’t know if this is actually fixable, or worth fixing. The server is unresponsive and partly unusable. The watchdog task will finally kick in.

            For what it's worth: the early warning on the server hang is really useful for tuning the innodb_fatal_semaphore_wait_threshold setting. It looks like the intention is for a warning to be printed at 1/4, 1/2, and 3/4 of the threshold.

            To minimize downtime as a result of a hang, I'd like to lower my threshold as low as it can go without disrupting normal operation. If I run my system for a couple of weeks and don't see any warnings, I know it should be safe to divide my innodb_fatal_semaphore_wait_threshold by 4.

            Once I've done that I can keep an eye out for warnings and see if I can divide by 4 again, or how often I see any of the warnings appear.

            None of that works if the threshold warnings aren't reliable. Maybe it still isn't worth fixing, but I'd at least like to explain why it's useful.

            xan@biblionix.com Xan Charbonnet added a comment - For what it's worth: the early warning on the server hang is really useful for tuning the innodb_fatal_semaphore_wait_threshold setting. It looks like the intention is for a warning to be printed at 1/4, 1/2, and 3/4 of the threshold. To minimize downtime as a result of a hang, I'd like to lower my threshold as low as it can go without disrupting normal operation. If I run my system for a couple of weeks and don't see any warnings, I know it should be safe to divide my innodb_fatal_semaphore_wait_threshold by 4. Once I've done that I can keep an eye out for warnings and see if I can divide by 4 again, or how often I see any of the warnings appear. None of that works if the threshold warnings aren't reliable. Maybe it still isn't worth fixing, but I'd at least like to explain why it's useful.
            otto Otto Kekäläinen added a comment - Related downstream bug report in Ubuntu: https://bugs.launchpad.net/ubuntu/+source/mariadb-10.6/+bug/2008718

            otto, based on the version numbers mentioned in the Ubuntu bug report, I would think that a likely explanation is MDEV-29835, which started to be more easily reproducible after MDEV-21136 and MDEV-30400 were fixed.

            marko Marko Mäkelä added a comment - otto , based on the version numbers mentioned in the Ubuntu bug report, I would think that a likely explanation is MDEV-29835 , which started to be more easily reproducible after MDEV-21136 and MDEV-30400 were fixed.

            People

              marko Marko Mäkelä
              Roel Roel Van de Paar
              Votes:
              3 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.