
MDEV-24225: Long semaphore wait in dict0dict.cc line 943 and semaphore wait >600 sec kill

Details

    Description

      Run the following SQL statements in a continually randomized order, using 150 or 200 threads, perpetually (a minimal driver sketch follows the statement list):

      DROP DATABASE test;
      CREATE DATABASE test;
      USE test;
      CREATE TABLE t1 (a INT) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (NULL);
      XA BEGIN 'x';
      UPDATE t1 SET a = 1;
      DELETE FROM t1;
      ALTER TABLE t1 ADD COLUMN extra INT;
      INSERT INTO t1 (a) VALUES (2);
      ROLLBACK;
      DELETE FROM t1;
      INSERT INTO t1 (a) VALUES (2);
      XA END 'x';
      XA ROLLBACK 'x';
      DROP TABLE t1;
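
      A minimal driver sketch is shown below; it only illustrates the kind of harness meant by "continually randomized order". The host, user, 150-thread count, and the mysql-connector-python dependency are assumptions and not part of the original test setup (the port is taken from the log below).

      # Hypothetical stress driver: runs the statement list above in a freshly
      # shuffled order per round, from many threads, forever. Assumes
      # mysql-connector-python is installed and a server is listening on
      # 127.0.0.1:12191 (the port from the log below); host/user are guesses.
      import random
      import threading
      import mysql.connector

      STATEMENTS = [
          "DROP DATABASE test",
          "CREATE DATABASE test",
          "USE test",
          "CREATE TABLE t1 (a INT) ENGINE=InnoDB",
          "INSERT INTO t1 VALUES (NULL)",
          "XA BEGIN 'x'",
          "UPDATE t1 SET a = 1",
          "DELETE FROM t1",
          "ALTER TABLE t1 ADD COLUMN extra INT",
          "INSERT INTO t1 (a) VALUES (2)",
          "ROLLBACK",
          "DELETE FROM t1",
          "INSERT INTO t1 (a) VALUES (2)",
          "XA END 'x'",
          "XA ROLLBACK 'x'",
          "DROP TABLE t1",
      ]

      def worker():
          conn = mysql.connector.connect(host="127.0.0.1", port=12191, user="root")
          cur = conn.cursor()
          while True:
              # Most statements are expected to fail intermittently because the
              # threads race against each other; errors are simply ignored.
              for stmt in random.sample(STATEMENTS, len(STATEMENTS)):
                  try:
                      cur.execute(stmt)
                  except mysql.connector.Error:
                      pass

      threads = [threading.Thread(target=worker, daemon=True) for _ in range(150)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()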
      

      After about 50k-500k executions, and enough elapsed time to hit the 600-second semaphore-wait timeout, this leads to:

      10.6.0 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8 (Debug)

      2020-11-17 10:53:27 0 [Note] /test/MD211020-mariadb-10.6.0-linux-x86_64-dbg/bin/mysqld: ready for connections.
      Version: '10.6.0-MariaDB-debug' socket: '/test/MD211020-mariadb-10.6.0-linux-x86_64-dbg/socket.sock' port: 12191 MariaDB Server
      2020-11-17 10:53:35 499 [Note] InnoDB: Deferring DROP TABLE `test`.`t1`; renaming to test/#sql-ib33
      2020-11-17 10:57:37 0 [Warning] InnoDB: A long semaphore wait:
      --Thread 22831188305664 has waited at dict0dict.cc line 943 for 241.00 seconds the semaphore:
      Mutex at 0x55e77c0b2a00, Mutex DICT_SYS created /test/10.6_dbg/storage/innobase/dict/dict0dict.cc:1032, lock var 2
      Last time reserved in file /test/10.6_dbg/storage/innobase/row/row0mysql.cc line 3921
      2020-11-17 10:57:37 0 [Note] InnoDB: A semaphore wait:
      --Thread 22831188305664 has waited at dict0dict.cc line 943 for 241.00 seconds the semaphore:
      Mutex at 0x55e77c0b2a00, Mutex DICT_SYS created /test/10.6_dbg/storage/innobase/dict/dict0dict.cc:1032, lock var 2
      Last time reserved in file /test/10.6_dbg/storage/innobase/row/row0mysql.cc line 3921
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending reads 0, writes 0
      InnoDB: ###### Diagnostic info printed to the standard error stream
      ... many more (all on the same dict0dict.cc:943, dict0dict.cc:1032, and row0mysql.cc line 3921)...
      2020-11-17 11:03:47 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      201117 11:03:47 [ERROR] mysqld got signal 6 ;
      

      10.6.0 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8 (Debug)

      Core was generated by `/test/MD211020-mariadb-10.6.0-linux-x86_64-dbg/bin/mysqld --no-defaults --core-'.
      Program terminated with signal SIGABRT, Aborted.
      #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
          at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      [Current thread is 1 (Thread 0x14c3ce1f0700 (LWP 827875))]
      (gdb) bt
      #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      #1 0x000055e77b698021 in my_write_core (sig=sig@entry=6) at /test/10.6_dbg/mysys/stacktrace.c:424
      #2 0x000055e77addf321 in handle_fatal_signal (sig=6) at /test/10.6_dbg/sql/signal_handler.cc:330
      #3 <signal handler called>
      #4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #5 0x000014c48b161859 in __GI_abort () at abort.c:79
      #6 0x000055e77b4574b9 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.6_dbg/storage/innobase/ut/ut0ut.cc:580
      #7 0x000055e77b3e5b04 in srv_error_monitor_task () at /test/10.6_dbg/storage/innobase/include/ut0ut.h:346
      #8 0x000055e77b623746 in tpool::thread_pool_generic::timer_generic::run (this=0x55e77edb1af0) at /test/10.6_dbg/tpool/tpool_generic.cc:299
      #9 tpool::thread_pool_generic::timer_generic::execute (arg=0x55e77edb1af0) at /test/10.6_dbg/tpool/tpool_generic.cc:317
      #10 0x000055e77b62467f in tpool::task::execute (this=0x55e77edb1b30) at /test/10.6_dbg/tpool/task.cc:52
      #11 0x000055e77b62325f in tpool::thread_pool_generic::worker_main (this=0x55e77e3f1b20, thread_var=0x55e77e42c4d0) at /test/10.6_dbg/tpool/tpool_generic.cc:518
      #12 0x000055e77b623598 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
      #13 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
      #14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
      #15 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
      #16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
      #17 0x000014c48b553d84 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
      #18 0x000014c48b66f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #19 0x000014c48b25e293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      10.6.0 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8 (Optimized)

      Core was generated by `/test/MD201020-mariadb-10.6.0-linux-x86_64-opt/bin/mysqld --no-defaults --core-'.
      Program terminated with signal SIGABRT, Aborted.
      #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
          at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      [Current thread is 1 (Thread 0x148ea2df6700 (LWP 1128501))]
      (gdb) bt
      #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      #1 0x000055b083224a4f in my_write_core (sig=sig@entry=6) at /test/10.6_opt/mysys/stacktrace.c:424
      #2 0x000055b082c48130 in handle_fatal_signal (sig=6) at /test/10.6_opt/sql/signal_handler.cc:330
      #3 <signal handler called>
      #4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #5 0x0000148fb2181859 in __GI_abort () at abort.c:79
      #6 0x000055b08290d1f7 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.6_opt/storage/innobase/ut/ut0ut.cc:580
      #7 0x000055b082902fde in srv_error_monitor_task () at /test/10.6_opt/storage/innobase/include/ut0ut.h:346
      #8 0x000055b0831b7159 in tpool::thread_pool_generic::timer_generic::run (this=0x55b08599f5a0) at /test/10.6_opt/tpool/tpool_generic.cc:299
      #9 tpool::thread_pool_generic::timer_generic::execute (arg=0x55b08599f5a0) at /test/10.6_opt/tpool/tpool_generic.cc:317
      #10 0x000055b0831b815a in tpool::task::execute (this=0x55b08599f5e0) at /test/10.6_opt/tpool/task.cc:52
      #11 tpool::task::execute (this=0x55b08599f5e0) at /test/10.6_opt/tpool/task.cc:42
      #12 0x000055b0831b6c47 in tpool::thread_pool_generic::worker_main (this=0x55b0853a52e0, thread_var=0x55b0853dec00) at /test/10.6_opt/tpool/tpool_generic.cc:518
      #13 0x0000148fb2573d84 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
      #14 0x0000148fb268f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #15 0x0000148fb227e293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

    Activity

            marko Marko Mäkelä added a comment:

            In both MDEV-24225.txt and MDEV-24225_10.5.9_7f30a5c423249c7fcfc9f9ac1578d06263aa37ac_dbg.txt, the progress of most InnoDB threads is blocked on dict_sys.mutex and dict_sys.latch, which are held by a thread that does not itself appear to be blocked. In MDEV-24225_10.5.9_7f30a5c423249c7fcfc9f9ac1578d06263aa37ac_dbg.txt, that thread is:

            10.5 7f30a5c423249c7fcfc9f9ac1578d06263aa37ac

            Thread 2 (Thread 0x1507ac1ad700 (LWP 4193329)):
            #0  0x000055579e9f0e76 in MutexDebug<TTASEventMutex<GenericPolicy> >::is_owned (this=0x55579f897270 <dict_sys+16>) at /test/10.5_dbg/storage/innobase/include/sync0policy.h:158
            #1  0x000055579ed2e505 in PolicyMutex<TTASEventMutex<GenericPolicy> >::is_owned (this=<optimized out>) at /test/10.5_dbg/storage/innobase/include/ib0mutex.h:635
            #2  dict_get_first_table_name_in_db (name=name@entry=0x150774025478 "test/") at /test/10.5_dbg/storage/innobase/dict/dict0load.cc:223
            #3  0x000055579eb6d82b in row_drop_database_for_mysql (name=<optimized out>, name@entry=0x150774025478 "test/", trx=trx@entry=0x1507a7c03690, found=found@entry=0x1507ac1ab368) at /test/10.5_dbg/storage/innobase/row/row0mysql.cc:3921
            #4  0x000055579e9c68e9 in innobase_drop_database (hton=<optimized out>, path=<optimized out>) at /test/10.5_dbg/storage/innobase/handler/ha_innodb.cc:13427
            

            This is blocking (among other threads) the master task that is executing row_drop_tables_for_mysql_in_background().

            Likewise, in MDEV-24225.txt, row_drop_tables_for_mysql_in_background() and many others are waiting for the completion of row_drop_database_for_mysql() in Thread 4, which does not seem to be blocked.

            It is hard to say what trips the InnoDB watchdog. There seemingly is some progress, and not everything is stuck.

            But, in any case, the entire "background drop table queue" is an InnoDB workaround for bugs at the storage engine layer. I think that this should be fixed by MDEV-15532, MDEV-21175, MDEV-21602, or similar. The entire workaround should be made redundant and removed.

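            As a highly simplified illustration (plain Python, not InnoDB code), the watchdog pattern referred to above looks roughly like this: one thread holds a shared lock during a long scan, other threads queue up behind it, and a monitor task aborts the whole process with SIGABRT ("signal 6") once any single wait exceeds a threshold.

            # Hypothetical sketch of the semaphore-wait watchdog pattern; plain
            # Python, not InnoDB code. The 600-second limit is shortened so the
            # sketch finishes quickly.
            import os
            import threading
            import time

            dict_lock = threading.Lock()   # stand-in for dict_sys.mutex
            wait_started = {}              # worker name -> time it began waiting
            WATCHDOG_LIMIT = 3             # seconds; InnoDB uses 600

            def long_scan():
                with dict_lock:
                    time.sleep(60)         # long "dictionary scan" while holding the lock

            def blocked_worker(name):
                wait_started[name] = time.time()
                with dict_lock:            # queues up behind long_scan()
                    wait_started.pop(name, None)

            def watchdog():
                while True:
                    time.sleep(1)
                    waits = [time.time() - t for t in list(wait_started.values())]
                    if waits and max(waits) > WATCHDOG_LIMIT:
                        # corresponds to "Semaphore wait has lasted > 600 seconds.
                        # We intentionally crash the server because it appears to be hung."
                        os.abort()

            threading.Thread(target=long_scan, daemon=True).start()
            for i in range(4):
                threading.Thread(target=blocked_worker, args=(f"w{i}",), daemon=True).start()
            watchdog()                     # runs in the main thread; aborts after a few seconds

            In the hang reported here, by contrast, the lock holder did not itself appear to be stuck, which is why it is unclear what exactly tripped the watchdog.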

            marko Marko Mäkelä added a comment:

            Roel, is this repeatable with the latest 10.6? The broken-by-design "background drop table queue" was removed in 10.6.3.


            Roel Roel Van de Paar added a comment:

            The issue is no longer reproducible in 10.6.3 891a927e804c5a3a582f6137c2f316ef7abb25ca (Debug + Optimized). Also tested 10.7.0 71ed8c136fa203b9b3a678a6d5cc72235ef73ef7 (Debug + Optimized), with the same result. Closing. Thank you, marko.


            Roel Roel Van de Paar added a comment:

            ccalender, I tested 10.5.13 4eb7217ec33fef8d23f2dda0c97b442508c81b1d (Debug) and the issue is not reproducible there.

            marko, any idea how/why it could be fixed in 10.5? Thanks


            marko Marko Mäkelä added a comment:

            If this is not easily repeatable with 10.5.12 or later releases, the explanation could be a number of fixes to page flushing. MDEV-25801 a.k.a. MDEV-25773 could be a good candidate, but there were many other fixes in 10.5.12 that were motivated by MDEV-25451.


    People

      marko Marko Mäkelä
      Roel Roel Van de Paar
      Votes: 1
      Watchers: 7
