[MDEV-24225] Long semaphore wait in dict0dict.cc line 943 and semaphore wait >600 sec kill Created: 2020-11-17  Updated: 2021-10-27  Resolved: 2021-07-26

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB, XA
Affects Version/s: 10.4, 10.5, 10.6
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Roel Van de Paar Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: None

Attachments: Text File MDEV-24225.txt     Text File MDEV-24225_10.5.7_dbg_e3fc9c1db04bdc5566bcdafcc3d5159199089c00.txt     Text File MDEV-24225_10.5.9_7f30a5c423249c7fcfc9f9ac1578d06263aa37ac_dbg.txt     Text File MDEV-24225_10.5.9_7f30a5c423249c7fcfc9f9ac1578d06263aa37ac_dbg_2.txt    
Issue Links:
Relates
relates to MDEV-21175 Remove dict_table_t::n_foreign_key_ch... Closed
relates to MDEV-21602 CREATE TABLE…PRIMARY KEY…SELECT worka... Closed
relates to MDEV-15532 XA: Assertion `!log->same_pk' failed ... Closed

 Description   

Run the following, in continually random SQL order, using 150 or 200 threads, perpetually:

DROP DATABASE test;
CREATE DATABASE test;
USE test;
CREATE TABLE t1 (a INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (NULL);
XA BEGIN 'x';
UPDATE t1 SET a = 1;
DELETE FROM t1;
ALTER TABLE t1 ADD COLUMN extra INT;
INSERT INTO t1 (a) VALUES (2);
ROLLBACK;
DELETE FROM t1;
INSERT INTO t1 (a) VALUES (2);
XA END 'x';
XA ROLLBACK 'x';
DROP TABLE t1;

After about 50k-500k executions, and sufficient time to hit the 600sec timeout, this leads to:

10.6.0 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8 (Debug)

2020-11-17 10:53:27 0 [Note] /test/MD211020-mariadb-10.6.0-linux-x86_64-dbg/bin/mysqld: ready for connections.
Version: '10.6.0-MariaDB-debug' socket: '/test/MD211020-mariadb-10.6.0-linux-x86_64-dbg/socket.sock' port: 12191 MariaDB Server
2020-11-17 10:53:35 499 [Note] InnoDB: Deferring DROP TABLE `test`.`t1`; renaming to test/#sql-ib33
2020-11-17 10:57:37 0 [Warning] InnoDB: A long semaphore wait:
--Thread 22831188305664 has waited at dict0dict.cc line 943 for 241.00 seconds the semaphore:
Mutex at 0x55e77c0b2a00, Mutex DICT_SYS created /test/10.6_dbg/storage/innobase/dict/dict0dict.cc:1032, lock var 2
Last time reserved in file /test/10.6_dbg/storage/innobase/row/row0mysql.cc line 3921
2020-11-17 10:57:37 0 [Note] InnoDB: A semaphore wait:
--Thread 22831188305664 has waited at dict0dict.cc line 943 for 241.00 seconds the semaphore:
Mutex at 0x55e77c0b2a00, Mutex DICT_SYS created /test/10.6_dbg/storage/innobase/dict/dict0dict.cc:1032, lock var 2
Last time reserved in file /test/10.6_dbg/storage/innobase/row/row0mysql.cc line 3921
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending reads 0, writes 0
InnoDB: ###### Diagnostic info printed to the standard error stream
... many more (all on the same dict0dict.cc:943, dict0dict.cc:1032, and row0mysql.cc line 3921)...
2020-11-17 11:03:47 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
201117 11:03:47 [ERROR] mysqld got signal 6 ;

10.6.0 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8 (Debug)

Core was generated by `/test/MD211020-mariadb-10.6.0-linux-x86_64-dbg/bin/mysqld --no-defaults --core-'.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
    at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
[Current thread is 1 (Thread 0x14c3ce1f0700 (LWP 827875))]
(gdb) bt
#0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1 0x000055e77b698021 in my_write_core (sig=sig@entry=6) at /test/10.6_dbg/mysys/stacktrace.c:424
#2 0x000055e77addf321 in handle_fatal_signal (sig=6) at /test/10.6_dbg/sql/signal_handler.cc:330
#3 <signal handler called>
#4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5 0x000014c48b161859 in __GI_abort () at abort.c:79
#6 0x000055e77b4574b9 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.6_dbg/storage/innobase/ut/ut0ut.cc:580
#7 0x000055e77b3e5b04 in srv_error_monitor_task () at /test/10.6_dbg/storage/innobase/include/ut0ut.h:346
#8 0x000055e77b623746 in tpool::thread_pool_generic::timer_generic::run (this=0x55e77edb1af0) at /test/10.6_dbg/tpool/tpool_generic.cc:299
#9 tpool::thread_pool_generic::timer_generic::execute (arg=0x55e77edb1af0) at /test/10.6_dbg/tpool/tpool_generic.cc:317
#10 0x000055e77b62467f in tpool::task::execute (this=0x55e77edb1b30) at /test/10.6_dbg/tpool/task.cc:52
#11 0x000055e77b62325f in tpool::thread_pool_generic::worker_main (this=0x55e77e3f1b20, thread_var=0x55e77e42c4d0) at /test/10.6_dbg/tpool/tpool_generic.cc:518
#12 0x000055e77b623598 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
#13 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#15 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#17 0x000014c48b553d84 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x000014c48b66f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#19 0x000014c48b25e293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

10.6.0 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8 (Optimized)

Core was generated by `/test/MD201020-mariadb-10.6.0-linux-x86_64-opt/bin/mysqld --no-defaults --core-'.
Program terminated with signal SIGABRT, Aborted.
#0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
    at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
[Current thread is 1 (Thread 0x148ea2df6700 (LWP 1128501))]
(gdb) bt
#0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1 0x000055b083224a4f in my_write_core (sig=sig@entry=6) at /test/10.6_opt/mysys/stacktrace.c:424
#2 0x000055b082c48130 in handle_fatal_signal (sig=6) at /test/10.6_opt/sql/signal_handler.cc:330
#3 <signal handler called>
#4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5 0x0000148fb2181859 in __GI_abort () at abort.c:79
#6 0x000055b08290d1f7 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.6_opt/storage/innobase/ut/ut0ut.cc:580
#7 0x000055b082902fde in srv_error_monitor_task () at /test/10.6_opt/storage/innobase/include/ut0ut.h:346
#8 0x000055b0831b7159 in tpool::thread_pool_generic::timer_generic::run (this=0x55b08599f5a0) at /test/10.6_opt/tpool/tpool_generic.cc:299
#9 tpool::thread_pool_generic::timer_generic::execute (arg=0x55b08599f5a0) at /test/10.6_opt/tpool/tpool_generic.cc:317
#10 0x000055b0831b815a in tpool::task::execute (this=0x55b08599f5e0) at /test/10.6_opt/tpool/task.cc:52
#11 tpool::task::execute (this=0x55b08599f5e0) at /test/10.6_opt/tpool/task.cc:42
#12 0x000055b0831b6c47 in tpool::thread_pool_generic::worker_main (this=0x55b0853a52e0, thread_var=0x55b0853dec00) at /test/10.6_opt/tpool/tpool_generic.cc:518
#13 0x0000148fb2573d84 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#14 0x0000148fb268f609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#15 0x0000148fb227e293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



 Comments   
Comment by Roel Van de Paar [ 2020-11-17 ]

This may be related to MDEV-15532

Comment by Roel Van de Paar [ 2020-11-17 ]

10.5.7 e3fc9c1db04bdc5566bcdafcc3d5159199089c00 (Debug)

Core was generated by `/test/MD211020-mariadb-10.5.7-linux-x86_64-dbg/bin/mysqld --no-defaults --core-'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
    at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
[Current thread is 1 (Thread 0x1553aedf6700 (LWP 1860155))]
(gdb) bt
#0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
#1  0x000055a40b71a869 in my_write_core (sig=sig@entry=6) at /test/10.5_dbg/mysys/stacktrace.c:424
#2  0x000055a40ae6ff1f in handle_fatal_signal (sig=6) at /test/10.5_dbg/sql/signal_handler.cc:330
#3  <signal handler called>
#4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#5  0x00001553fbf53859 in __GI_abort () at abort.c:79
#6  0x000055a40b4e8e75 in ib::fatal::~fatal (this=<optimized out>, __in_chrg=<optimized out>) at /test/10.5_dbg/storage/innobase/ut/ut0ut.cc:580
#7  0x000055a40b477f53 in srv_error_monitor_task () at /test/10.5_dbg/storage/innobase/include/ut0ut.h:346
#8  0x000055a40b6a5e44 in tpool::thread_pool_generic::timer_generic::run (this=0x55a40e1cdea0) at /test/10.5_dbg/tpool/tpool_generic.cc:299
#9  tpool::thread_pool_generic::timer_generic::execute (arg=0x55a40e1cdea0) at /test/10.5_dbg/tpool/tpool_generic.cc:317
#10 0x000055a40b6a6d7d in tpool::task::execute (this=0x55a40e1cdee0) at /test/10.5_dbg/tpool/task.cc:52
#11 0x000055a40b6a595d in tpool::thread_pool_generic::worker_main (this=0x55a40d831cb0, thread_var=0x55a40d86c7e0) at /test/10.5_dbg/tpool/tpool_generic.cc:518
#12 0x000055a40b6a5c96 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__t=<optimized out>, __f=<optimized out>) at /usr/include/c++/9/bits/invoke.h:89
#13 std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=<optimized out>) at /usr/include/c++/9/bits/invoke.h:95
#14 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=<optimized out>) at /usr/include/c++/9/thread:244
#15 std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=<optimized out>) at /usr/include/c++/9/thread:251
#16 std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=<optimized out>) at /usr/include/c++/9/thread:195
#17 0x00001553fc345d84 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#18 0x00001553fc461609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#19 0x00001553fc050293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Comment by Roel Van de Paar [ 2020-11-17 ]

Issue confirmed thus far on:
MD 10.6.0 opt 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8
MD 10.6.0 dbg 5d4599f9750140f92cfdbbe4d292ae1b8dd456f8
MD 10.5.7 opt e3fc9c1db04bdc5566bcdafcc3d5159199089c00
MD 10.5.7 dbg e3fc9c1db04bdc5566bcdafcc3d5159199089c00
MD 10.5.9 dbg 7f30a5c423249c7fcfc9f9ac1578d06263aa37ac
MD 10.4.16 dbg 7eda55619654b76add275695e0a6039e60876e81

Issue does not seem to be present in MS 5.7.32

Comment by Roel Van de Paar [ 2020-11-17 ]

thread apply all bt added (10.5.9 dbg)

Comment by Roel Van de Paar [ 2020-11-17 ]

10.4.16 Crashes in a different way:

10.4.16 7eda55619654b76add275695e0a6039e60876e81 (Debug)

Core was generated by `/test/MD211020-mariadb-10.4.16-linux-x86_64-dbg/bin/mysqld --no-defaults --core'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000014b77a0e0aff in __GI___poll (fds=fds@entry=0x7ffc7acdc950, 
    nfds=nfds@entry=2, timeout=timeout@entry=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:29
[Current thread is 1 (Thread 0x14b779e347c0 (LWP 3463434))]
(gdb) bt
#0  0x000014b77a0e0aff in __GI___poll (fds=fds@entry=0x7ffc7acdc950, nfds=nfds@entry=2, timeout=timeout@entry=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x000014b77a0fdaa2 in __poll_chk (fds=fds@entry=0x7ffc7acdc950, nfds=nfds@entry=2, timeout=timeout@entry=-1, fdslen=fdslen@entry=24) at poll_chk.c:27
#2  0x00005634e75f5cac in poll (__timeout=-1, __nfds=2, __fds=0x7ffc7acdc950) at /usr/include/x86_64-linux-gnu/bits/poll2.h:41
#3  handle_connections_sockets () at /test/10.4_dbg/sql/mysqld.cc:6481
#4  0x00005634e75f7958 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at /test/10.4_dbg/sql/mysqld.cc:5917
#5  0x00005634e75e7ad6 in main (argc=<optimized out>, argv=<optimized out>) at /test/10.4_dbg/sql/main.cc:25

Comment by Marko Mäkelä [ 2020-11-17 ]

In both MDEV-24225.txt and MDEV-24225_10.5.9_7f30a5c423249c7fcfc9f9ac1578d06263aa37ac_dbg.txt, the progress of most InnoDB threads is blocked by dict_sys.mutex and dict_sys.latch that are being held by a thread that does not seem to be blocked. In MDEV-24225_10.5.9_7f30a5c423249c7fcfc9f9ac1578d06263aa37ac_dbg.txt that thread is

10.5 7f30a5c423249c7fcfc9f9ac1578d06263aa37ac

Thread 2 (Thread 0x1507ac1ad700 (LWP 4193329)):
#0  0x000055579e9f0e76 in MutexDebug<TTASEventMutex<GenericPolicy> >::is_owned (this=0x55579f897270 <dict_sys+16>) at /test/10.5_dbg/storage/innobase/include/sync0policy.h:158
#1  0x000055579ed2e505 in PolicyMutex<TTASEventMutex<GenericPolicy> >::is_owned (this=<optimized out>) at /test/10.5_dbg/storage/innobase/include/ib0mutex.h:635
#2  dict_get_first_table_name_in_db (name=name@entry=0x150774025478 "test/") at /test/10.5_dbg/storage/innobase/dict/dict0load.cc:223
#3  0x000055579eb6d82b in row_drop_database_for_mysql (name=<optimized out>, name@entry=0x150774025478 "test/", trx=trx@entry=0x1507a7c03690, found=found@entry=0x1507ac1ab368) at /test/10.5_dbg/storage/innobase/row/row0mysql.cc:3921
#4  0x000055579e9c68e9 in innobase_drop_database (hton=<optimized out>, path=<optimized out>) at /test/10.5_dbg/storage/innobase/handler/ha_innodb.cc:13427

This is blocking (among other threads) the master task that is executing row_drop_tables_for_mysql_in_background().

Likewise, in MDEV-24225.txt, row_drop_tables_for_mysql_in_background() and many others are waiting for the completion of row_drop_database_for_mysql() in Thread 4, which does not seem to be blocked.

It is hard to say what trips the InnoDB watchdog. There seemingly is some progress, and not everything is stuck.

But, in any case, the entire "background drop table queue" is an InnoDB workaround for bugs at the storage engine layer. I think that this should be fixed by MDEV-15532, MDEV-21175, MDEV-21602, or similar. The entire workaround should be made redundant and removed.

Comment by Marko Mäkelä [ 2021-07-23 ]

Roel, is this repeatable with the latest 10.6? The broken-by-design "background drop table queue" was removed in 10.6.3.

Comment by Roel Van de Paar [ 2021-07-26 ]

Issue no longer reproducible in 10.6.3 891a927e804c5a3a582f6137c2f316ef7abb25ca (Debug + Optimized). Also tested 10.7.0 71ed8c136fa203b9b3a678a6d5cc72235ef73ef7 (Debug + Optimized), same result. Closing. Thank you marko

Comment by Roel Van de Paar [ 2021-10-27 ]

ccalender I tested 10.5.13 4eb7217ec33fef8d23f2dda0c97b442508c81b1d (Debug) and the issue is not reproducible there.

marko Any idea how/why it could be fixed in 10.5? Thanks

Comment by Marko Mäkelä [ 2021-10-27 ]

If this is not easily repeatable with 10.5.12 or later releases, the explanation could be a number of fixes to page flushing. MDEV-25801 a.k.a. MDEV-25773 could be a good candidate, but there were many other fixes in 10.5.12 that were motivated by MDEV-25451.

Generated at Thu Feb 08 09:28:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.