[MDEV-27985] buf_flush_freed_pages() causes InnoDB to hang Created: 2022-03-02  Updated: 2022-06-14  Resolved: 2022-03-15

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6.7
Fix Version/s: 10.9.0, 10.5.16, 10.6.8, 10.7.4, 10.8.3

Type: Bug Priority: Critical
Reporter: Otto Kekäläinen Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: crash, deadlock

Issue Links:
Problem/Incident
causes MDEV-28828 SIGSEGV in buf_flush_LRU_list_batch Closed
is caused by MDEV-15528 Avoid writing freed InnoDB pages Closed
Relates
relates to MDEV-21452 Use condition variables and normal mu... Closed

 Description   

After uploading 1:10.6.7-2 to Debian I saw this test failure on the armel builder:

ain.parser_bug21114_innodb 'innodb'     w4 [ fail ]
        Test ended at 2022-03-02 07:38:10
 
CURRENT_TEST: main.parser_bug21114_innodb
mysqltest: In included file "./include/parser_bug21114.inc": 
included from /<<PKGBUILDDIR>>/mysql-test/main/parser_bug21114_innodb.test at line 244:
At line 36: query 'CREATE TABLE bug21114_child(
pk int not null,
fk_col1 int not null,
fk_col2 int not null,
fk_col3 int not null,
fk_col4 int not null,
CONSTRAINT fk_fct FOREIGN KEY (fk_col1, fk_col2)
REFERENCES $FCT(col1, col2),
CONSTRAINT fk_fct_space FOREIGN KEY (fk_col3, fk_col4)
REFERENCES $FCT (col1, col2)
) ENGINE $engine_type' failed: <Unknown> (2013): Lost connection to server during query
 
The result from queries just before the failure was:
< snip >
DROP TABLE bug21114_child;
DROP TABLE ltrim;
drop table if exists makedate;
drop table if exists bug21114_child;
DROP TABLE bug21114_child;
DROP TABLE makedate;
drop table if exists maketime;
drop table if exists bug21114_child;
DROP TABLE bug21114_child;
DROP TABLE maketime;
drop table if exists make_set;
drop table if exists bug21114_child;
DROP TABLE bug21114_child;
DROP TABLE make_set;
drop table if exists master_pos_wait;
drop table if exists bug21114_child;
DROP TABLE bug21114_child;
DROP TABLE master_pos_wait;
drop table if exists mbrcontains;
drop table if exists bug21114_child;
 
More results from queries before failure can be found in /<<PKGBUILDDIR>>/builddir/mysql-test/var/4/log/parser_bug21114_innodb.log
 
 
Server [mysqld.1 - pid: 8467, winpid: 8467, exit: 256] failed during test run
Server log from this test:
----------SERVER LOG START-----------
2022-03-02  7:38:09 188 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
220302  7:38:09 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.6.7-MariaDB-2-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=3
max_threads=153
thread_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 62381 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0xafb01b20
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0xaf2fbc5c thread_stack 0x49000
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0xafb11de8): CREATE TABLE bug21114_child(
pk int not null,
fk_col1 int not null,
fk_col2 int not null,
fk_col3 int not null,
fk_col4 int not null,
CONSTRAINT fk_fct FOREIGN KEY (fk_col1, fk_col2)
REFERENCES mbrcontains(col1, col2),
CONSTRAINT fk_fct_space FOREIGN KEY (fk_col3, fk_col4)
REFERENCES mbrcontains (col1, col2)
) ENGINE InnoDb
 
Connection ID (thread ID): 188
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
 
The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /<<PKGBUILDDIR>>/builddir/mysql-test/var/4/mysqld.1/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             96403                96403                processes 
Max open files            1024                 1024                 files     
Max locked memory         1584997376           1584997376           bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       96403                96403                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: core
 
----------SERVER LOG END-------------
 
 
 - found 'core' (0/5)
 
Trying 'dbx' to get a backtrace
 
Trying 'gdb' to get a backtrace from coredump /<<PKGBUILDDIR>>/builddir/mysql-test/var/4/log/main.parser_bug21114_innodb-innodb/mysqld.1/data/core
Core generated by '/<<PKGBUILDDIR>>/builddir/sql/mariadbd'
Output from gdb follows. The first stack trace is from the failing thread.
The following stack traces are from all threads (so the failing one is
duplicated).
--------------------------
 
warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing
 
warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing
[New LWP 8498]
[New LWP 8475]
[New LWP 8582]
[New LWP 8476]
[New LWP 8485]
[New LWP 8487]
[New LWP 8479]
[New LWP 8489]
[New LWP 8486]
[New LWP 8581]
[New LWP 8480]
[New LWP 8900]
[New LWP 8468]
[New LWP 8482]
[New LWP 8929]
[New LWP 8495]
[New LWP 8483]
[New LWP 8580]
[New LWP 8579]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabi/libthread_db.so.1".
Core was generated by `/<<PKGBUILDDIR>>/builddir/sql/mariadbd --defaults'.
Program terminated with signal SIGABRT, Aborted.
#0  0xb6aa07a8 in pthread_kill () from /lib/arm-linux-gnueabi/libpthread.so.0
[Current thread is 1 (Thread 0xaf2fc330 (LWP 8498))]
#0  0xb6aa07a8 in pthread_kill () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0x00b3ca2c in handle_fatal_signal (sig=<optimized out>) at ./sql/signal_handler.cc:345
#2  <signal handler called>
#3  0xb66dcfd0 in raise () from /lib/arm-linux-gnueabi/libc.so.6
#4  0xb66c5a88 in abort () from /lib/arm-linux-gnueabi/libc.so.6
#5  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 19 (Thread 0xab1ff330 (LWP 8579)):
#0  0xb678b130 in syscall () from /lib/arm-linux-gnueabi/libc.so.6
#1  0x00f0a590 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:347
#2  srw_mutex_impl<false>::wait_and_lock (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:461
#3  0x00f0ad78 in srw_mutex_impl<false>::wr_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:80
#4  ssux_lock_impl<false>::rd_wait (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:523
#5  0x00f0af5c in ssux_lock_impl<false>::rd_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:216
#6  srw_lock_impl<false>::psi_rd_lock (this=0x16a6a00 <dict_sys+64>, file=<optimized out>, line=<optimized out>) at ./storage/innobase/sync/srw_lock.cc:581
#7  0x00f80e98 in srw_lock_impl<false>::rd_lock (line=834, file=<optimized out>, this=<optimized out>) at ./storage/innobase/include/srw_lock.h:483
#8  0x00f8686c in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f7e178, mdl=mdl@entry=0x2f3c768) at ./storage/innobase/dict/dict0dict.cc:834
#9  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f3c608, updated_extern=0xab1febc2, undo_rec=0xadaf9850 "", node=0x2f3c660) at ./storage/innobase/row/row0purge.cc:1028
#10 row_purge (thr=0x2f3c608, undo_rec=0xadaf9850 "", node=0x2f3c660) at ./storage/innobase/row/row0purge.cc:1206
#11 row_purge_step (thr=thr@entry=0x2f3c608) at ./storage/innobase/row/row0purge.cc:1258
#12 0x00eb137c in que_thr_step (thr=0x2f3c608) at ./storage/innobase/que/que0que.cc:653
#13 que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:709
#14 que_run_threads (thr=thr@entry=0x2f3c608) at ./storage/innobase/que/que0que.cc:729
#15 0x00f079d4 in srv_task_execute () at ./storage/innobase/srv/srv0srv.cc:1714
#16 purge_worker_callback () at ./storage/innobase/srv/srv0srv.cc:1957
#17 0x00fffccc in tpool::task_group::execute (this=0x1f5cab8 <purge_task_group>, t=<optimized out>) at ./tpool/task_group.cc:55
#18 0x00ffeafc in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91228) at ./tpool/tpool_generic.cc:549
#19 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#20 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#21 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 18 (Thread 0xaa9fe330 (LWP 8580)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e911a8) at /usr/include/c++/11/bits/std_mutex.h:169
#5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e911a8) at /usr/include/c++/11/condition_variable:201
#6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e911a8) at /usr/include/c++/11/condition_variable:111
#7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e911a8) at /usr/include/c++/11/condition_variable:163
#8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e911a8) at ./tpool/tpool_generic.cc:449
#9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e911a8, t=t@entry=0xaa9fdc94) at ./tpool/tpool_generic.cc:502
#10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e911a8) at ./tpool/tpool_generic.cc:547
#11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 17 (Thread 0xae4f4330 (LWP 8483)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e91428) at /usr/include/c++/11/bits/std_mutex.h:169
#5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e91428) at /usr/include/c++/11/condition_variable:201
#6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e91428) at /usr/include/c++/11/condition_variable:111
#7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e91428) at /usr/include/c++/11/condition_variable:163
#8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e91428) at ./tpool/tpool_generic.cc:449
#9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e91428, t=t@entry=0xae4f3c94) at ./tpool/tpool_generic.cc:502
#10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91428) at ./tpool/tpool_generic.cc:547
#11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 16 (Thread 0xafc5c330 (LWP 8495)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9eadc in pthread_cond_wait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0x0081673c in psi_cond_wait (that=0x170bcd8 <COND_manager>, mutex=0x170bd10 <LOCK_manager>, file=<optimized out>, line=<optimized out>) at ./mysys/my_thr_init.c:596
#4  0x009115f4 in inline_mysql_cond_wait (that=<optimized out>, mutex=<optimized out>, src_file=<optimized out>, src_line=<optimized out>) at ./include/mysql/psi/mysql_thread.h:1070
#5  handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:103
#6  0x00d734c4 in pfs_spawn_thread (arg=0x2fb2748) at ./storage/perfschema/pfs.cc:2201
#7  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#8  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 15 (Thread 0xadca8330 (LWP 8929)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x008167d4 in psi_cond_timedwait (that=0x170a660 <thread_cache>, mutex=0x170a6d0 <thread_cache+112>, abstime=0xadca7c20, file=<optimized out>, line=176) at ./mysys/my_thr_init.c:609
#5  0x00a1dd94 in inline_mysql_cond_timedwait (src_file=0x10fda00 "./sql/thread_cache.h", src_line=176, abstime=0xadca7c20, mutex=<optimized out>, that=0x170a660 <thread_cache>) at ./include/mysql/psi/mysql_thread.h:1086
#6  Thread_cache::park (this=0x170a660 <thread_cache>) at ./sql/thread_cache.h:176
#7  0x00a1d7a4 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x2fcd128, put_in_cache=put_in_cache@entry=true) at ./sql/sql_connect.cc:1431
#8  0x00a1db40 in handle_one_connection (arg=arg@entry=0x2fcd128) at ./sql/sql_connect.cc:1312
#9  0x00d734c4 in pfs_spawn_thread (arg=0x3061048) at ./storage/perfschema/pfs.cc:2201
#10 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#11 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 14 (Thread 0xaecf5330 (LWP 8482)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e914a8) at /usr/include/c++/11/bits/std_mutex.h:169
#5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e914a8) at /usr/include/c++/11/condition_variable:201
#6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e914a8) at /usr/include/c++/11/condition_variable:111
#7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e914a8) at /usr/include/c++/11/condition_variable:163
#8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e914a8) at ./tpool/tpool_generic.cc:449
#9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e914a8, t=t@entry=0xaecf4c94) at ./tpool/tpool_generic.cc:502
#10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e914a8) at ./tpool/tpool_generic.cc:547
#11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 13 (Thread 0xb6ff1880 (LWP 8468)):
#0  0xb6782fe0 in poll () from /lib/arm-linux-gnueabi/libc.so.6
#1  0x0085af4c in poll (__timeout=-1, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/arm-linux-gnueabi/bits/poll2.h:47
#2  handle_connections_sockets () at ./sql/mysqld.cc:6118
#3  0x0085c258 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at ./sql/mysqld.cc:5823
#4  0xb66c5e24 in __libc_start_main () from /lib/arm-linux-gnueabi/libc.so.6
#5  0x0084f1a8 in _start () at ./sql/my_decimal.h:147
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 12 (Thread 0xadcf3330 (LWP 8900)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x008167d4 in psi_cond_timedwait (that=0x170a660 <thread_cache>, mutex=0x170a6d0 <thread_cache+112>, abstime=0xadcf2c20, file=<optimized out>, line=176) at ./mysys/my_thr_init.c:609
#5  0x00a1dd94 in inline_mysql_cond_timedwait (src_file=0x10fda00 "./sql/thread_cache.h", src_line=176, abstime=0xadcf2c20, mutex=<optimized out>, that=0x170a660 <thread_cache>) at ./include/mysql/psi/mysql_thread.h:1086
#6  Thread_cache::park (this=0x170a660 <thread_cache>) at ./sql/thread_cache.h:176
#7  0x00a1d7a4 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x3060c60, put_in_cache=put_in_cache@entry=true) at ./sql/sql_connect.cc:1431
#8  0x00a1db40 in handle_one_connection (arg=arg@entry=0x3060c60) at ./sql/sql_connect.cc:1312
#9  0x00d734c4 in pfs_spawn_thread (arg=0x3061048) at ./storage/perfschema/pfs.cc:2201
#10 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#11 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 11 (Thread 0xafafd330 (LWP 8480)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x00f70e50 in buf_flush_page_cleaner () at ./storage/innobase/buf/buf0flu.cc:2190
#5  0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#6  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#7  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 10 (Thread 0xaa1fd330 (LWP 8581)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e91128) at /usr/include/c++/11/bits/std_mutex.h:169
#5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e91128) at /usr/include/c++/11/condition_variable:201
#6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e91128) at /usr/include/c++/11/condition_variable:111
#7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e91128) at /usr/include/c++/11/condition_variable:163
#8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e91128) at ./tpool/tpool_generic.cc:449
#9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e91128, t=t@entry=0xaa1fcc94) at ./tpool/tpool_generic.cc:502
#10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91128) at ./tpool/tpool_generic.cc:547
#11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 9 (Thread 0xad0fe330 (LWP 8486)):
#0  0xb678b130 in syscall () from /lib/arm-linux-gnueabi/libc.so.6
#1  0x00f0a590 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:347
#2  srw_mutex_impl<false>::wait_and_lock (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:461
#3  0x00f0ad78 in srw_mutex_impl<false>::wr_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:80
#4  ssux_lock_impl<false>::rd_wait (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:523
#5  0x00f0af5c in ssux_lock_impl<false>::rd_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:216
#6  srw_lock_impl<false>::psi_rd_lock (this=0x16a6a00 <dict_sys+64>, file=<optimized out>, line=<optimized out>) at ./storage/innobase/sync/srw_lock.cc:581
#7  0x00f80e98 in srw_lock_impl<false>::rd_lock (line=834, file=<optimized out>, this=<optimized out>) at ./storage/innobase/include/srw_lock.h:483
#8  0x00f8686c in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f8c1f8, mdl=mdl@entry=0x2f3d280) at ./storage/innobase/dict/dict0dict.cc:834
#9  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f3c948, updated_extern=0xad0fdaca, undo_rec=0x2f476a0 "", node=0x2f3d178) at ./storage/innobase/row/row0purge.cc:1028
#10 row_purge (thr=0x2f3c948, undo_rec=0x2f476a0 "", node=0x2f3d178) at ./storage/innobase/row/row0purge.cc:1206
#11 row_purge_step (thr=thr@entry=0x2f3c948) at ./storage/innobase/row/row0purge.cc:1258
#12 0x00eb137c in que_thr_step (thr=0x2f3c948) at ./storage/innobase/que/que0que.cc:653
#13 que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:709
#14 que_run_threads (thr=0x2f3c948) at ./storage/innobase/que/que0que.cc:729
#15 0x00f11bf4 in trx_purge (n_tasks=n_tasks@entry=4, truncate=<optimized out>) at ./storage/innobase/trx/trx0purge.cc:1312
#16 0x00f07cf8 in purge_coordinator_state::do_purge (this=0x1f5c8d8 <purge_state>) at ./storage/innobase/srv/srv0srv.cc:1812
#17 0x00f078d0 in purge_coordinator_callback () at ./storage/innobase/srv/srv0srv.cc:1966
#18 0x00fffccc in tpool::task_group::execute (this=0x1f5cb20 <purge_coordinator_task_group>, t=<optimized out>) at ./tpool/task_group.cc:55
#19 0x00ffeafc in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91328) at ./tpool/tpool_generic.cc:549
#20 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#21 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#22 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 8 (Thread 0xafca7330 (LWP 8489)):
#0  0xb66de008 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
#1  0xb66de168 in sigtimedwait () from /lib/arm-linux-gnueabi/libc.so.6
#2  0x008519fc in my_sigwait (code=<synthetic pointer>, sig=0xafca6b54, set=0xafca6b58) at ./include/my_pthread.h:195
#3  signal_hand (arg=arg@entry=0x0) at ./sql/mysqld.cc:3116
#4  0x00d734c4 in pfs_spawn_thread (arg=0x2fb62f0) at ./storage/perfschema/pfs.cc:2201
#5  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#6  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 7 (Thread 0xb0efe330 (LWP 8479)):
#0  0xb678b130 in syscall () from /lib/arm-linux-gnueabi/libc.so.6
#1  0xb6ecadf0 in ?? () from /usr/lib/arm-linux-gnueabi/liburing.so.2
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 6 (Thread 0xac8fd330 (LWP 8487)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e912a8) at /usr/include/c++/11/bits/std_mutex.h:169
#5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e912a8) at /usr/include/c++/11/condition_variable:201
#6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e912a8) at /usr/include/c++/11/condition_variable:111
#7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e912a8) at /usr/include/c++/11/condition_variable:163
#8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e912a8) at ./tpool/tpool_generic.cc:449
#9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e912a8, t=t@entry=0xac8fcc94) at ./tpool/tpool_generic.cc:502
#10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e912a8) at ./tpool/tpool_generic.cc:547
#11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 5 (Thread 0xad8ff330 (LWP 8485)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e913a8) at /usr/include/c++/11/bits/std_mutex.h:169
#5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e913a8) at /usr/include/c++/11/condition_variable:201
#6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e913a8) at /usr/include/c++/11/condition_variable:111
#7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e913a8) at /usr/include/c++/11/condition_variable:163
#8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e913a8) at ./tpool/tpool_generic.cc:449
#9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e913a8, t=t@entry=0xad8fec94) at ./tpool/tpool_generic.cc:502
#10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e913a8) at ./tpool/tpool_generic.cc:547
#11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 4 (Thread 0xb1cad330 (LWP 8476)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x008167d4 in psi_cond_timedwait (that=0x1f2a4b8 <COND_checkpoint>, mutex=0x1f2a490 <LOCK_checkpoint>, abstime=abstime@entry=0xb1cacc08, file=<optimized out>, line=line@entry=115) at ./mysys/my_thr_init.c:609
#5  0x00d0fc2c in inline_mysql_cond_timedwait (src_file=0x125ed5c "./storage/maria/ma_servicethread.c", src_line=115, abstime=0xb1cacc08, mutex=<optimized out>, that=<optimized out>) at ./include/mysql/psi/mysql_thread.h:1086
#6  my_service_thread_sleep (control=control@entry=0x1692c04 <checkpoint_control>, sleep_time=<optimized out>) at ./storage/maria/ma_servicethread.c:115
#7  0x00d07424 in ma_checkpoint_background (arg=arg@entry=0x1e) at ./storage/maria/ma_checkpoint.c:725
#8  0x00d734c4 in pfs_spawn_thread (arg=0x2e735c8) at ./storage/perfschema/pfs.cc:2201
#9  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#10 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 3 (Thread 0xa99fc330 (LWP 8582)):
#0  ut_hash_ulint (table_size=<optimized out>, key=<optimized out>) at ./storage/innobase/include/ut0rnd.inl:47
#1  hash_table_t::calc_hash (fold=<optimized out>, this=<optimized out>) at ./storage/innobase/include/hash0hash.h:231
#2  dict_sys_t::find_table (id=<optimized out>, this=0x16a69c0 <dict_sys>) at ./storage/innobase/include/dict0dict.h:1470
#3  dict_acquire_mdl_shared<false> (table=<optimized out>, table@entry=0x2e22750, thd=thd@entry=0x2f616f0, mdl=mdl@entry=0x2f2ff38, table_op=table_op@entry=DICT_TABLE_OP_NORMAL) at ./storage/innobase/dict/dict0dict.cc:751
#4  0x00f867a0 in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f616f0, mdl=mdl@entry=0x2f2ff38) at ./storage/innobase/dict/dict0dict.cc:842
#5  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f2fdd8, updated_extern=0xa99fbbc2, undo_rec=0xa8ff4100 "", node=0x2f2fe30) at ./storage/innobase/row/row0purge.cc:1028
#6  row_purge (thr=0x2f2fdd8, undo_rec=0xa8ff4100 "", node=0x2f2fe30) at ./storage/innobase/row/row0purge.cc:1206
#7  row_purge_step (thr=thr@entry=0x2f2fdd8) at ./storage/innobase/row/row0purge.cc:1258
#8  0x00eb137c in que_thr_step (thr=0x2f2fdd8) at ./storage/innobase/que/que0que.cc:653
#9  que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:709
#10 que_run_threads (thr=thr@entry=0x2f2fdd8) at ./storage/innobase/que/que0que.cc:729
#11 0x00f079d4 in srv_task_execute () at ./storage/innobase/srv/srv0srv.cc:1714
#12 purge_worker_callback () at ./storage/innobase/srv/srv0srv.cc:1957
#13 0x00fffccc in tpool::task_group::execute (this=0x1f5cab8 <purge_task_group>, t=<optimized out>) at ./tpool/task_group.cc:55
#14 0x00ffeafc in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e910a8) at ./tpool/tpool_generic.cc:549
#15 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
#16 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#17 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 2 (Thread 0xb28f8330 (LWP 8475)):
#0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
#3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
#4  0x008167d4 in psi_cond_timedwait (that=that@entry=0x1fa1610 <COND_timer>, mutex=mutex@entry=0x1fa15c8 <LOCK_timer>, abstime=0xb28f7c70, file=<optimized out>, line=line@entry=321) at ./mysys/my_thr_init.c:609
#5  0x01065870 in inline_mysql_cond_timedwait (that=0x1fa1610 <COND_timer>, mutex=0x1fa15c8 <LOCK_timer>, src_file=0x12d5a9c "./mysys/thr_timer.c", src_line=321, abstime=0xb28f7c70) at ./include/mysql/psi/mysql_thread.h:1086
#6  timer_handler (arg=arg@entry=0x0) at ./mysys/thr_timer.c:321
#7  0x00d734c4 in pfs_spawn_thread (arg=0x2e0ea80) at ./storage/perfschema/pfs.cc:2201
#8  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
#9  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
 
Thread 1 (Thread 0xaf2fc330 (LWP 8498)):
#0  0xb6aa07a8 in pthread_kill () from /lib/arm-linux-gnueabi/libpthread.so.0
#1  0x00b3ca2c in handle_fatal_signal (sig=<optimized out>) at ./sql/signal_handler.cc:345
#2  <signal handler called>
#3  0xb66dcfd0 in raise () from /lib/arm-linux-gnueabi/libc.so.6
#4  0xb66c5a88 in abort () from /lib/arm-linux-gnueabi/libc.so.6
#5  0x00000000 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Full log at https://buildd.debian.org/status/fetch.php?pkg=mariadb-10.6&arch=armel&ver=1%3A10.6.7-2&stamp=1646206973&raw=0

Issue is sporadic - restart fixed it.



 Comments   
Comment by Marko Mäkelä [ 2022-03-03 ]

In the full log, I see also the following:

mariadb-10.6.7

2022-03-02  7:38:09 188 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/

Based on the stack traces, this hang seems to be directly related to the dict_sys.latch. It is not always the case. Yesterday, I analyzed a hang where the dict_sys.latch holder was waiting for page allocation, which in turn was waiting for dirty pages to be written out and evicted from the buffer pool, when using the smallest possible buffer pool and when tens or hundreds of server processes were run concurrently.

Among the stack traces, we can see the following:

mariadb-10.6.7

Thread 3 (Thread 0xa99fc330 (LWP 8582)):
#0  ut_hash_ulint (table_size=<optimized out>, key=<optimized out>) at ./storage/innobase/include/ut0rnd.inl:47
#1  hash_table_t::calc_hash (fold=<optimized out>, this=<optimized out>) at ./storage/innobase/include/hash0hash.h:231
#2  dict_sys_t::find_table (id=<optimized out>, this=0x16a69c0 <dict_sys>) at ./storage/innobase/include/dict0dict.h:1470
#3  dict_acquire_mdl_shared<false> (table=<optimized out>, table@entry=0x2e22750, thd=thd@entry=0x2f616f0, mdl=mdl@entry=0x2f2ff38, table_op=table_op@entry=DICT_TABLE_OP_NORMAL) at ./storage/innobase/dict/dict0dict.cc:751
#4  0x00f867a0 in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f616f0, mdl=mdl@entry=0x2f2ff38) at ./storage/innobase/dict/dict0dict.cc:842
#5  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f2fdd8, updated_extern=0xa99fbbc2, undo_rec=0xa8ff4100 "", node=0x2f2fe30) at ./storage/innobase/row/row0purge.cc:1028

This thread is holding a shared dict_sys.latch, because in dict_acquire_mdl_shared() we can see the following right before the call:

  dict_sys.freeze(SRW_LOCK_CALL);
  table= dict_sys.find_table(table_id);

Because there exist threads that are waiting for an exclusive latch, further shared latch requests will be blocked until all exclusive latch requests have been served.

It does not look like the server is actually hung; it is apparently only slow.

I am not a friend of this watchdog mechanism. In fact, initially in MDEV-21452 I removed the infamous Long semaphore wait logic and subsequently, due to some demand from colleagues, implemented a replacement watchdog that only covers dict_sys.mutex (which in MDEV-24258 was merged into dict_sys.latch).

I can imagine that there exist underpowered implementations of the ARMv6 or ARMv7 ISA where the watchdog could easily trigger false alarms. Can you try doubling the timeout from the default 10 minutes with something like the following?

./mtr --mysqld=--loose-innodb-fatal-semaphore-wait-threshold=1200 …

Or try to specify a smaller number of purge tasks instead of the default 4?

./mtr --mysqld=--loose-innodb-purge-threads=1 …

The 10.6 release has only been out there in the wild for a relatively short time. During this time, the watchdog has caught some InnoDB internal hangs (such as MDEV-27414) and a Linux kernel bug (MDEV-26674). I would say that crashes due to false alarms like this have been in the minority.

I think that if we don’t receive many genuine hang reports due to InnoDB bugs in the coming few years, this watchdog mechanism could be removed, or perhaps better replaced with something else, such as sd_watchdog_enabled(3) and related systemd functions.

Comment by Otto Kekäläinen [ 2022-03-06 ]

Note that this issue is sporadic. To verify the proposed fix I would need to build armel many times on Debian builders which currently is inconvenient for me. Based on your assessment the slowness is the cause to sporadic failure, which makes sense and explains why we never see this on Launchpad builders.

Comment by Marko Mäkelä [ 2022-03-09 ]

otto, that is correct. I would suggest that you lower the number of purge threads for this builder, and see if the issue remains. If you did that, we would get a confirmation after some reasonably large number of test runs, in some months or years.

Can you try to check from the log if the test really hung for at least 10 minutes? Unfortunately the fail line never includes the execution time, like the pass lines do. Theoretically, the current watchdog could deliver some false alarms, and I do not think that we can fix that easily, because I want to keep the underlying data structure and logic as simple as possible.

Even if MariaDB implemented an interface to the Systemd watchdog, we might have to retain the current code as a fallback when that watchdog is not enabled at runtime. I do not think that mtr would ever start the server with the Systemd watchdog enabled. So, it would not help here.

Comment by Marko Mäkelä [ 2022-03-11 ]

mleich provided a core dump of a 10.9 shutdown hang that could explain this.

In that hang, we have the following:

  • One thread is waiting for an exclusive dict_sys.latch. That will block all future shared latch acquisitions in other threads (dict_sys_t::freeze()).
  • Thread 3 (a purge task) is holding a shared dict_sys.latch (and thus blocking all the above mentioned threads) and waiting to acquire a write lock on a write-fixed page (which would have to be written out first).
  • The buf_flush_page_cleaner() thread is holding buf_pool.mutex and buf_pool.flush_list_mutex and waiting for a slot. This may block the write completion that Thread 3 is waiting for, as well as block any other threads that might hold a shared dict_sys.latch while invoking buf_page_get_gen(). Stack trace:

    #3  0x00007f02080d1e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
    #4  0x0000558424afec25 in tpool::cache<tpool::aiocb>::get (this=0x6120000238c0, blocking=true) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/tpool/tpool_structs.h:85
    #5  0x0000558424afdb97 in io_slots::acquire (this=0x6120000238c0) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/os/os0file.cc:96
    #6  0x0000558424afa5fe in os_aio (type=@0x7f01eb30a750: {bpage = 0x0, slot = 0x0, node = 0x606000000e00, type = IORequest::WRITE_ASYNC}, buf=0x631000015000, offset=573440, n=8192) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/os/os0file.cc:3855
    #7  0x0000558424f9272e in fil_space_t::io (this=0x612000023d40, type=@0x7f01eb30a910: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::WRITE_ASYNC}, offset=573440, len=8192, buf=0x631000015000, bpage=0x0) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/fil/fil0fil.cc:2813
    #8  0x0000558424ea69c4 in buf_flush_freed_pages (space=0x612000023d40) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1073
    #9  0x0000558424ea7d21 in buf_flush_space (id=1) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1206
    #10 0x0000558424ea9c0a in buf_do_flush_list_batch (max_n=2000, lsn=23743922) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1433
    #11 0x0000558424eaa73d in buf_flush_list (max_n=2000, lsn=23743922) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1530
    #12 0x0000558424eaf41f in buf_flush_page_cleaner () at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:2448
    

This looks like a genuine deadlock of threads to me.

Generally, no mutex should be held across any I/O calls. An exception might be the log_sys.mutex for some redo log operations, but it definitely is blatantly wrong to hold buf_pool.mutex or buf_pool.flush_list_mutex while submitting some I/O.

Apparently reproducing this requires a very overloaded system so that we can actually run out of the I/O slots.

The test main.parser_bug21114_innodb is creating and dropping very many tables, so the MDEV-15528 logic (buf_flush_freed_pages()) should definitely be able to kick in there.

Comment by Marko Mäkelä [ 2022-03-14 ]

thiru, please review. I also tried protecting buf_flush_freed_pages() with fil_space_t::latch so that we could release space->freed_range_mutex before the end of the function, but it caused hangs of the test encryption.innochecksum.

Generated at Thu Feb 08 09:57:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.