MariaDB Server
MDEV-27985

buf_flush_freed_pages() causes InnoDB to hang

Details

    Description

      After uploading 1:10.6.7-2 to Debian, I saw this test failure on the armel builder:

      main.parser_bug21114_innodb 'innodb'     w4 [ fail ]
              Test ended at 2022-03-02 07:38:10
       
      CURRENT_TEST: main.parser_bug21114_innodb
      mysqltest: In included file "./include/parser_bug21114.inc": 
      included from /<<PKGBUILDDIR>>/mysql-test/main/parser_bug21114_innodb.test at line 244:
      At line 36: query 'CREATE TABLE bug21114_child(
      pk int not null,
      fk_col1 int not null,
      fk_col2 int not null,
      fk_col3 int not null,
      fk_col4 int not null,
      CONSTRAINT fk_fct FOREIGN KEY (fk_col1, fk_col2)
      REFERENCES $FCT(col1, col2),
      CONSTRAINT fk_fct_space FOREIGN KEY (fk_col3, fk_col4)
      REFERENCES $FCT (col1, col2)
      ) ENGINE $engine_type' failed: <Unknown> (2013): Lost connection to server during query
       
      The result from queries just before the failure was:
      < snip >
      DROP TABLE bug21114_child;
      DROP TABLE ltrim;
      drop table if exists makedate;
      drop table if exists bug21114_child;
      DROP TABLE bug21114_child;
      DROP TABLE makedate;
      drop table if exists maketime;
      drop table if exists bug21114_child;
      DROP TABLE bug21114_child;
      DROP TABLE maketime;
      drop table if exists make_set;
      drop table if exists bug21114_child;
      DROP TABLE bug21114_child;
      DROP TABLE make_set;
      drop table if exists master_pos_wait;
      drop table if exists bug21114_child;
      DROP TABLE bug21114_child;
      DROP TABLE master_pos_wait;
      drop table if exists mbrcontains;
      drop table if exists bug21114_child;
       
      More results from queries before failure can be found in /<<PKGBUILDDIR>>/builddir/mysql-test/var/4/log/parser_bug21114_innodb.log
       
       
      Server [mysqld.1 - pid: 8467, winpid: 8467, exit: 256] failed during test run
      Server log from this test:
      ----------SERVER LOG START-----------
      2022-03-02  7:38:09 188 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
      220302  7:38:09 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.6.7-MariaDB-2-log
      key_buffer_size=1048576
      read_buffer_size=131072
      max_used_connections=3
      max_threads=153
      thread_count=3
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 62381 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0xafb01b20
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0xaf2fbc5c thread_stack 0x49000
       
      Trying to get some variables.
      Some pointers may be invalid and cause the dump to abort.
      Query (0xafb11de8): CREATE TABLE bug21114_child(
      pk int not null,
      fk_col1 int not null,
      fk_col2 int not null,
      fk_col3 int not null,
      fk_col4 int not null,
      CONSTRAINT fk_fct FOREIGN KEY (fk_col1, fk_col2)
      REFERENCES mbrcontains(col1, col2),
      CONSTRAINT fk_fct_space FOREIGN KEY (fk_col3, fk_col4)
      REFERENCES mbrcontains (col1, col2)
      ) ENGINE InnoDb
       
      Connection ID (thread ID): 188
      Status: NOT_KILLED
       
      Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off
       
      The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /<<PKGBUILDDIR>>/builddir/mysql-test/var/4/mysqld.1/data
      Resource Limits:
      Limit                     Soft Limit           Hard Limit           Units     
      Max cpu time              unlimited            unlimited            seconds   
      Max file size             unlimited            unlimited            bytes     
      Max data size             unlimited            unlimited            bytes     
      Max stack size            8388608              unlimited            bytes     
      Max core file size        unlimited            unlimited            bytes     
      Max resident set          unlimited            unlimited            bytes     
      Max processes             96403                96403                processes 
      Max open files            1024                 1024                 files     
      Max locked memory         1584997376           1584997376           bytes     
      Max address space         unlimited            unlimited            bytes     
      Max file locks            unlimited            unlimited            locks     
      Max pending signals       96403                96403                signals   
      Max msgqueue size         819200               819200               bytes     
      Max nice priority         0                    0                    
      Max realtime priority     0                    0                    
      Max realtime timeout      unlimited            unlimited            us        
      Core pattern: core
       
      ----------SERVER LOG END-------------
       
       
       - found 'core' (0/5)
       
      Trying 'dbx' to get a backtrace
       
      Trying 'gdb' to get a backtrace from coredump /<<PKGBUILDDIR>>/builddir/mysql-test/var/4/log/main.parser_bug21114_innodb-innodb/mysqld.1/data/core
      Core generated by '/<<PKGBUILDDIR>>/builddir/sql/mariadbd'
      Output from gdb follows. The first stack trace is from the failing thread.
      The following stack traces are from all threads (so the failing one is
      duplicated).
      --------------------------
       
      warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing
       
      warning: Can't open file anon_inode:[io_uring] which was expanded to anon_inode:[io_uring] during file-backed mapping note processing
      [New LWP 8498]
      [New LWP 8475]
      [New LWP 8582]
      [New LWP 8476]
      [New LWP 8485]
      [New LWP 8487]
      [New LWP 8479]
      [New LWP 8489]
      [New LWP 8486]
      [New LWP 8581]
      [New LWP 8480]
      [New LWP 8900]
      [New LWP 8468]
      [New LWP 8482]
      [New LWP 8929]
      [New LWP 8495]
      [New LWP 8483]
      [New LWP 8580]
      [New LWP 8579]
      [Thread debugging using libthread_db enabled]
      Using host libthread_db library "/lib/arm-linux-gnueabi/libthread_db.so.1".
      Core was generated by `/<<PKGBUILDDIR>>/builddir/sql/mariadbd --defaults'.
      Program terminated with signal SIGABRT, Aborted.
      #0  0xb6aa07a8 in pthread_kill () from /lib/arm-linux-gnueabi/libpthread.so.0
      [Current thread is 1 (Thread 0xaf2fc330 (LWP 8498))]
      #0  0xb6aa07a8 in pthread_kill () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0x00b3ca2c in handle_fatal_signal (sig=<optimized out>) at ./sql/signal_handler.cc:345
      #2  <signal handler called>
      #3  0xb66dcfd0 in raise () from /lib/arm-linux-gnueabi/libc.so.6
      #4  0xb66c5a88 in abort () from /lib/arm-linux-gnueabi/libc.so.6
      #5  0x00000000 in ?? ()
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 19 (Thread 0xab1ff330 (LWP 8579)):
      #0  0xb678b130 in syscall () from /lib/arm-linux-gnueabi/libc.so.6
      #1  0x00f0a590 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:347
      #2  srw_mutex_impl<false>::wait_and_lock (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:461
      #3  0x00f0ad78 in srw_mutex_impl<false>::wr_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:80
      #4  ssux_lock_impl<false>::rd_wait (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:523
      #5  0x00f0af5c in ssux_lock_impl<false>::rd_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:216
      #6  srw_lock_impl<false>::psi_rd_lock (this=0x16a6a00 <dict_sys+64>, file=<optimized out>, line=<optimized out>) at ./storage/innobase/sync/srw_lock.cc:581
      #7  0x00f80e98 in srw_lock_impl<false>::rd_lock (line=834, file=<optimized out>, this=<optimized out>) at ./storage/innobase/include/srw_lock.h:483
      #8  0x00f8686c in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f7e178, mdl=mdl@entry=0x2f3c768) at ./storage/innobase/dict/dict0dict.cc:834
      #9  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f3c608, updated_extern=0xab1febc2, undo_rec=0xadaf9850 "", node=0x2f3c660) at ./storage/innobase/row/row0purge.cc:1028
      #10 row_purge (thr=0x2f3c608, undo_rec=0xadaf9850 "", node=0x2f3c660) at ./storage/innobase/row/row0purge.cc:1206
      #11 row_purge_step (thr=thr@entry=0x2f3c608) at ./storage/innobase/row/row0purge.cc:1258
      #12 0x00eb137c in que_thr_step (thr=0x2f3c608) at ./storage/innobase/que/que0que.cc:653
      #13 que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:709
      #14 que_run_threads (thr=thr@entry=0x2f3c608) at ./storage/innobase/que/que0que.cc:729
      #15 0x00f079d4 in srv_task_execute () at ./storage/innobase/srv/srv0srv.cc:1714
      #16 purge_worker_callback () at ./storage/innobase/srv/srv0srv.cc:1957
      #17 0x00fffccc in tpool::task_group::execute (this=0x1f5cab8 <purge_task_group>, t=<optimized out>) at ./tpool/task_group.cc:55
      #18 0x00ffeafc in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91228) at ./tpool/tpool_generic.cc:549
      #19 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #20 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #21 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 18 (Thread 0xaa9fe330 (LWP 8580)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e911a8) at /usr/include/c++/11/bits/std_mutex.h:169
      #5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e911a8) at /usr/include/c++/11/condition_variable:201
      #6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e911a8) at /usr/include/c++/11/condition_variable:111
      #7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e911a8) at /usr/include/c++/11/condition_variable:163
      #8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e911a8) at ./tpool/tpool_generic.cc:449
      #9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e911a8, t=t@entry=0xaa9fdc94) at ./tpool/tpool_generic.cc:502
      #10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e911a8) at ./tpool/tpool_generic.cc:547
      #11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 17 (Thread 0xae4f4330 (LWP 8483)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e91428) at /usr/include/c++/11/bits/std_mutex.h:169
      #5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e91428) at /usr/include/c++/11/condition_variable:201
      #6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e91428) at /usr/include/c++/11/condition_variable:111
      #7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e91428) at /usr/include/c++/11/condition_variable:163
      #8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e91428) at ./tpool/tpool_generic.cc:449
      #9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e91428, t=t@entry=0xae4f3c94) at ./tpool/tpool_generic.cc:502
      #10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91428) at ./tpool/tpool_generic.cc:547
      #11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 16 (Thread 0xafc5c330 (LWP 8495)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9eadc in pthread_cond_wait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0x0081673c in psi_cond_wait (that=0x170bcd8 <COND_manager>, mutex=0x170bd10 <LOCK_manager>, file=<optimized out>, line=<optimized out>) at ./mysys/my_thr_init.c:596
      #4  0x009115f4 in inline_mysql_cond_wait (that=<optimized out>, mutex=<optimized out>, src_file=<optimized out>, src_line=<optimized out>) at ./include/mysql/psi/mysql_thread.h:1070
      #5  handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:103
      #6  0x00d734c4 in pfs_spawn_thread (arg=0x2fb2748) at ./storage/perfschema/pfs.cc:2201
      #7  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #8  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 15 (Thread 0xadca8330 (LWP 8929)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x008167d4 in psi_cond_timedwait (that=0x170a660 <thread_cache>, mutex=0x170a6d0 <thread_cache+112>, abstime=0xadca7c20, file=<optimized out>, line=176) at ./mysys/my_thr_init.c:609
      #5  0x00a1dd94 in inline_mysql_cond_timedwait (src_file=0x10fda00 "./sql/thread_cache.h", src_line=176, abstime=0xadca7c20, mutex=<optimized out>, that=0x170a660 <thread_cache>) at ./include/mysql/psi/mysql_thread.h:1086
      #6  Thread_cache::park (this=0x170a660 <thread_cache>) at ./sql/thread_cache.h:176
      #7  0x00a1d7a4 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x2fcd128, put_in_cache=put_in_cache@entry=true) at ./sql/sql_connect.cc:1431
      #8  0x00a1db40 in handle_one_connection (arg=arg@entry=0x2fcd128) at ./sql/sql_connect.cc:1312
      #9  0x00d734c4 in pfs_spawn_thread (arg=0x3061048) at ./storage/perfschema/pfs.cc:2201
      #10 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #11 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 14 (Thread 0xaecf5330 (LWP 8482)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e914a8) at /usr/include/c++/11/bits/std_mutex.h:169
      #5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e914a8) at /usr/include/c++/11/condition_variable:201
      #6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e914a8) at /usr/include/c++/11/condition_variable:111
      #7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e914a8) at /usr/include/c++/11/condition_variable:163
      #8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e914a8) at ./tpool/tpool_generic.cc:449
      #9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e914a8, t=t@entry=0xaecf4c94) at ./tpool/tpool_generic.cc:502
      #10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e914a8) at ./tpool/tpool_generic.cc:547
      #11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 13 (Thread 0xb6ff1880 (LWP 8468)):
      #0  0xb6782fe0 in poll () from /lib/arm-linux-gnueabi/libc.so.6
      #1  0x0085af4c in poll (__timeout=-1, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/arm-linux-gnueabi/bits/poll2.h:47
      #2  handle_connections_sockets () at ./sql/mysqld.cc:6118
      #3  0x0085c258 in mysqld_main (argc=<optimized out>, argv=<optimized out>) at ./sql/mysqld.cc:5823
      #4  0xb66c5e24 in __libc_start_main () from /lib/arm-linux-gnueabi/libc.so.6
      #5  0x0084f1a8 in _start () at ./sql/my_decimal.h:147
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 12 (Thread 0xadcf3330 (LWP 8900)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x008167d4 in psi_cond_timedwait (that=0x170a660 <thread_cache>, mutex=0x170a6d0 <thread_cache+112>, abstime=0xadcf2c20, file=<optimized out>, line=176) at ./mysys/my_thr_init.c:609
      #5  0x00a1dd94 in inline_mysql_cond_timedwait (src_file=0x10fda00 "./sql/thread_cache.h", src_line=176, abstime=0xadcf2c20, mutex=<optimized out>, that=0x170a660 <thread_cache>) at ./include/mysql/psi/mysql_thread.h:1086
      #6  Thread_cache::park (this=0x170a660 <thread_cache>) at ./sql/thread_cache.h:176
      #7  0x00a1d7a4 in do_handle_one_connection (connect=<optimized out>, connect@entry=0x3060c60, put_in_cache=put_in_cache@entry=true) at ./sql/sql_connect.cc:1431
      #8  0x00a1db40 in handle_one_connection (arg=arg@entry=0x3060c60) at ./sql/sql_connect.cc:1312
      #9  0x00d734c4 in pfs_spawn_thread (arg=0x3061048) at ./storage/perfschema/pfs.cc:2201
      #10 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #11 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 11 (Thread 0xafafd330 (LWP 8480)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x00f70e50 in buf_flush_page_cleaner () at ./storage/innobase/buf/buf0flu.cc:2190
      #5  0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #6  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #7  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 10 (Thread 0xaa1fd330 (LWP 8581)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e91128) at /usr/include/c++/11/bits/std_mutex.h:169
      #5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e91128) at /usr/include/c++/11/condition_variable:201
      #6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e91128) at /usr/include/c++/11/condition_variable:111
      #7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e91128) at /usr/include/c++/11/condition_variable:163
      #8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e91128) at ./tpool/tpool_generic.cc:449
      #9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e91128, t=t@entry=0xaa1fcc94) at ./tpool/tpool_generic.cc:502
      #10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91128) at ./tpool/tpool_generic.cc:547
      #11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 9 (Thread 0xad0fe330 (LWP 8486)):
      #0  0xb678b130 in syscall () from /lib/arm-linux-gnueabi/libc.so.6
      #1  0x00f0a590 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:347
      #2  srw_mutex_impl<false>::wait_and_lock (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:461
      #3  0x00f0ad78 in srw_mutex_impl<false>::wr_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:80
      #4  ssux_lock_impl<false>::rd_wait (this=this@entry=0x16a6a04 <dict_sys+68>) at ./storage/innobase/sync/srw_lock.cc:523
      #5  0x00f0af5c in ssux_lock_impl<false>::rd_lock (this=0x16a6a04 <dict_sys+68>) at ./storage/innobase/include/srw_lock.h:216
      #6  srw_lock_impl<false>::psi_rd_lock (this=0x16a6a00 <dict_sys+64>, file=<optimized out>, line=<optimized out>) at ./storage/innobase/sync/srw_lock.cc:581
      #7  0x00f80e98 in srw_lock_impl<false>::rd_lock (line=834, file=<optimized out>, this=<optimized out>) at ./storage/innobase/include/srw_lock.h:483
      #8  0x00f8686c in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f8c1f8, mdl=mdl@entry=0x2f3d280) at ./storage/innobase/dict/dict0dict.cc:834
      #9  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f3c948, updated_extern=0xad0fdaca, undo_rec=0x2f476a0 "", node=0x2f3d178) at ./storage/innobase/row/row0purge.cc:1028
      #10 row_purge (thr=0x2f3c948, undo_rec=0x2f476a0 "", node=0x2f3d178) at ./storage/innobase/row/row0purge.cc:1206
      #11 row_purge_step (thr=thr@entry=0x2f3c948) at ./storage/innobase/row/row0purge.cc:1258
      #12 0x00eb137c in que_thr_step (thr=0x2f3c948) at ./storage/innobase/que/que0que.cc:653
      #13 que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:709
      #14 que_run_threads (thr=0x2f3c948) at ./storage/innobase/que/que0que.cc:729
      #15 0x00f11bf4 in trx_purge (n_tasks=n_tasks@entry=4, truncate=<optimized out>) at ./storage/innobase/trx/trx0purge.cc:1312
      #16 0x00f07cf8 in purge_coordinator_state::do_purge (this=0x1f5c8d8 <purge_state>) at ./storage/innobase/srv/srv0srv.cc:1812
      #17 0x00f078d0 in purge_coordinator_callback () at ./storage/innobase/srv/srv0srv.cc:1966
      #18 0x00fffccc in tpool::task_group::execute (this=0x1f5cb20 <purge_coordinator_task_group>, t=<optimized out>) at ./tpool/task_group.cc:55
      #19 0x00ffeafc in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e91328) at ./tpool/tpool_generic.cc:549
      #20 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #21 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #22 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 8 (Thread 0xafca7330 (LWP 8489)):
      #0  0xb66de008 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      #1  0xb66de168 in sigtimedwait () from /lib/arm-linux-gnueabi/libc.so.6
      #2  0x008519fc in my_sigwait (code=<synthetic pointer>, sig=0xafca6b54, set=0xafca6b58) at ./include/my_pthread.h:195
      #3  signal_hand (arg=arg@entry=0x0) at ./sql/mysqld.cc:3116
      #4  0x00d734c4 in pfs_spawn_thread (arg=0x2fb62f0) at ./storage/perfschema/pfs.cc:2201
      #5  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #6  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 7 (Thread 0xb0efe330 (LWP 8479)):
      #0  0xb678b130 in syscall () from /lib/arm-linux-gnueabi/libc.so.6
      #1  0xb6ecadf0 in ?? () from /usr/lib/arm-linux-gnueabi/liburing.so.2
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 6 (Thread 0xac8fd330 (LWP 8487)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e912a8) at /usr/include/c++/11/bits/std_mutex.h:169
      #5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e912a8) at /usr/include/c++/11/condition_variable:201
      #6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e912a8) at /usr/include/c++/11/condition_variable:111
      #7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e912a8) at /usr/include/c++/11/condition_variable:163
      #8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e912a8) at ./tpool/tpool_generic.cc:449
      #9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e912a8, t=t@entry=0xac8fcc94) at ./tpool/tpool_generic.cc:502
      #10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e912a8) at ./tpool/tpool_generic.cc:547
      #11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 5 (Thread 0xad8ff330 (LWP 8485)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9e6d8 in __GI___pthread_cond_clockwait64.part.0 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f228 in pthread_cond_clockwait () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x00ffcfa0 in std::__condvar::wait_until (__abs_time=..., __clock=1, __m=..., this=0x2e913a8) at /usr/include/c++/11/bits/std_mutex.h:169
      #5  std::condition_variable::__wait_until_impl<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__lock=..., __lock=..., __atime=..., this=0x2e913a8) at /usr/include/c++/11/condition_variable:201
      #6  std::condition_variable::wait_until<std::chrono::duration<long long, std::ratio<1ll, 1000000000ll> > > (__atime=..., __lock=..., this=0x2e913a8) at /usr/include/c++/11/condition_variable:111
      #7  std::condition_variable::wait_for<long long, std::ratio<1ll, 1000ll> > (__rtime=..., __lock=..., this=0x2e913a8) at /usr/include/c++/11/condition_variable:163
      #8  tpool::thread_pool_generic::wait_for_tasks (this=this@entry=0x2e818d0, lk=..., thread_data=thread_data@entry=0x2e913a8) at ./tpool/tpool_generic.cc:449
      #9  0x00ffd610 in tpool::thread_pool_generic::get_task (this=this@entry=0x2e818d0, thread_var=thread_var@entry=0x2e913a8, t=t@entry=0xad8fec94) at ./tpool/tpool_generic.cc:502
      #10 0x00ffeb0c in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e913a8) at ./tpool/tpool_generic.cc:547
      #11 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #12 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #13 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 4 (Thread 0xb1cad330 (LWP 8476)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x008167d4 in psi_cond_timedwait (that=0x1f2a4b8 <COND_checkpoint>, mutex=0x1f2a490 <LOCK_checkpoint>, abstime=abstime@entry=0xb1cacc08, file=<optimized out>, line=line@entry=115) at ./mysys/my_thr_init.c:609
      #5  0x00d0fc2c in inline_mysql_cond_timedwait (src_file=0x125ed5c "./storage/maria/ma_servicethread.c", src_line=115, abstime=0xb1cacc08, mutex=<optimized out>, that=<optimized out>) at ./include/mysql/psi/mysql_thread.h:1086
      #6  my_service_thread_sleep (control=control@entry=0x1692c04 <checkpoint_control>, sleep_time=<optimized out>) at ./storage/maria/ma_servicethread.c:115
      #7  0x00d07424 in ma_checkpoint_background (arg=arg@entry=0x1e) at ./storage/maria/ma_checkpoint.c:725
      #8  0x00d734c4 in pfs_spawn_thread (arg=0x2e735c8) at ./storage/perfschema/pfs.cc:2201
      #9  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #10 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 3 (Thread 0xa99fc330 (LWP 8582)):
      #0  ut_hash_ulint (table_size=<optimized out>, key=<optimized out>) at ./storage/innobase/include/ut0rnd.inl:47
      #1  hash_table_t::calc_hash (fold=<optimized out>, this=<optimized out>) at ./storage/innobase/include/hash0hash.h:231
      #2  dict_sys_t::find_table (id=<optimized out>, this=0x16a69c0 <dict_sys>) at ./storage/innobase/include/dict0dict.h:1470
      #3  dict_acquire_mdl_shared<false> (table=<optimized out>, table@entry=0x2e22750, thd=thd@entry=0x2f616f0, mdl=mdl@entry=0x2f2ff38, table_op=table_op@entry=DICT_TABLE_OP_NORMAL) at ./storage/innobase/dict/dict0dict.cc:751
      #4  0x00f867a0 in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f616f0, mdl=mdl@entry=0x2f2ff38) at ./storage/innobase/dict/dict0dict.cc:842
      #5  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f2fdd8, updated_extern=0xa99fbbc2, undo_rec=0xa8ff4100 "", node=0x2f2fe30) at ./storage/innobase/row/row0purge.cc:1028
      #6  row_purge (thr=0x2f2fdd8, undo_rec=0xa8ff4100 "", node=0x2f2fe30) at ./storage/innobase/row/row0purge.cc:1206
      #7  row_purge_step (thr=thr@entry=0x2f2fdd8) at ./storage/innobase/row/row0purge.cc:1258
      #8  0x00eb137c in que_thr_step (thr=0x2f2fdd8) at ./storage/innobase/que/que0que.cc:653
      #9  que_run_threads_low (thr=<optimized out>) at ./storage/innobase/que/que0que.cc:709
      #10 que_run_threads (thr=thr@entry=0x2f2fdd8) at ./storage/innobase/que/que0que.cc:729
      #11 0x00f079d4 in srv_task_execute () at ./storage/innobase/srv/srv0srv.cc:1714
      #12 purge_worker_callback () at ./storage/innobase/srv/srv0srv.cc:1957
      #13 0x00fffccc in tpool::task_group::execute (this=0x1f5cab8 <purge_task_group>, t=<optimized out>) at ./tpool/task_group.cc:55
      #14 0x00ffeafc in tpool::thread_pool_generic::worker_main (this=0x2e818d0, thread_var=0x2e910a8) at ./tpool/tpool_generic.cc:549
      #15 0xb6961798 in ?? () from /usr/lib/arm-linux-gnueabi/libstdc++.so.6
      #16 0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #17 0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 2 (Thread 0xb28f8330 (LWP 8475)):
      #0  0xb6aa68a8 in __futex_abstimed_wait_common64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0xb6aa6960 in __futex_abstimed_wait_cancelable64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #2  0xb6a9ef20 in __pthread_cond_timedwait64 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #3  0xb6a9f16c in pthread_cond_timedwait@@GLIBC_2.4 () from /lib/arm-linux-gnueabi/libpthread.so.0
      #4  0x008167d4 in psi_cond_timedwait (that=that@entry=0x1fa1610 <COND_timer>, mutex=mutex@entry=0x1fa15c8 <LOCK_timer>, abstime=0xb28f7c70, file=<optimized out>, line=line@entry=321) at ./mysys/my_thr_init.c:609
      #5  0x01065870 in inline_mysql_cond_timedwait (that=0x1fa1610 <COND_timer>, mutex=0x1fa15c8 <LOCK_timer>, src_file=0x12d5a9c "./mysys/thr_timer.c", src_line=321, abstime=0xb28f7c70) at ./include/mysql/psi/mysql_thread.h:1086
      #6  timer_handler (arg=arg@entry=0x0) at ./mysys/thr_timer.c:321
      #7  0x00d734c4 in pfs_spawn_thread (arg=0x2e0ea80) at ./storage/perfschema/pfs.cc:2201
      #8  0xb6a95e58 in start_thread () from /lib/arm-linux-gnueabi/libpthread.so.0
      #9  0xb678ef98 in ?? () from /lib/arm-linux-gnueabi/libc.so.6
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
       
      Thread 1 (Thread 0xaf2fc330 (LWP 8498)):
      #0  0xb6aa07a8 in pthread_kill () from /lib/arm-linux-gnueabi/libpthread.so.0
      #1  0x00b3ca2c in handle_fatal_signal (sig=<optimized out>) at ./sql/signal_handler.cc:345
      #2  <signal handler called>
      #3  0xb66dcfd0 in raise () from /lib/arm-linux-gnueabi/libc.so.6
      #4  0xb66c5a88 in abort () from /lib/arm-linux-gnueabi/libc.so.6
      #5  0x00000000 in ?? ()
      Backtrace stopped: previous frame identical to this frame (corrupt stack?)
      

      Full log at https://buildd.debian.org/status/fetch.php?pkg=mariadb-10.6&arch=armel&ver=1%3A10.6.7-2&stamp=1646206973&raw=0

      The issue is sporadic; a restart fixed it.

          Activity

            otto Otto Kekäläinen created issue -

            In the full log, I also see the following:

            mariadb-10.6.7

            2022-03-02  7:38:09 188 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
            

            Based on the stack traces, this hang seems to be directly related to the dict_sys.latch. That is not always the case: yesterday, I analyzed a hang where the dict_sys.latch holder was waiting for page allocation, which in turn was waiting for dirty pages to be written out and evicted from the buffer pool, with the smallest possible buffer pool and tens or hundreds of server processes running concurrently.

            Among the stack traces, we can see the following:

            mariadb-10.6.7

            Thread 3 (Thread 0xa99fc330 (LWP 8582)):
            #0  ut_hash_ulint (table_size=<optimized out>, key=<optimized out>) at ./storage/innobase/include/ut0rnd.inl:47
            #1  hash_table_t::calc_hash (fold=<optimized out>, this=<optimized out>) at ./storage/innobase/include/hash0hash.h:231
            #2  dict_sys_t::find_table (id=<optimized out>, this=0x16a69c0 <dict_sys>) at ./storage/innobase/include/dict0dict.h:1470
            #3  dict_acquire_mdl_shared<false> (table=<optimized out>, table@entry=0x2e22750, thd=thd@entry=0x2f616f0, mdl=mdl@entry=0x2f2ff38, table_op=table_op@entry=DICT_TABLE_OP_NORMAL) at ./storage/innobase/dict/dict0dict.cc:751
            #4  0x00f867a0 in dict_table_open_on_id (table_id=<optimized out>, dict_locked=dict_locked@entry=false, table_op=table_op@entry=DICT_TABLE_OP_NORMAL, thd=0x2f616f0, mdl=mdl@entry=0x2f2ff38) at ./storage/innobase/dict/dict0dict.cc:842
            #5  0x00ee94ec in row_purge_parse_undo_rec (thr=0x2f2fdd8, updated_extern=0xa99fbbc2, undo_rec=0xa8ff4100 "", node=0x2f2fe30) at ./storage/innobase/row/row0purge.cc:1028
            

            This thread is holding a shared dict_sys.latch, because in dict_acquire_mdl_shared() we can see the following right before the call:

              dict_sys.freeze(SRW_LOCK_CALL);
              table= dict_sys.find_table(table_id);
            

            Because there exist threads that are waiting for an exclusive latch, further shared latch requests will be blocked until all exclusive latch requests have been served.

            It does not look like the server is actually hung; it is apparently only slow.

            I am not a friend of this watchdog mechanism. In fact, initially in MDEV-21452 I removed the infamous Long semaphore wait logic and subsequently, due to some demand from colleagues, implemented a replacement watchdog that only covers dict_sys.mutex (which in MDEV-24258 was merged into dict_sys.latch).

            I can imagine that there exist underpowered implementations of the ARMv6 or ARMv7 ISA where the watchdog could easily trigger false alarms. Can you try doubling the timeout from the default 10 minutes with something like the following?

            ./mtr --mysqld=--loose-innodb-fatal-semaphore-wait-threshold=1200 …
            

            Or try to specify a smaller number of purge tasks instead of the default 4?

            ./mtr --mysqld=--loose-innodb-purge-threads=1 …
            

            The 10.6 release has only been out there in the wild for a relatively short time. During this time, the watchdog has caught some InnoDB internal hangs (such as MDEV-27414) and a Linux kernel bug (MDEV-26674). I would say that crashes due to false alarms like this have been in the minority.

            I think that if we don’t receive many genuine hang reports due to InnoDB bugs in the coming few years, this watchdog mechanism could be removed, or perhaps better replaced with something else, such as sd_watchdog_enabled(3) and related systemd functions.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Component/s: Storage Engine - InnoDB
            Fix Version/s: 10.6
            Affects Version/s: 10.6.7
            Assignee: Marko Mäkelä
            Labels: crash performance
            danblack Daniel Black made changes -
            Status: Open → Needs Feedback

            Note that this issue is sporadic. To verify the proposed fix I would need to build armel many times on Debian builders, which is currently inconvenient for me. Based on your assessment, the slowness is the cause of the sporadic failure, which makes sense and explains why we never see this on Launchpad builders.

            otto Otto Kekäläinen added a comment -

            otto, that is correct. I would suggest that you lower the number of purge threads for this builder, and see if the issue remains. If you did that, we would get a confirmation after some reasonably large number of test runs, in some months or years.

            Can you try to check from the log if the test really hung for at least 10 minutes? Unfortunately the fail line never includes the execution time, like the pass lines do. Theoretically, the current watchdog could deliver some false alarms, and I do not think that we can fix that easily, because I want to keep the underlying data structure and logic as simple as possible.

            Even if MariaDB implemented an interface to the systemd watchdog, we might have to retain the current code as a fallback for when that watchdog is not enabled at runtime. I do not think that mtr would ever start the server with the systemd watchdog enabled, so it would not help here.

            marko Marko Mäkelä added a comment -

            mleich provided a core dump of a 10.9 shutdown hang that could explain this.

            In that hang, we have the following:

            • One thread is waiting for an exclusive dict_sys.latch. That will block all future shared latch acquisitions in other threads (dict_sys_t::freeze()).
            • Thread 3 (a purge task) is holding a shared dict_sys.latch (and thus blocking all the above mentioned threads) and waiting to acquire a write lock on a write-fixed page (which would have to be written out first).
            • The buf_flush_page_cleaner() thread is holding buf_pool.mutex and buf_pool.flush_list_mutex and waiting for an I/O slot. This may block the write completion that Thread 3 is waiting for, as well as block any other threads that might hold a shared dict_sys.latch while invoking buf_page_get_gen(). Stack trace:

              #3  0x00007f02080d1e30 in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /lib/x86_64-linux-gnu/libstdc++.so.6
              #4  0x0000558424afec25 in tpool::cache<tpool::aiocb>::get (this=0x6120000238c0, blocking=true) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/tpool/tpool_structs.h:85
              #5  0x0000558424afdb97 in io_slots::acquire (this=0x6120000238c0) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/os/os0file.cc:96
              #6  0x0000558424afa5fe in os_aio (type=@0x7f01eb30a750: {bpage = 0x0, slot = 0x0, node = 0x606000000e00, type = IORequest::WRITE_ASYNC}, buf=0x631000015000, offset=573440, n=8192) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/os/os0file.cc:3855
              #7  0x0000558424f9272e in fil_space_t::io (this=0x612000023d40, type=@0x7f01eb30a910: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::WRITE_ASYNC}, offset=573440, len=8192, buf=0x631000015000, bpage=0x0) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/fil/fil0fil.cc:2813
              #8  0x0000558424ea69c4 in buf_flush_freed_pages (space=0x612000023d40) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1073
              #9  0x0000558424ea7d21 in buf_flush_space (id=1) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1206
              #10 0x0000558424ea9c0a in buf_do_flush_list_batch (max_n=2000, lsn=23743922) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1433
              #11 0x0000558424eaa73d in buf_flush_list (max_n=2000, lsn=23743922) at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:1530
              #12 0x0000558424eaf41f in buf_flush_page_cleaner () at /data/Server/bb-10.9-MDEV-26603-async-redo-writeB/storage/innobase/buf/buf0flu.cc:2448
              

            This looks like a genuine deadlock of threads to me.

            Generally, no mutex should be held across any I/O calls. An exception might be the log_sys.mutex for some redo log operations, but it definitely is blatantly wrong to hold buf_pool.mutex or buf_pool.flush_list_mutex while submitting some I/O.

            Apparently reproducing this requires a very overloaded system so that we can actually run out of the I/O slots.

            The test main.parser_bug21114_innodb is creating and dropping very many tables, so the MDEV-15528 logic (buf_flush_freed_pages()) should definitely be able to kick in there.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Fix Version/s: 10.5, 10.7, 10.8, 10.9
            Labels: crash performance → crash deadlock
            Priority: Minor → Critical
            Summary: main.parser_bug21114_innodb failed on arch 'armel' on Debian → buf_flush_freed_pages() causes InnoDB to hang
            Status: Needs Feedback → Open
            Status: Open → Confirmed

            thiru, please review. I also tried protecting buf_flush_freed_pages() with fil_space_t::latch so that we could release space->freed_range_mutex before the end of the function, but it caused hangs of the test encryption.innochecksum.

            marko Marko Mäkelä added a comment -
            marko Marko Mäkelä made changes -
            Assignee: Marko Mäkelä → Thirunarayanan Balathandayuthapani
            Status: Confirmed → In Review
            thiru Thirunarayanan Balathandayuthapani made changes -
            Status: In Review → Stalled
            Assignee: Thirunarayanan Balathandayuthapani → Marko Mäkelä
            marko Marko Mäkelä made changes -
            Fix Version/s: 10.9.0, 10.5.16, 10.6.8, 10.7.4, 10.8.3 (replacing 10.5, 10.6, 10.7, 10.8, 10.9)
            Resolution: Fixed
            Status: Stalled → Closed

            People

              Assignee: marko Marko Mäkelä
              Reporter: otto Otto Kekäläinen
              Votes: 0
              Watchers: 3

