Details

    Description

      $ mysql-test/mtr --repeat=30 --parallel=4 main.innodb_ext_key 
      Logging: /home/dan/repos/mariadb-server-10.6/mysql-test/mariadb-test-run.pl  --repeat=30 --parallel=4 main.innodb_ext_key
      vardir: /home/dan/repos/build-mariadb-server-10.6/mysql-test/var
      Checking leftover processes...
      Removing old var directory...
      Creating var directory '/home/dan/repos/build-mariadb-server-10.6/mysql-test/var'...
      Checking supported features...
      MariaDB Version 10.6.5-MariaDB
       - SSL connections supported
       - binaries built with wsrep patch
      Collecting tests...
      Installing system database...
       
      ==============================================================================
       
      TEST                                  WORKER RESULT   TIME (ms) or COMMENT
      --------------------------------------------------------------------------
       
      worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
      worker[4] Using MTR_BUILD_THREAD 301, with reserved ports 16020..16039
      worker[2] Using MTR_BUILD_THREAD 302, with reserved ports 16040..16059
      worker[3] Using MTR_BUILD_THREAD 303, with reserved ports 16060..16079
      main.innodb_ext_key 'innodb,off,unoptimized' w4 [ pass ]   2918
      main.innodb_ext_key 'covering,innodb,off' w3 [ pass ]   2958
      main.innodb_ext_key 'innodb,on,unoptimized' w2 [ pass ]   2918
      main.innodb_ext_key 'covering,innodb,on' w1 [ pass ]   2972
      main.innodb_ext_key 'covering,innodb,off' w3 [ 2 pass ]   2707
      main.innodb_ext_key 'innodb,on,unoptimized' w2 [ 2 pass ]   2722
      main.innodb_ext_key 'covering,innodb,on' w1 [ 2 pass ]   2616
      main.innodb_ext_key 'covering,innodb,off' w3 [ 3 pass ]   2654
      main.innodb_ext_key 'covering,innodb,on' w1 [ 3 pass ]   2653
      main.innodb_ext_key 'innodb,on,unoptimized' w2 [ 3 pass ]   2733
      main.innodb_ext_key 'covering,innodb,off' w3 [ 4 pass ]   2400
      main.innodb_ext_key 'covering,innodb,on' w1 [ 4 pass ]   2426
      main.innodb_ext_key 'innodb,on,unoptimized' w2 [ 4 pass ]   2507
      main.innodb_ext_key 'covering,innodb,off' w3 [ 5 pass ]   2419
      main.innodb_ext_key 'covering,innodb,on' w1 [ 5 pass ]   2371
      main.innodb_ext_key 'covering,innodb,off' w3 [ 6 pass ]   5280
      main.innodb_ext_key 'covering,innodb,on' w1 [ 6 pass ]   5267
      main.innodb_ext_key 'covering,innodb,off' w3 [ 7 pass ]   2155
      main.innodb_ext_key 'covering,innodb,on' w1 [ 7 pass ]   2143
      main.innodb_ext_key 'covering,innodb,on' w1 [ 8 pass ]   2792
      main.innodb_ext_key 'covering,innodb,off' w3 [ 8 pass ]   2825
      main.innodb_ext_key 'covering,innodb,on' w1 [ 9 pass ]   2069
      main.innodb_ext_key 'covering,innodb,off' w3 [ 9 pass ]   2031
      main.innodb_ext_key 'covering,innodb,on' w1 [ 10 pass ]   1834
      main.innodb_ext_key 'covering,innodb,off' w3 [ 10 pass ]   1848
      main.innodb_ext_key 'covering,innodb,on' w1 [ 11 pass ]   3932
      main.innodb_ext_key 'covering,innodb,off' w3 [ 11 pass ]   3929
      main.innodb_ext_key 'covering,innodb,on' w1 [ 12 pass ]   2203
      main.innodb_ext_key 'covering,innodb,off' w3 [ 12 pass ]   2303
      main.innodb_ext_key 'covering,innodb,on' w1 [ 13 pass ]   6001
      main.innodb_ext_key 'covering,innodb,off' w3 [ 13 pass ]   5990
      main.innodb_ext_key 'covering,innodb,on' w1 [ 14 pass ]   1789
      main.innodb_ext_key 'covering,innodb,off' w3 [ 14 pass ]   1782
      main.innodb_ext_key 'covering,innodb,on' w1 [ 15 pass ]   1881
      main.innodb_ext_key 'covering,innodb,off' w3 [ 15 pass ]   1896
      main.innodb_ext_key 'covering,innodb,on' w1 [ 16 pass ]   2187
      main.innodb_ext_key 'covering,innodb,off' w3 [ 16 pass ]   2261
      main.innodb_ext_key 'covering,innodb,on' w1 [ 17 pass ]   2011
      main.innodb_ext_key 'covering,innodb,off' w3 [ 17 pass ]   2048
      main.innodb_ext_key 'covering,innodb,on' w1 [ 18 pass ]   1838
      main.innodb_ext_key 'covering,innodb,off' w3 [ 18 pass ]   1862
      main.innodb_ext_key 'covering,innodb,on' w1 [ 19 pass ]   1850
      main.innodb_ext_key 'covering,innodb,off' w3 [ 19 pass ]   1761
      main.innodb_ext_key 'covering,innodb,on' w1 [ 20 pass ]   1855
      main.innodb_ext_key 'covering,innodb,off' w3 [ 20 pass ]   1907
      main.innodb_ext_key 'covering,innodb,on' w1 [ 21 pass ]   2080
      main.innodb_ext_key 'covering,innodb,off' w3 [ 21 pass ]   2134
      main.innodb_ext_key 'covering,innodb,on' w1 [ 22 pass ]   1986
      main.innodb_ext_key 'covering,innodb,off' w3 [ 22 pass ]   2019
      main.innodb_ext_key 'covering,innodb,on' w1 [ 23 pass ]   1969
      main.innodb_ext_key 'covering,innodb,off' w3 [ 23 pass ]   1855
      main.innodb_ext_key 'covering,innodb,on' w1 [ 24 pass ]   1899
      main.innodb_ext_key 'covering,innodb,off' w3 [ 24 pass ]   1803
      main.innodb_ext_key 'covering,innodb,on' w1 [ 25 pass ]   2050
      main.innodb_ext_key 'covering,innodb,off' w3 [ 25 pass ]   2034
      main.innodb_ext_key 'covering,innodb,on' w1 [ 26 pass ]   2068
      main.innodb_ext_key 'covering,innodb,off' w3 [ 26 pass ]   2081
      main.innodb_ext_key 'covering,innodb,on' w1 [ 27 pass ]   2056
      main.innodb_ext_key 'covering,innodb,off' w3 [ 27 pass ]   2210
      main.innodb_ext_key 'covering,innodb,on' w1 [ 28 pass ]   1868
      main.innodb_ext_key 'covering,innodb,off' w3 [ 28 pass ]   1855
      main.innodb_ext_key 'covering,innodb,on' w1 [ 29 pass ]   1852
      main.innodb_ext_key 'covering,innodb,off' w3 [ 29 pass ]   1786
      main.innodb_ext_key 'covering,innodb,off' w3 [ 30 pass ]   2099
      main.innodb_ext_key 'covering,innodb,on' w1 [ 30 pass ]   2205
       
       
       
      main.innodb_ext_key 'innodb,off,unoptimized' w4 [ 2 fail ]
              Test ended at 2021-09-07 22:28:31
       
      CURRENT_TEST: main.innodb_ext_key
      mysqltest: At line 588: query 'CREATE TABLE t1 (
      pk INT,
      f1 VARCHAR(3),
      f2 VARCHAR(1024),
      PRIMARY KEY (pk),
      KEY(f2)
      ) ENGINE=InnoDB CHARSET utf8 ROW_FORMAT= DYNAMIC' failed: <Unknown> (2013): Lost connection to server during query
       
      The result from queries just before the failure was:
      < snip >
      id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
      1	SIMPLE	t1	ref	index_date_updated	index_date_updated	5	const	#	Using index condition
      # This used to work from the start:
      explain 
      select * from t2 force index(index_date_updated) 
      where index_date_updated= 10 and index_id < 800;
      id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
      1	SIMPLE	t2	ref	index_date_updated	index_date_updated	5	const	#	Using index condition
      drop table t0,t1,t2;
      #
      # MDEV-11196: Error:Run-Time Check Failure #2 - Stack around the variable 'key_buff'
      # was corrupted, server crashes in opt_sum_query
      SET @save_optimizer_use_condition_selectivity=@@optimizer_use_condition_selectivity, @@optimizer_use_condition_selectivity=4;
      CREATE TABLE t1 (
      pk INT,
      f1 VARCHAR(3),
      f2 VARCHAR(1024),
      PRIMARY KEY (pk),
      KEY(f2)
      ) ENGINE=InnoDB CHARSET utf8 ROW_FORMAT= DYNAMIC;
       
      More results from queries before failure can be found in /home/dan/repos/build-mariadb-server-10.6/mysql-test/var/4/log/innodb_ext_key,off.log
       
       
      Server [mysqld.1 - pid: 1635380, winpid: 1635380, exit: 256] failed during test run
      Server log from this test:
      ----------SERVER LOG START-----------
      2021-09-07 22:28:29 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
      210907 22:28:29 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.6.5-MariaDB-log
      

      build from 10.6-40ae9c5d108d52444288e007ce9565919aa91299

      Thread 1 (Thread 0x7fa054ff9640 (LWP 1635530)):
      #0  __pthread_kill (threadid=<optimized out>, signo=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:54
              pd = <optimized out>
              tid = 1635530
              pid = 1635383
              val = 0
      #1  0x00000000008f7bc6 in handle_fatal_signal (sig=6) at /home/dan/repos/mariadb-server-10.6/sql/signal_handler.cc:345
              tm = {tm_sec = 29, tm_min = 28, tm_hour = 22, tm_mday = 7, tm_mon = 8, tm_year = 121, tm_wday = 2, tm_yday = 249, tm_isdst = 0, tm_gmtoff = 36000, tm_zone = 0x2b87e50 "AEST"}
              print_invalid_query_pointer = <optimized out>
              curr_time = 1631017709
              thd = 0x0
      #2  <signal handler called>
      No locals.
      #3  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
              set = {__val = {552967, 2147516417, 140326178332088, 46234680, 140326597527936, 13394987, 17938038, 729, 1946169716, 140326597527816, 9223512778638098432, 46234680, 1, 1, 31952248, 46263432}}
              pid = <optimized out>
              tid = <optimized out>
              ret = <optimized out>
      #4  0x00007fa08f4128a4 in __GI_abort () at abort.c:79
              save_stage = 1
              act = {__sigaction_handler = {sa_handler = 0x7fa054ff8a08, sa_sigaction = 0x7fa054ff8a08}, sa_mask = {__val = {140327578090128, 140326379457952, 0, 14, 158, 17937887, 158, 140326597528128, 11548080394786522880, 0, 18446744073709551272, 0, 140327478049406, 140327478049407, 47667776, 140326597528088}}, sa_flags = -1891137592, sa_restorer = 0x7fa054ff8910}
              sigs = {__val = {32, 23245968, 140327478049406, 140327478049407, 140326597528000, 17918744, 140326597527808, 10287617, 140326597528048, 10283534, 109, 140326111055120, 158, 0, 140326111055120, 0}}
      #5  0x0000000000c8557f in ib::fatal::~fatal (this=0x7fa054ff8a60) at /home/dan/repos/mariadb-server-10.6/storage/innobase/ut/ut0ut.cc:519
      No locals.
      #6  0x0000000000c5d720 in srv_monitor_task () at /home/dan/repos/mariadb-server-10.6/storage/innobase/srv/srv0srv.cc:1293
              waited = <optimized out>
              start = <optimized out>
              old_lsn = 8539879
              new_lsn = <optimized out>
              now = <optimized out>
              threshold = <optimized out>
      #7  0x0000000000d4310c in tpool::thread_pool_generic::timer_generic::run (this=0x2ea5970) at /home/dan/repos/mariadb-server-10.6/tpool/tpool_generic.cc:313
              expected = false
      #8  0x0000000000d439b7 in tpool::task::execute (this=0x2ea59b0) at /home/dan/repos/mariadb-server-10.6/tpool/task.cc:52
      No locals.
      #9  0x0000000000d41596 in tpool::thread_pool_generic::worker_main (this=0x2d663a0, thread_var=0x2d75a40) at /home/dan/repos/mariadb-server-10.6/tpool/tpool_generic.cc:550
              task = 0x2
      #10 0x00007fa08f6afc84 in execute_native_thread_routine () from /lib64/libstdc++.so.6
      No symbol table info available.
      #11 0x00007fa08f807299 in start_thread (arg=0x7fa054ff9640) at pthread_create.c:481
              ret = <optimized out>
              pd = 0x7fa054ff9640
              unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140326597531200, -1804064771434488301, 140327478049406, 140327478049407, 0, 140326597531200, 1852674176714399251, 1853147301372616211}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
              not_first_call = 0
      #12 0x00007fa08f4ec353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      No locals.
      (gdb)  p  dict_sys.latch
      $1 = {pfs_psi = 0x7fa08b2eccc0, lock = {writer = {lock = {<std::__atomic_base<unsigned int>> = {_M_i = 2147483650}, <No data fields>}}, readers = {<std::__atomic_base<unsigned int>> = {_M_i = 2147483648}, <No data fields>}}}
      
      

      Have noticed in debug and non-debug release

      Attachments

        1. CMakeCache.txt
          137 kB
        2. gcc_build_gdb_output.log
          50 kB
        3. gdb_output.log
          151 kB
        4. gdb.txt
          103 kB
        5. gdb-gcc-liburing-0.7-4.fc34.txt
          94 kB

        Issue Links

          Activity

            danblack Daniel Black added a comment -

            CMakeCache.txt:CMAKE_CXX_COMPILER:STRING=/usr/lib64/ccache/clang++
            CMakeCache.txt:CMAKE_CXX_COMPILER_AR:FILEPATH=/usr/bin/llvm-ar
            CMakeCache.txt:CMAKE_CXX_COMPILER_RANLIB:FILEPATH=/usr/bin/llvm-ranlib
            CMakeCache.txt:CMAKE_C_COMPILER:STRING=/usr/lib64/ccache/clang
            CMakeCache.txt:CMAKE_C_COMPILER_AR:FILEPATH=/usr/bin/llvm-ar
            CMakeCache.txt:CMAKE_C_COMPILER_RANLIB:FILEPATH=/usr/bin/llvm-ranlib
             
            $ /usr/lib64/ccache/clang++ --version
            clang version 12.0.1 (Fedora 12.0.1-1.fc34)
            Target: x86_64-unknown-linux-gnu
            Thread model: posix
            InstalledDir: /usr/bin
             
            $  /usr/bin/llvm-ar --version
            LLVM (http://llvm.org/):
              LLVM version 12.0.1
              Optimized build.
              Default target: x86_64-unknown-linux-gnu
              Host CPU: skylake
            
            

            danblack Daniel Black added a comment - CMakeCache.txt:CMAKE_CXX_COMPILER:STRING=/usr/lib64/ccache/clang++ CMakeCache.txt:CMAKE_CXX_COMPILER_AR:FILEPATH=/usr/bin/llvm-ar CMakeCache.txt:CMAKE_CXX_COMPILER_RANLIB:FILEPATH=/usr/bin/llvm-ranlib CMakeCache.txt:CMAKE_C_COMPILER:STRING=/usr/lib64/ccache/clang CMakeCache.txt:CMAKE_C_COMPILER_AR:FILEPATH=/usr/bin/llvm-ar CMakeCache.txt:CMAKE_C_COMPILER_RANLIB:FILEPATH=/usr/bin/llvm-ranlib   $ /usr/lib64/ccache/clang++ --version clang version 12.0.1 (Fedora 12.0.1-1.fc34) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /usr/bin   $ /usr/bin/llvm-ar --version LLVM (http://llvm.org/): LLVM version 12.0.1 Optimized build. Default target: x86_64-unknown-linux-gnu Host CPU: skylake
            danblack Daniel Black added a comment -

            same with gcc-11.2.1

            main.innodb_ext_key 'covering,innodb,on' w3 [ 28 pass ]   1544
            main.innodb_ext_key 'covering,innodb,on' w3 [ 29 pass ]   1575
            main.innodb_ext_key 'covering,innodb,on' w3 [ 30 pass ]   1245
            worker[2] Trying to dump core for [mysqltest - pid: 1654860, winpid: 1654860]
            worker[2] Trying to dump core for [mysqld.1 - pid: 1653466, winpid: 1653466]
            main.innodb_ext_key 'innodb,off,unoptimized' w2 [ 21 fail ]  timeout after 900 seconds
                    Test ended at 2021-09-07 23:31:08
             
            Test case timeout after 900 seconds
             
             == /home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/2/tmp/analyze-timeout-mysqld.1.err ==
            mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/tmp' (111)
             
             
             - saving '/home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/2/log/main.innodb_ext_key-innodb,off,unoptimized/' to '/home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/log/main.innodb_ext_key-innodb,off,unoptimized/'
            --------------------------------------------------------------------------
            The servers were restarted 0 times
            Spent 292.342 of 1011 seconds executing testcases
             
            Failure: Failed 1/111 tests, 99.10% were successful.
             
            gcc (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1)
            Copyright (C) 2021 Free Software Foundation, Inc.
            This is free software; see the source for copying conditions.  There is NO
            warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
            
            

            danblack Daniel Black added a comment - same with gcc-11.2.1 main.innodb_ext_key 'covering,innodb,on' w3 [ 28 pass ] 1544 main.innodb_ext_key 'covering,innodb,on' w3 [ 29 pass ] 1575 main.innodb_ext_key 'covering,innodb,on' w3 [ 30 pass ] 1245 worker[2] Trying to dump core for [mysqltest - pid: 1654860, winpid: 1654860] worker[2] Trying to dump core for [mysqld.1 - pid: 1653466, winpid: 1653466] main.innodb_ext_key 'innodb,off,unoptimized' w2 [ 21 fail ] timeout after 900 seconds Test ended at 2021-09-07 23:31:08   Test case timeout after 900 seconds   == /home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/2/tmp/analyze-timeout-mysqld.1.err == mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/tmp' (111)     - saving '/home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/2/log/main.innodb_ext_key-innodb,off,unoptimized/' to '/home/dan/repos/build-mariadb-server-10.6-gcc/mysql-test/var/log/main.innodb_ext_key-innodb,off,unoptimized/' -------------------------------------------------------------------------- The servers were restarted 0 times Spent 292.342 of 1011 seconds executing testcases   Failure: Failed 1/111 tests, 99.10% were successful.   gcc (GCC) 11.2.1 20210728 (Red Hat 11.2.1-1) Copyright (C) 2021 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

            danblack, can you provide a stack trace of all threads, either from the test case timeout or from a crash with the srv_monitor_task watchdog? I may need more information from that session to determine the entire waits-for chain. The watchdog is only a messenger here, and the 15-minute test case timeout suggests that it was not a false alarm.

            I was unable to reproduce this hang myself so far.

            marko Marko Mäkelä added a comment - danblack , can you provide a stack trace of all threads, either from the test case timeout or from a crash with the srv_monitor_task watchdog? I may need more information from that session to determine the entire waits-for chain. The watchdog is only a messenger here, and the 15-minute test case timeout suggests that it was not a false alarm. I was unable to reproduce this hang myself so far.

            Sorry, I initially overlooked gdb.txt. The InnoDB threads are the following:

            • Thread 18 (purge_worker_callback) is waiting for an exclusive latch on page 10 (probably the root and leaf page of the SYS_COLUMNS table).
            • Thread 14 (aio_uring::thread_routine) is waiting for something in io_uring_wait_cqe().
            • Thread 12 (buf_flush_page_cleaner) is waiting for buf_dblwr.cond in buf_dblwr_t::flush_buffered_writes().
            • Thread 11 (purge_coordinator_callback) is waiting for the purge worker (Thread 18) to finish.
            • Thread 10 (CREATE TABLE t1) is waiting for an exclusive latch on page 7 (the DICT_HDR page) in dict_hdr_get_new_id().
            • Thread 9 (purge_worker_callback) is waiting for an exclusive latch on page 11 (probably the root and leaf page of the SYS_INDEXES table).
            • Thread 2 (srv_master_callback) is waiting for exclusive dict_sys.latch in dict_sys_t::evict_table_LRU(). This wait could be tripping the watchdog.
            • Thread 1 (srv_monitor_task) is the watchdog that commits suicide due to a long wait on dict_sys.latch.

            It would seem to me that the culprit is whatever is blocking Thread 12. The only place where the condition variable would be signaled is buf_dblwr_t::write_completed(), and its only caller is buf_page_write_complete(), whose only caller is fil_aio_callback(), whose only caller is io_callback, registered in os_aio(). I see that Thread 14 is executing something io_uring related.

            Can you repeat this with a different version of liburing? Or if you build with libaio instead of io_uring? (Be sure that libaio-dev is available, and cmake -DCMAKE_DISABLE_FIND_PACKAGE_URING=1.)

            marko Marko Mäkelä added a comment - Sorry, I initially overlooked gdb.txt . The InnoDB threads are the following: Thread 18 ( purge_worker_callback ) is waiting for an exclusive latch on page 10 (probably the root and leaf page of the SYS_COLUMNS table). Thread 14 ( aio_uring::thread_routine ) is waiting for something in io_uring_wait_cqe() . Thread 12 ( buf_flush_page_cleaner ) is waiting for buf_dblwr.cond in buf_dblwr_t::flush_buffered_writes() . Thread 11 ( purge_coordinator_callback ) is waiting for the purge worker (Thread 18) to finish. Thread 10 ( CREATE TABLE t1 ) is waiting for an exclusive latch on page 7 (the DICT_HDR page) in dict_hdr_get_new_id() . Thread 9 ( purge_worker_callback ) is waiting for an exclusive latch on page 11 (probably the root and leaf page of the SYS_INDEXES table). Thread 2 ( srv_master_callback ) is waiting for exclusive dict_sys.latch in dict_sys_t::evict_table_LRU() . This wait could be tripping the watchdog. Thread 1 ( srv_monitor_task ) is the watchdog that commits suicide due to a long wait on dict_sys.latch . It would seem to me that the culprit is whatever is blocking Thread 12. The only place where the condition variable would be signaled is buf_dblwr_t::write_completed() , and its only caller is buf_page_write_complete() , whose only caller is fil_aio_callback() , whose only caller is io_callback , registered in os_aio() . I see that Thread 14 is executing something io_uring related. Can you repeat this with a different version of liburing ? Or if you build with libaio instead of io_uring ? (Be sure that libaio-dev is available, and cmake -DCMAKE_DISABLE_FIND_PACKAGE_URING=1 .)
            danblack Daniel Black added a comment -

            a) Have tried many times on libaio-0.3.111-11.fc34 and any failure isn't repeatable

            b) previous liburing was a 2.0 version from git from 6 months ago circa commit d3eb6af47886656e811e8f4e274ac361d0efa5bf on the liburing

            I've since purged this out and am back using the distro version liburing-0.7-4.fc34

            c) The hang still occured: gdb-gcc-liburing-0.7-4.fc34.txt for backtrace.

            Notably in a uring syscall (though a different one)

            Thread 2 (Thread 0x7f5c83fff640 (LWP 1708513)):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            No locals.
            #1  0x00007f5c8ee76a32 in __io_uring_get_cqe () from /lib64/liburing.so.1
            No symbol table info available.
            #2  0x0000555fda356a35 in io_uring_wait_cqe_nr (wait_nr=1, cqe_ptr=0x7f5c83ffec38, ring=0x555fdc8a4cb8) at /usr/include/liburing.h:494
            No locals.
            #3  io_uring_wait_cqe (cqe_ptr=0x7f5c83ffec38, ring=0x555fdc8a4cb8) at /usr/include/liburing.h:514
            No locals.
            #4  (anonymous namespace)::aio_uring::thread_routine (aio=0x555fdc8a4cb0) at /home/dan/repos/mariadb-server-10.6/tpool/aio_liburing.cc:137
                    ret = <optimized out>
                    cqe = 0x7f5c8ef4f990
                    iocb = <optimized out>
                    res = <optimized out>
            #5  0x00007f5c8e83dc84 in execute_native_thread_routine () from /lib64/libstdc++.so.6
            No symbol table info available.
            #6  0x00007f5c8e995299 in start_thread (arg=0x7f5c83fff640) at pthread_create.c:481
                    ret = <optimized out>
                    pd = 0x7f5c83fff640
                    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140035328308800, -1066475383518023308, 140722199995678, 140722199995679, 0, 140035328308800, 1119735726208299380, 1119711274098602356}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
                    not_first_call = 0
            #7  0x00007f5c8e536353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            

            One more test coming, a build on the libio_uring 2.0 master branch.

            danblack Daniel Black added a comment - a) Have tried many times on libaio-0.3.111-11.fc34 and any failure isn't repeatable b) previous liburing was a 2.0 version from git from 6 months ago circa commit d3eb6af47886656e811e8f4e274ac361d0efa5bf on the liburing I've since purged this out and am back using the distro version liburing-0.7-4.fc34 c) The hang still occured: gdb-gcc-liburing-0.7-4.fc34.txt for backtrace. Notably in a uring syscall (though a different one) Thread 2 (Thread 0x7f5c83fff640 (LWP 1708513)): #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 No locals. #1 0x00007f5c8ee76a32 in __io_uring_get_cqe () from /lib64/liburing.so.1 No symbol table info available. #2 0x0000555fda356a35 in io_uring_wait_cqe_nr (wait_nr=1, cqe_ptr=0x7f5c83ffec38, ring=0x555fdc8a4cb8) at /usr/include/liburing.h:494 No locals. #3 io_uring_wait_cqe (cqe_ptr=0x7f5c83ffec38, ring=0x555fdc8a4cb8) at /usr/include/liburing.h:514 No locals. #4 (anonymous namespace)::aio_uring::thread_routine (aio=0x555fdc8a4cb0) at /home/dan/repos/mariadb-server-10.6/tpool/aio_liburing.cc:137 ret = <optimized out> cqe = 0x7f5c8ef4f990 iocb = <optimized out> res = <optimized out> #5 0x00007f5c8e83dc84 in execute_native_thread_routine () from /lib64/libstdc++.so.6 No symbol table info available. #6 0x00007f5c8e995299 in start_thread (arg=0x7f5c83fff640) at pthread_create.c:481 ret = <optimized out> pd = 0x7f5c83fff640 unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140035328308800, -1066475383518023308, 140722199995678, 140722199995679, 0, 140035328308800, 1119735726208299380, 1119711274098602356}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = 0 #7 0x00007f5c8e536353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 One more test coming, a build on the libio_uring 2.0 master branch.
            danblack Daniel Black added a comment -

            latest liburing appears to hang like the original gdb backtraces.

            fd=0 looks odd here because its mean to be a file descriptor returned by io_uring_setup, which shouldn't be 0.

            liburing-a357750b1547a7c5f4caac49bb0c556e4d4bb8c6

            Thread 4 (Thread 0x7efede617640 (LWP 1727423) "mariadbd"):
            #0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
            No locals.
            #1  0x00007efee4c91350 in __sys_io_uring_enter2 (fd=0, to_submit=1, min_complete=3828653581, flags=0, sig=0x8, sz=<optimized out>) at syscall.c:64
            No locals.
            #2  0x00007efee4c90ca4 in _io_uring_get_cqe (ring=0x5558a7938cb8, cqe_ptr=0x7efede616c38, data=data@entry=0x7efede616c10) at queue.c:122
                    flags = 3828653581
                    nr_available = <optimized out>
                    ret = <optimized out>
                    need_enter = <optimized out>
                    cq_overflow_flush = <optimized out>
                    cqe = <optimized out>
                    err = <optimized out>
            #3  0x00007efee4c90bb7 in __io_uring_get_cqe (ring=0x8, cqe_ptr=0x0, submit=<optimized out>, wait_nr=3828653581, sigmask=0x0) at queue.c:150
                    data = {submit = 0, wait_nr = 1, get_flags = 0, sz = 8, arg = 0x0}
            #4  0x00005558a4a0ca45 in io_uring_wait_cqe_nr (wait_nr=1, cqe_ptr=0x7efede616c38, ring=0x5558a7938cb8) at /usr/local/include/liburing.h:745
            No locals.
            #5  io_uring_wait_cqe (cqe_ptr=0x7efede616c38, ring=0x5558a7938cb8) at /usr/local/include/liburing.h:765
            No locals.
            #6  (anonymous namespace)::aio_uring::thread_routine (aio=0x5558a7938cb0) at /home/dan/repos/mariadb-server-10.6/tpool/aio_liburing.cc:137
                    ret = <optimized out>
                    cqe = 0x7efee4d6cf20
                    iocb = <optimized out>
                    res = <optimized out>
            #7  0x00007efee4656c84 in execute_native_thread_routine () from /lib64/libstdc++.so.6
            No symbol table info available.
            #8  0x00007efee47ae299 in start_thread (arg=0x7efede617640) at pthread_create.c:481
                    ret = <optimized out>
                    pd = 0x7efede617640
                    unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139633117722176, -1501876349270139299, 140736435846142, 140736435846143, 0, 139633117722176, 1645502336706109021, 1645374754647006813}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
                    not_first_call = 0
            #9  0x00007efee434f353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            No locals.
            

            danblack Daniel Black added a comment - latest liburing appears to hang like the original gdb backtraces. fd=0 looks odd here because its mean to be a file descriptor returned by io_uring_setup, which shouldn't be 0. liburing-a357750b1547a7c5f4caac49bb0c556e4d4bb8c6 Thread 4 (Thread 0x7efede617640 (LWP 1727423) "mariadbd"): #0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38 No locals. #1 0x00007efee4c91350 in __sys_io_uring_enter2 (fd=0, to_submit=1, min_complete=3828653581, flags=0, sig=0x8, sz=<optimized out>) at syscall.c:64 No locals. #2 0x00007efee4c90ca4 in _io_uring_get_cqe (ring=0x5558a7938cb8, cqe_ptr=0x7efede616c38, data=data@entry=0x7efede616c10) at queue.c:122 flags = 3828653581 nr_available = <optimized out> ret = <optimized out> need_enter = <optimized out> cq_overflow_flush = <optimized out> cqe = <optimized out> err = <optimized out> #3 0x00007efee4c90bb7 in __io_uring_get_cqe (ring=0x8, cqe_ptr=0x0, submit=<optimized out>, wait_nr=3828653581, sigmask=0x0) at queue.c:150 data = {submit = 0, wait_nr = 1, get_flags = 0, sz = 8, arg = 0x0} #4 0x00005558a4a0ca45 in io_uring_wait_cqe_nr (wait_nr=1, cqe_ptr=0x7efede616c38, ring=0x5558a7938cb8) at /usr/local/include/liburing.h:745 No locals. #5 io_uring_wait_cqe (cqe_ptr=0x7efede616c38, ring=0x5558a7938cb8) at /usr/local/include/liburing.h:765 No locals. #6 (anonymous namespace)::aio_uring::thread_routine (aio=0x5558a7938cb0) at /home/dan/repos/mariadb-server-10.6/tpool/aio_liburing.cc:137 ret = <optimized out> cqe = 0x7efee4d6cf20 iocb = <optimized out> res = <optimized out> #7 0x00007efee4656c84 in execute_native_thread_routine () from /lib64/libstdc++.so.6 No symbol table info available. #8 0x00007efee47ae299 in start_thread (arg=0x7efede617640) at pthread_create.c:481 ret = <optimized out> pd = 0x7efede617640 unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139633117722176, -1501876349270139299, 140736435846142, 140736435846143, 0, 139633117722176, 1645502336706109021, 1645374754647006813}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}} not_first_call = 0 #9 0x00007efee434f353 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 No locals.

            I am unable to reproduce any hang on my Debian Sid (unstable) system, using cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo:

            10.6 106b16a5af7a8e29889e0ef8930bd2901a9ad00e

            ./mtr --parallel=auto --repeat=100 main.innodb_ext_key{,,,,,,,,,,,,,,,}
            …
            main.innodb_ext_key 'covering,innodb,on' w30 [ 100 pass ]    383
            main.innodb_ext_key 'covering,innodb,on' w43 [ 100 pass ]    379
            --------------------------------------------------------------------------
            The servers were restarted 18 times
            Spent 4162.514 of 153 seconds executing testcases
             
            Completed: All 6400 tests were successful.
            

            This was with liburing1 version 0.7-3, GCC 11.2.0 and

            $ cat /proc/version 
            Linux version 5.10.0-8-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.46-4 (2021-08-03)
            

            marko Marko Mäkelä added a comment - I am unable to reproduce any hang on my Debian Sid (unstable) system, using cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo : 10.6 106b16a5af7a8e29889e0ef8930bd2901a9ad00e ./mtr --parallel=auto --repeat=100 main.innodb_ext_key{,,,,,,,,,,,,,,,} … main.innodb_ext_key 'covering,innodb,on' w30 [ 100 pass ] 383 main.innodb_ext_key 'covering,innodb,on' w43 [ 100 pass ] 379 -------------------------------------------------------------------------- The servers were restarted 18 times Spent 4162.514 of 153 seconds executing testcases   Completed: All 6400 tests were successful. This was with liburing1 version 0.7-3, GCC 11.2.0 and $ cat /proc/version Linux version 5.10.0-8-amd64 (debian-kernel@lists.debian.org) (gcc-10 (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2) #1 SMP Debian 5.10.46-4 (2021-08-03)
            danblack Daniel Black added a comment -

            updated /tmp/gdb_output.log with a -O0 -g build of liburing latest. At least its showing fd 8 now, fd 0 in mtr is /dev/null.

            5.13.16-200.fc34.x86_64

            Stack makes sense that its waiting on a io_uring event. I need to look more tomorrow.

            danblack Daniel Black added a comment - updated /tmp/gdb_output.log with a -O0 -g build of liburing latest. At least its showing fd 8 now, fd 0 in mtr is /dev/null. 5.13.16-200.fc34.x86_64 Stack makes sense that its waiting on a io_uring event. I need to look more tomorrow.

            Thread 14 in gdb.txt is stuck in io_uring_wait_cqe. This looks related to MDEV-26674 (Debian with Linux 5.14 kernel). I never saw such trouble with the 5.10 kernel that I was using previously.

            marko Marko Mäkelä added a comment - Thread 14 in gdb.txt is stuck in io_uring_wait_cqe. This looks related to MDEV-26674 (Debian with Linux 5.14 kernel). I never saw such trouble with the 5.10 kernel that I was using previously.
            danblack Daniel Black added a comment - - edited

            Also reproducible on same build in 5.10 kernel (per MDEV-26674).

            5.15rc appears to have fixes.

            5.15.0-0.rc2.20210923git58e2cf5d7946.21.fc36.x86_64

            $ mysql-test/mtr --repeat=30 --parallel=1 main.innodb_ext_key
             
            main.innodb_ext_key 'innodb,off,unoptimized' [ 19 pass ]    378
            main.innodb_ext_key 'innodb,off,unoptimized' [ 20 pass ]    386
            main.innodb_ext_key 'innodb,off,unoptimized' [ 21 pass ]    383
            main.innodb_ext_key 'innodb,off,unoptimized' [ 22 pass ]    396
             
             
            main.innodb_ext_key 'innodb,off,unoptimized' [ 23 pass ]  60210
            ....
             
            The servers were restarted 3 times
            Spent 106.395 of 146 seconds executing testcases
             
            Completed: All 120 tests were successful.
            

            That one very anomalous timing wasn't reproducible.

            bigger and longer - repeated twice

            $ mysql-test/mtr --repeat=30 --parallel=8 main.innodb_ext_key{,,,,,,,}
             
            ...
            --------------------------------------------------------------------------
            The servers were restarted 12 times
            Spent 1063.687 of 200 seconds executing testcases
             
            Completed: All 960 tests were successful.
            

            danblack Daniel Black added a comment - - edited Also reproducible on same build in 5.10 kernel (per MDEV-26674 ). 5.15rc appears to have fixes. 5.15.0-0.rc2.20210923git58e2cf5d7946.21.fc36.x86_64 $ mysql-test/mtr --repeat=30 --parallel=1 main.innodb_ext_key   main.innodb_ext_key 'innodb,off,unoptimized' [ 19 pass ] 378 main.innodb_ext_key 'innodb,off,unoptimized' [ 20 pass ] 386 main.innodb_ext_key 'innodb,off,unoptimized' [ 21 pass ] 383 main.innodb_ext_key 'innodb,off,unoptimized' [ 22 pass ] 396     main.innodb_ext_key 'innodb,off,unoptimized' [ 23 pass ] 60210 ....   The servers were restarted 3 times Spent 106.395 of 146 seconds executing testcases   Completed: All 120 tests were successful. That one very anomalous timing wasn't reproducible. bigger and longer - repeated twice $ mysql-test/mtr --repeat=30 --parallel=8 main.innodb_ext_key{,,,,,,,}   ... -------------------------------------------------------------------------- The servers were restarted 12 times Spent 1063.687 of 200 seconds executing testcases   Completed: All 960 tests were successful.

            Even after rerun. I could not reproduce this with my Ubuntu Hirsute Hippo, 5.11.0-37-generic and 10.6 (I verified that liburing 0.7-3ubuntu1 was in ldd).

            jplindst Jan Lindström (Inactive) added a comment - - edited Even after rerun. I could not reproduce this with my Ubuntu Hirsute Hippo, 5.11.0-37-generic and 10.6 (I verified that liburing 0.7-3ubuntu1 was in ldd).
            danblack Daniel Black added a comment -

            Duplicate of MDEV-26674

            danblack Daniel Black added a comment - Duplicate of MDEV-26674

            People

              danblack Daniel Black
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.