MariaDB Server / MDEV-32681

Test case innodb.undo_truncate_recover frequent failure



    Description

      innodb.undo_truncate_recover '16k,2,innodb,strict_full_crc32' w15 [ fail ]
              Test ended at 2023-11-03 15:59:19
       
      CURRENT_TEST: innodb.undo_truncate_recover
      --- /home/buildbot/amd64-debian-11-msan/build/mysql-test/suite/innodb/r/undo_truncate_recover.result	2023-11-03 15:04:02.000000000 +0000
      +++ /home/buildbot/amd64-debian-11-msan/build/mysql-test/suite/innodb/r/undo_truncate_recover.reject	2023-11-03 15:59:19.194596785 +0000
      @@ -11,5 +11,5 @@
       drop table t1;
       call mtr.add_suppression("InnoDB: innodb_undo_tablespaces=0 disables dedicated undo log tablespaces");
       SET GLOBAL innodb_fast_shutdown=0;
      -FOUND 1 /ib_undo_trunc/ in mysqld.1.err
      +NOT FOUND /ib_undo_trunc/ in mysqld.1.err
       # restart: with restart_parameters
      

      When it happens, the following appears in mysqld.1.err:

      2023-11-03 15:58:19 0 [Note] InnoDB: Truncating .//undo001
      2023-11-03 15:58:19 0 [Warning] InnoDB: Difficult to find free blocks in the buffer pool (21 search iterations)! 21 failed attempts to flush a page! Consider increasing innodb_buffer_pool_size. Pending flushes (fsync) log: 0; buffer pool: 0. 1491 OS file reads, 1420 OS file writes, 231 OS fsyncs.
      231103 15:59:17 [ERROR] mysqld got signal 6 ;
      

      I think the signal 6 might be due to a timeout while waiting for the server to shut down (the test case manually restarts the server at this point).

      At this point the server is stuck during shutdown in trx_purge_truncate_history(), trying to re-create the rollback segment (rseg) header pages for the truncated tablespace, but unable to get a free buffer pool page, which matches the warning in the error log above. Stack traces:

      #12 0x00007f662f46d90c in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
      #13 0x0000559380bb0bab in tpool::waitable_task::wait (this=0x55938203ece0 <purge_coordinator_task>, lk=...) at /kvm/src/my/sn-mdev30255/tpool/task.cc:79
      #14 0x0000559380bb0bf9 in tpool::waitable_task::wait (this=0x55938203ece0 <purge_coordinator_task>) at /kvm/src/my/sn-mdev30255/tpool/task.cc:85
      #15 0x00005593809b6216 in srv_purge_shutdown () at /kvm/src/my/sn-mdev30255/storage/innobase/srv/srv0srv.cc:2048
      #16 0x00005593809be4a3 in innodb_preshutdown () at /kvm/src/my/sn-mdev30255/storage/innobase/srv/srv0start.cc:2003
      #17 0x000055938043c47c in plugin_pre_shutdown (plugin=0x559382a31d78) at /kvm/src/my/sn-mdev30255/sql/handler.cc:1000
      #18 0x00005593800ff526 in plugin_foreach_with_mask (thd=0x0, func=0x55938043c43b <plugin_pre_shutdown(THD*, plugin_ref, void*)>, type=1, state_mask=10, arg=0x0) at /kvm/src/my/sn-mdev30255/sql/sql_plugin.cc:2553
      #19 0x000055938043c4a8 in ha_pre_shutdown () at /kvm/src/my/sn-mdev30255/sql/handler.cc:1007
      #20 0x000055937ffa6c10 in mysqld_main (argc=170, argv=0x559382731150) at /kvm/src/my/sn-mdev30255/sql/mysqld.cc:5734
      #21 0x000055937ff9d465 in main (argc=29, argv=0x7fff57493358) at /kvm/src/my/sn-mdev30255/sql/main.cc:25
       
      Thread 2 (Thread 1407920.1407930 (mariadbd)):
      #0  buf_LRU_get_free_block (have_mutex=false) at /kvm/src/my/sn-mdev30255/storage/innobase/buf/buf0lru.cc:486
      #1  0x0000559380b2e530 in fsp_page_create (space=0x5593828ab850, offset=49, mtr=0x7f66248f55f0) at /kvm/src/my/sn-mdev30255/storage/innobase/fsp/fsp0fsp.cc:1080
      #2  0x0000559380b2eb05 in fsp_alloc_free_page (space=0x5593828ab850, hint=0, mtr=0x7f66248f55f0, init_mtr=0x7f66248f55f0) at /kvm/src/my/sn-mdev30255/storage/innobase/fsp/fsp0fsp.cc:1197
      #3  0x0000559380b323db in fseg_alloc_free_page_low (space=0x5593828ab850, seg_inode=0x7f662e0762b2 "", iblock=0x7f662dbd3020, hint=0, direction=111 'o', has_done_reservation=false, mtr=0x7f66248f55f0, init_mtr=0x7f66248f55f0) at /kvm/src/my/sn-mdev30255/storage/innobase/fsp/fsp0fsp.cc:2136
      #4  0x0000559380b30a27 in fseg_create (space=0x5593828ab850, byte_offset=62, mtr=0x7f66248f55f0, has_done_reservation=false, block=0x0) at /kvm/src/my/sn-mdev30255/storage/innobase/fsp/fsp0fsp.cc:1759
      #5  0x00005593809fda94 in trx_rseg_header_create (space=0x5593828ab850, rseg_id=93, max_trx_id=37, sys_header=0x7f662dbd2c00, mtr=0x7f66248f55f0) at /kvm/src/my/sn-mdev30255/storage/innobase/trx/trx0rseg.cc:321
      #6  0x00005593809debaa in trx_purge_truncate_history () at /kvm/src/my/sn-mdev30255/storage/innobase/trx/trx0purge.cc:737
      #7  0x00005593809b57f1 in srv_do_purge (n_total_purged=0x7f66248f5b18) at /kvm/src/my/sn-mdev30255/storage/innobase/srv/srv0srv.cc:1847
      #8  0x00005593809b5d92 in purge_coordinator_callback_low () at /kvm/src/my/sn-mdev30255/storage/innobase/srv/srv0srv.cc:1945
      #9  0x00005593809b5e8f in purge_coordinator_callback () at /kvm/src/my/sn-mdev30255/storage/innobase/srv/srv0srv.cc:1974
      #10 0x0000559380bb0678 in tpool::task_group::execute (this=0x55938203ec40 <purge_coordinator_task_group>, t=0x55938203ece0 <purge_coordinator_task>) at /kvm/src/my/sn-mdev30255/tpool/task_group.cc:55
      #11 0x0000559380bb09ce in tpool::task::execute (this=0x55938203ece0 <purge_coordinator_task>) at /kvm/src/my/sn-mdev30255/tpool/task.cc:47
      #12 0x0000559380baa527 in tpool::thread_pool_generic::worker_main (this=0x55938279d090, thread_var=0x55938279d5c0) at /kvm/src/my/sn-mdev30255/tpool/tpool_generic.cc:599
      #13 0x0000559380bb0418 in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__f=@0x7f6614001798: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x559380baa492 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, __t=@0x7f6614001790: 0x55938279d090) at /usr/include/c++/10/bits/invoke.h:73
      #14 0x0000559380bb0308 in std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=@0x7f6614001798: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x559380baa492 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>) at /usr/include/c++/10/bits/invoke.h:95
      #15 0x0000559380bb023b in std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=0x7f6614001788) at /usr/include/c++/10/thread:264
      #16 0x0000559380bb01d8 in std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (this=0x7f6614001788) at /usr/include/c++/10/thread:271
      #17 0x0000559380bb01bc in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=0x7f6614001780) at /usr/include/c++/10/thread:215
      #18 0x00007f662f472ed0 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
      #19 0x00007f662f22bea7 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #20 0x00007f662f149a2f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
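
      To spell out what the two stack traces show, here is a simplified sketch of the wait chain, pieced together from the frames above (illustrative only, not the actual code paths):

      /* Thread 1 (shutdown):
           innodb_preshutdown()
             -> srv_purge_shutdown()
                  -> purge_coordinator_task.wait()       // blocks until thread 2 finishes its task

         Thread 2 (purge coordinator):
           trx_purge_truncate_history()                  // holds the dirty pages of the undo
             -> trx_rseg_header_create()                 // tablespace locked (see analysis below)
                  -> fseg_create()
                       -> fsp_alloc_free_page()
                            -> buf_LRU_get_free_block()  // keeps waiting for a free block; the
                                                         // "failed attempts to flush a page"
                                                         // warning suggests none can be produced
      */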
      

      I was able to reproduce this with --rr, and I currently have the run saved.

      I see that the code locks all dirty buffer pool pages belonging to the undo tablespace, and then moves on to re-creating all the rseg header pages. I wonder if it managed to lock all the pages in the buffer pool, thus locking itself out from creating new pages? Maybe there should be a check whether the code is about to lock more than, say, 50% of the buffer pool pages, and if so, flush some pages explicitly, or abort and retry the truncation later. Even better would be if the pages could be dropped from the buffer pool without flushing, but I am not sure whether that can be done in a crash-recovery-safe way.
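
      To make that idea concrete, here is a rough sketch of the kind of check I have in mind. This is purely illustrative C++ pseudocode; none of these names (pool_stats, truncate_action, check_latch_budget) exist in the InnoDB source, and the 50% threshold is only an example:

      #include <cstddef>

      // Purely illustrative sketch; not actual InnoDB code.
      struct pool_stats {
        std::size_t latched_by_truncate;  // pages already locked by the truncation
        std::size_t total_pages;          // total pages in the buffer pool
      };

      enum class truncate_action { proceed, back_off };

      // Decide whether the truncation may keep locking/creating pages, or whether it
      // should back off first (flush some of the locked pages explicitly, or abort and
      // retry the truncation in a later purge batch).
      truncate_action check_latch_budget(const pool_stats &s) {
        // Once the truncation holds more than half of the buffer pool, creating the new
        // rseg header pages can no longer obtain a free block: the locked dirty pages
        // cannot be flushed, and buf_LRU_get_free_block() keeps waiting, which is the
        // hang shown in the stack traces above.
        if (s.latched_by_truncate * 2 > s.total_pages)
          return truncate_action::back_off;
        return truncate_action::proceed;
      }

      With such a budget, the truncation could never lock enough of the pool to starve buf_LRU_get_free_block(), at the cost of sometimes flushing or deferring the truncate.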

      In 10.5, the test failure is easy to trigger with 16k pages; I can see that it hangs once it has locked + created around 500 pages (with an 8M buffer pool). It cannot be reproduced with a 16M buffer pool.

      In 10.6, the failure no longer occurs. I am not sure whether this is accidental or whether the code has been fixed. In 10.6 the code still locks all modified pages in the to-be-truncated undo tablespace, but the number of pages is smaller, so it does not reach the roughly 500 locked + created pages that triggered the hang in 10.5.

      People

        Assignee: marko (Marko Mäkelä)
        Reporter: knielsen (Kristian Nielsen)
        Votes: 0
        Watchers: 4

