MariaDB Server / MDEV-30940

innodb.lock_move_wait_lock_race hangs sporadically

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6
    • Fix Version/s: 10.6
    • Component/s: Tests
    • Labels: None

    Description

      There is a hang in this test:
      CURRENT_TEST: innodb.lock_move_wait_lock_race

      mysqltest: At line 57: query 'DROP TABLE t' failed: <Unknown> (2013): Lost connection to server during query
      

      probably caused by this:

      #8  0x0000562390d6d325 in debug_sync (thd=0x7f030c000d58, sync_point_name=<optimized out>, name_len=23) at /mariadb/10.6/sql/debug_sync.cc:1760
      #9  0x000056239114dfaa in lock_sys_t::cancel<false> (trx=trx@entry=0x7f0390880180, lock=0x7f0390880280) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:5792
      #10 0x0000562391149a25 in lock_wait (thr=thr@entry=0x7f030c05eb18) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1891
      #11 0x00005623911e1547 in row_mysql_handle_errors (new_err=new_err@entry=0x7f0388620030, trx=trx@entry=0x7f0390880180, thr=thr@entry=0x7f030c05eb18, savept=savept@entry=0x0) at /mariadb/10.6/storage/innobase/row/row0mysql.cc:681
      #12 0x000056239120d9f4 in row_search_mvcc (buf=buf@entry=0x7f030c064e98 "\377", mode=PAGE_CUR_GE, prebuilt=0x7f030c05e480, match_mode=<optimized out>, direction=direction@entry=0) at /mariadb/10.6/storage/innobase/row/row0sel.cc:5856
      #13 0x0000562390fec356 in ha_innobase::index_read (this=0x7f030c052300, buf=0x7f030c064e98 "\377", key_ptr=0x7f030c060310 "\n", key_len=4, find_flag=HA_READ_KEY_EXACT) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:9060
      in UPDATE t SET c = "abcdefghij" WHERE pk = 10
      

      And this:

      #8  0x0000562390d6d325 in debug_sync (thd=0x7f0314000d58, sync_point_name=<optimized out>, name_len=34) at /mariadb/10.6/sql/debug_sync.cc:1760
      #9  0x0000562391137aba in operator() (__closure=__closure@entry=0x7f038b484066) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:3206
      #10 0x000056239114137c in detail::scope_exit<lock_rec_store_on_page_infimum(const buf_block_t*, const rec_t*)::<lambda()> >::~scope_exit (this=0x7f038b484066, __in_chrg=<optimized out>) at /mariadb/10.6/include/scope.h:50
      #11 lock_rec_store_on_page_infimum (block=block@entry=0x7f038b815950, rec=<optimized out>) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:3212
      #12 0x00005623910969db in btr_cur_pessimistic_update (flags=flags@entry=10, cursor=cursor@entry=0x7f030804d708, offsets=offsets@entry=0x7f038b4841a8, offsets_heap=offsets_heap@entry=0x7f038b484248, entry_heap=entry_heap@entry=0x7f03141eaee0, big_rec=big_rec@entry=0x7f038b4841b8, update=0x7f030804d7e0, cmpl_info=1, thr=0x7f03141ee6f0, trx_id=2517, mtr=0x7f038b4844b0) at /mariadb/10.6/storage/innobase/btr/btr0cur.cc:4094
      

      in UPDATE t SET c = NULL WHERE pk = 10.

      The test was killed in order not to wait for the debug sync point timeout.
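
      For orientation, frames #9 to #11 of the second stack show the sync point being fired from a scope_exit destructor, i.e. only when lock_rec_store_on_page_infimum() returns. A minimal, self-contained model of such a guard (a simplified sketch, not the actual InnoDB detail::scope_exit) is:

      #include <cstdio>
      #include <functional>

      // Simplified scope-exit guard: runs its callback when the enclosing
      // scope ends, which is why the sync point shows up in a destructor frame.
      template <typename F> struct scope_exit
      {
        F f;
        ~scope_exit() { f(); }
      };

      // Stand-in for lock_rec_store_on_page_infimum(): the callback plays the
      // role of the DEBUG_SYNC call and fires only on return from the function.
      static void store_on_page_infimum_model()
      {
        scope_exit<std::function<void()>> guard
          {[] { std::puts("sync point fired at end of function"); }};
        std::puts("move the waiting lock to the page infimum");
      } // guard is destroyed here, like frame #10 in the stack above

      int main()
      {
        store_on_page_infimum_model();
        return 0;
      }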

          Activity

            vlad.lesin Vladislav Lesin added a comment:

            Failed to reproduce it with:

            ./mtr --max-test-fail=1 --debug-sync-timeout=10 --parallel=256 innodb.lock_move_wait_lock_race{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} --repeat=1000
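
            For reference, the {,,,…} part of that command line is shell brace expansion (assuming a shell such as bash): each comma adds one more copy of the test name, so mtr receives one copy per worker and --parallel runs them concurrently. A smaller sketch of the same idea:

            # brace expansion yields four copies of the test name for four workers
            ./mtr --parallel=4 --repeat=100 innodb.lock_move_wait_lock_race{,,,}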
            

            But according to the backtraces, it looks like the trx2_moved_locks signal was lost. lock_rec_store_on_page_infimum() signals trx2_moved_locks to indicate that it has been entered, and then waits for the trx2_cont signal to continue. But trx2_cont is sent only after another connection has received trx2_moved_locks.

            So it looks like the test hangs on SET DEBUG_SYNC="now WAIT_FOR trx2_moved_locks";. But why does mtr report that the connection was lost on the DROP TABLE?

            The second backtrace shows that the lock_rec_store_on_page_infimum_end sync point was reached, so trx2_moved_locks should have been emitted. But the signal was not caught.
            There should be no signal overwriting, because the sequence of SET DEBUG_SYNC statements in the default connection serializes the order in which the sync points are reached:

            --connection default                                                            
            SET DEBUG_SYNC="now WAIT_FOR trx3_start_waiting";                               
            COMMIT;                                                                         
            SET DEBUG_SYNC="now WAIT_FOR trx2_wait_end";                                    
            SET DEBUG_SYNC="now SIGNAL trx3_cont_waiting";                                  
            SET DEBUG_SYNC="now WAIT_FOR trx3_cancel_enter";                                
            SET DEBUG_SYNC="now SIGNAL trx2_cont_upd";                                      
            SET DEBUG_SYNC="now WAIT_FOR trx2_moved_locks";
            

            Still, the only explanation I can see for a signal being emitted but not caught is that it was overwritten by some other signal. It seems there have been some changes to the debug sync points recently; they could affect the test somehow and need to be reviewed.
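
            To illustrate the suspected failure mode, here is a hedged sketch (the connection and signal names are hypothetical, and it assumes the debug sync facility keeps a single current signal, as the overwriting hypothesis implies):

            --connection con1
            SET DEBUG_SYNC="now SIGNAL sig_a";
            --connection con2
            # this replaces sig_a before any waiter has consumed it
            SET DEBUG_SYNC="now SIGNAL sig_b";
            --connection default
            # never sees sig_a and eventually hits the debug sync timeout
            SET DEBUG_SYNC="now WAIT_FOR sig_a";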

            There was some suspicion that MDEV-29593 might have an influence, but I could not find a way in which MDEV-29593 could affect this. lock_rec_store_on_page_infimum() is invoked before page_cur_delete_rec(page_cursor, *offsets, mtr); in btr_cur_pessimistic_update(), and the hang happens before the record is deleted. Besides, btr_cur_pessimistic_update() must be invoked under MTR_MEMO_PAGE_X_FIX.


            marko Marko Mäkelä added a comment:

            I got this again on 10.6 today, exactly as in the Description. One innobase_kill_query is blocked on lock_sys.wait_mutex, which is held by UPDATE t SET c = "abcdefghij" WHERE pk = 10. That transaction had not written any undo log (trx->undo_no==0) and was not holding any other record locks. If I got it right, it was waiting for a lock on the page infimum record of the clustered index root page.

            The UPDATE t SET c = NULL WHERE pk = 10 had only written one undo log record.

            All history up to these two transactions had been purged. vlad.lesin has a hypothesis that an unexpected transaction due to dict_stats_save() could disrupt the DEBUG_SYNC flow. I will check whether the following fixes this occasional failure:

            diff --git a/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test b/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test
            index 3a04c7127c8..0f88f8d9d0f 100644
            --- a/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test
            +++ b/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test
            @@ -3,7 +3,8 @@
             --source include/have_debug.inc
             --source include/have_debug_sync.inc
             
            -CREATE TABLE t (pk int PRIMARY KEY, c varchar(10)) ENGINE=InnoDB;
            +CREATE TABLE t (pk int PRIMARY KEY, c varchar(10))
            +STATS_PERSISTENT=0 ENGINE=InnoDB;
             INSERT INTO t VALUES (10, "0123456789");
             
             --connection default
            
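            For context, a hedged note on what the patch above does: STATS_PERSISTENT=0 disables persistent statistics for this one table, so no background dict_stats_save() transaction should be started for it. An equivalent way to rule this factor out globally (a sketch, not taken from the report) would be:

            # save and restore the global default around the test body
            SET @save_stats_persistent= @@GLOBAL.innodb_stats_persistent;
            SET GLOBAL innodb_stats_persistent=OFF;
            # ... test body ...
            SET GLOBAL innodb_stats_persistent= @save_stats_persistent;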


            vlad.lesin Vladislav Lesin added a comment:

            It continues to hang even after adding STATS_PERSISTENT=0. See https://buildbot.mariadb.net/buildbot/builders/winx64-debug/builds/43053:

            innodb.lock_move_wait_lock_race 'innodb' w1 [ pass ]  301033
            


            People

              Assignee: vlad.lesin Vladislav Lesin
              Reporter: vlad.lesin Vladislav Lesin