MariaDB Server / MDEV-30940

innodb.lock_move_wait_lock_race hangs sporadically

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6
    • Fix Version/s: 10.6
    • Component/s: Tests
    • Labels: None

    Description

      There is a hang in this test:
      CURRENT_TEST: innodb.lock_move_wait_lock_race

      mysqltest: At line 57: query 'DROP TABLE t' failed: <Unknown> (2013): Lost connection to server during query
      

      probably caused by this:

      #8  0x0000562390d6d325 in debug_sync (thd=0x7f030c000d58, sync_point_name=<optimized out>, name_len=23) at /mariadb/10.6/sql/debug_sync.cc:1760
      #9  0x000056239114dfaa in lock_sys_t::cancel<false> (trx=trx@entry=0x7f0390880180, lock=0x7f0390880280) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:5792
      #10 0x0000562391149a25 in lock_wait (thr=thr@entry=0x7f030c05eb18) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1891
      #11 0x00005623911e1547 in row_mysql_handle_errors (new_err=new_err@entry=0x7f0388620030, trx=trx@entry=0x7f0390880180, thr=thr@entry=0x7f030c05eb18, savept=savept@entry=0x0) at /mariadb/10.6/storage/innobase/row/row0mysql.cc:681
      #12 0x000056239120d9f4 in row_search_mvcc (buf=buf@entry=0x7f030c064e98 "\377", mode=PAGE_CUR_GE, prebuilt=0x7f030c05e480, match_mode=<optimized out>, direction=direction@entry=0) at /mariadb/10.6/storage/innobase/row/row0sel.cc:5856
      #13 0x0000562390fec356 in ha_innobase::index_read (this=0x7f030c052300, buf=0x7f030c064e98 "\377", key_ptr=0x7f030c060310 "\n", key_len=4, find_flag=HA_READ_KEY_EXACT) at /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:9060
      in UPDATE t SET c = "abcdefghij" WHERE pk = 10
      

      And this:

      #8  0x0000562390d6d325 in debug_sync (thd=0x7f0314000d58, sync_point_name=<optimized out>, name_len=34) at /mariadb/10.6/sql/debug_sync.cc:1760
      #9  0x0000562391137aba in operator() (__closure=__closure@entry=0x7f038b484066) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:3206
      #10 0x000056239114137c in detail::scope_exit<lock_rec_store_on_page_infimum(const buf_block_t*, const rec_t*)::<lambda()> >::~scope_exit (this=0x7f038b484066, __in_chrg=<optimized out>) at /mariadb/10.6/include/scope.h:50
      #11 lock_rec_store_on_page_infimum (block=block@entry=0x7f038b815950, rec=<optimized out>) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:3212
      #12 0x00005623910969db in btr_cur_pessimistic_update (flags=flags@entry=10, cursor=cursor@entry=0x7f030804d708, offsets=offsets@entry=0x7f038b4841a8, offsets_heap=offsets_heap@entry=0x7f038b484248, entry_heap=entry_heap@entry=0x7f03141eaee0, big_rec=big_rec@entry=0x7f038b4841b8, update=0x7f030804d7e0, cmpl_info=1, thr=0x7f03141ee6f0, trx_id=2517, mtr=0x7f038b4844b0) at /mariadb/10.6/storage/innobase/btr/btr0cur.cc:4094
      

      in UPDATE t SET c = NULL WHERE pk = 10.

      The test was killed in order not to wait for the debug sync point timeout.
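
      For orientation, frames #9 to #11 of the second stack show the sync point being fired from a scope_exit destructor, i.e. only when lock_rec_store_on_page_infimum() returns. A minimal, self-contained model of such a guard (a simplified sketch, not the actual InnoDB detail::scope_exit) is:

      #include <cstdio>
      #include <functional>

      // Simplified scope-exit guard: runs its callback when the enclosing
      // scope ends, which is why the sync point shows up in a destructor frame.
      template <typename F> struct scope_exit
      {
        F f;
        ~scope_exit() { f(); }
      };

      // Stand-in for lock_rec_store_on_page_infimum(): the callback plays the
      // role of the DEBUG_SYNC call and fires only on return from the function.
      static void store_on_page_infimum_model()
      {
        scope_exit<std::function<void()>> guard
          {[] { std::puts("sync point fired at end of function"); }};
        std::puts("move the waiting lock to the page infimum");
      } // guard is destroyed here, like frame #10 in the stack above

      int main()
      {
        store_on_page_infimum_model();
        return 0;
      }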

          Activity

            vlad.lesin Vladislav Lesin added a comment:

            Failed to reproduce it with:

            ./mtr --max-test-fail=1 --debug-sync-timeout=10 --parallel=256 innodb.lock_move_wait_lock_race{,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,} --repeat=1000
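
            For reference, the {,,,…} part of that command line is shell brace expansion (assuming a shell such as bash): each comma adds one more copy of the test name, so mtr receives one copy per worker and --parallel runs them concurrently. A smaller sketch of the same idea:

            # brace expansion yields four copies of the test name for four workers
            ./mtr --parallel=4 --repeat=100 innodb.lock_move_wait_lock_race{,,,}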
            

            But according to the backtraces, it looks like the trx2_moved_locks signal was lost. lock_rec_store_on_page_infimum() signals trx2_moved_locks to indicate that it has been entered, and then waits for the trx2_cont signal to continue. But trx2_cont is sent only after another connection has received trx2_moved_locks.

            So it looks like the test hangs on SET DEBUG_SYNC="now WAIT_FOR trx2_moved_locks";. But why does mtr report that the connection was lost on the DROP TABLE?

            The second backtrace shows that the lock_rec_store_on_page_infimum_end sync point was reached, so trx2_moved_locks should have been emitted. But the signal was not caught.
            There should be no signal overwriting, because the sequence of SET DEBUG_SYNC statements in the default connection serializes the order in which the sync points are reached:

            --connection default                                                            
            SET DEBUG_SYNC="now WAIT_FOR trx3_start_waiting";                               
            COMMIT;                                                                         
            SET DEBUG_SYNC="now WAIT_FOR trx2_wait_end";                                    
            SET DEBUG_SYNC="now SIGNAL trx3_cont_waiting";                                  
            SET DEBUG_SYNC="now WAIT_FOR trx3_cancel_enter";                                
            SET DEBUG_SYNC="now SIGNAL trx2_cont_upd";                                      
            SET DEBUG_SYNC="now WAIT_FOR trx2_moved_locks";
            

            Still, the only explanation I can see for a signal being emitted but not caught is that it was overwritten by some other signal. It seems there have been some changes to the debug sync points recently; they could affect the test somehow and need to be reviewed.
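
            To illustrate the suspected failure mode, here is a hedged sketch (the connection and signal names are hypothetical, and it assumes the debug sync facility keeps a single current signal, as the overwriting hypothesis implies):

            --connection con1
            SET DEBUG_SYNC="now SIGNAL sig_a";
            --connection con2
            # this replaces sig_a before any waiter has consumed it
            SET DEBUG_SYNC="now SIGNAL sig_b";
            --connection default
            # never sees sig_a and eventually hits the debug sync timeout
            SET DEBUG_SYNC="now WAIT_FOR sig_a";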

            There was some suspicion that MDEV-29593 might have an influence, but I could not find a way in which MDEV-29593 could affect this. lock_rec_store_on_page_infimum() is invoked before page_cur_delete_rec(page_cursor, *offsets, mtr); in btr_cur_pessimistic_update(), and the hang happens before the record is deleted. Besides, btr_cur_pessimistic_update() must be invoked under MTR_MEMO_PAGE_X_FIX.


            marko Marko Mäkelä added a comment:

            I got this again on 10.6 today, exactly as in the Description. One innobase_kill_query is blocked on lock_sys.wait_mutex, which is held by UPDATE t SET c = "abcdefghij" WHERE pk = 10. That transaction had not written any undo log (trx->undo_no==0) and was not holding any other record locks. If I got it right, it was waiting for a lock on the page infimum record of the clustered index root page.

            The UPDATE t SET c = NULL WHERE pk = 10 had only written one undo log record.

            All history up to these two transactions had been purged. vlad.lesin has a hypothesis that an unexpected transaction due to dict_stats_save() could disrupt the DEBUG_SYNC flow. I will check whether the following fixes this occasional failure:

            diff --git a/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test b/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test
            index 3a04c7127c8..0f88f8d9d0f 100644
            --- a/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test
            +++ b/mysql-test/suite/innodb/t/lock_move_wait_lock_race.test
            @@ -3,7 +3,8 @@
             --source include/have_debug.inc
             --source include/have_debug_sync.inc
             
            -CREATE TABLE t (pk int PRIMARY KEY, c varchar(10)) ENGINE=InnoDB;
            +CREATE TABLE t (pk int PRIMARY KEY, c varchar(10))
            +STATS_PERSISTENT=0 ENGINE=InnoDB;
             INSERT INTO t VALUES (10, "0123456789");
             
             --connection default
            
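            For context, a hedged note on what the patch above does: STATS_PERSISTENT=0 disables persistent statistics for this one table, so no background dict_stats_save() transaction should be started for it. An equivalent way to rule this factor out globally (a sketch, not taken from the report) would be:

            # save and restore the global default around the test body
            SET @save_stats_persistent= @@GLOBAL.innodb_stats_persistent;
            SET GLOBAL innodb_stats_persistent=OFF;
            # ... test body ...
            SET GLOBAL innodb_stats_persistent= @save_stats_persistent;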


            vlad.lesin Vladislav Lesin added a comment:

            It continues to hang even after adding STATS_PERSISTENT=0. See https://buildbot.mariadb.net/buildbot/builders/winx64-debug/builds/43053:

            innodb.lock_move_wait_lock_race 'innodb' w1 [ pass ]  301033
            


            People

              Assignee: vlad.lesin Vladislav Lesin
              Reporter: vlad.lesin Vladislav Lesin