Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
10.6.0, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
Description
MDEV-24671 introduced a race condition in the function innobase_kill_query(), which is responsible for interrupting a lock wait for the target of a KILL QUERY or KILL CONNECTION statement.
This can severely affect optimistic (and aggressive) parallel replication. If the race is triggered, conflicts are not resolved correctly and parallel replication will be blocked until --innodb-lock-wait-timeout expires. This will be seen in SHOW PROCESSLIST as one worker being in the "killed" state and some other worker stuck in a query.
A user reported a hang of parallel replication due to this, and knielsen spotted the data race: if the target transaction starts a lock wait at roughly the same time as innobase_kill_query() is invoked, then trx->lock.wait_lock could be read as nullptr and the lock wait would not be interrupted. Therefore, we need to acquire lock_sys.wait_mutex before checking whether a lock wait needs to be aborted.
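The locking rule described above can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the InnoDB code: Lock, Trx, LockSys, lock_wait(), kill_query() and kill_interrupts_wait() are hypothetical stand-ins, the kill flag is kept inside the model transaction for simplicity, and std::mutex plus std::condition_variable stand in for lock_sys.wait_mutex and the wait condition. The point it shows is that the killer reads wait_lock only while holding the same mutex under which the waiter publishes it, so the kill cannot fall between "check for a waiter" and "waiter goes to sleep".

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for InnoDB structures; not the real definitions.
struct Lock {};

struct Trx {
  Lock* wait_lock = nullptr;  // non-null only while the transaction is waiting
  bool killed = false;        // model of the "query was killed" flag
};

struct LockSys {
  std::mutex wait_mutex;         // models lock_sys.wait_mutex
  std::condition_variable cond;  // wakes up a waiting transaction
};

// Waiter side: publish wait_lock and sleep, all under wait_mutex.
// Returns true if the wait was interrupted by a kill, false on timeout.
bool lock_wait(LockSys& sys, Trx& trx, Lock& lock,
               std::chrono::milliseconds timeout) {
  std::unique_lock<std::mutex> guard(sys.wait_mutex);
  if (trx.killed) return true;  // the kill arrived before the wait started
  trx.wait_lock = &lock;
  // wait_for() with a predicate returns the final value of the predicate.
  bool interrupted =
      sys.cond.wait_for(guard, timeout, [&] { return trx.killed; });
  trx.wait_lock = nullptr;
  return interrupted;
}

// Killer side: acquire wait_mutex BEFORE checking whether there is a wait
// to abort. Reading wait_lock without the mutex is the race being modeled.
void kill_query(LockSys& sys, Trx& trx) {
  std::lock_guard<std::mutex> guard(sys.wait_mutex);
  trx.killed = true;
  if (trx.wait_lock != nullptr) sys.cond.notify_all();
}

// Runs one waiter and one killer concurrently; with the mutex held on both
// sides, the waiter is interrupted regardless of which thread runs first.
bool kill_interrupts_wait() {
  LockSys sys;
  Trx trx;
  Lock lock;
  bool interrupted = false;
  std::thread waiter([&] {
    interrupted = lock_wait(sys, trx, lock, std::chrono::seconds(5));
  });
  kill_query(sys, trx);
  waiter.join();  // join() makes the write to `interrupted` visible here
  return interrupted;
}
```

In this model, if kill_query() read trx.wait_lock without holding wait_mutex and skipped its work on seeing nullptr, a waiter that was just about to publish wait_lock would sleep until the timeout, which mirrors the hang reported here.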
The attached mdev32096_testcase.patch is an (ugly) ./mtr test case that triggers the problem.
Issue Links
- causes: MDEV-32530 Race condition in lock_wait_rpl_report() (Closed)
- is caused by: MDEV-24671 Assertion failure in lock_wait_table_reserve_slot() (Closed)
knielsen, thank you for mdev32096_testcase.patch. I created a simpler one. It requires a new DEBUG_SYNC point, because right after row_search_rec_loop we would invoke trx_is_interrupted():
diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index ba2b60b4c5b..b11bd7d4cd1 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -1555,6 +1555,10 @@ lock_rec_lock(
   ut_ad(~mode & (LOCK_GAP | LOCK_REC_NOT_GAP));
   ut_ad(dict_index_is_clust(index) || !dict_index_is_online_ddl(index));
   DBUG_EXECUTE_IF("innodb_report_deadlock", return DB_DEADLOCK;);
+#ifdef ENABLED_DEBUG_SYNC
+  if (trx->mysql_thd)
+    DEBUG_SYNC_C("lock_rec");
+#endif
   ut_ad((LOCK_MODE_MASK & mode) != LOCK_S ||
Without the fix, the test case hangs where it is expected:
#5 0x000056195aee2bcb in safe_cond_wait (cond=0x7f5281c2e710, mp=0x56195ba3da40 <lock_sys+192>, file=0x561959f5d480 "/mariadb/10.6/storage/innobase/lock/lock0lock.cc", line=1919) at /mariadb/10.6/mysys/thr_mutex.c:492
#6 0x000056195ad5fef4 in lock_wait (thr=thr@entry=0x7f522400d1d8) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1919
Here is the test case:
--source include/have_innodb.inc
--source include/have_debug_sync.inc
# infinite timeout
KILL QUERY @id;
--error ER_QUERY_INTERRUPTED
reap;
disconnect con1;