[MDEV-32096] Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6.0, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3
Component/s: Locking, Replication, Storage Engine - InnoDB
Labels:
- performance

Description

~~MDEV-24671~~ introduced a race condition in the function innobase_kill_query(), which is responsible for interrupting a lock wait for the target of a KILL QUERY or KILL CONNECTION statement.

This can severely affect optimistic (and aggressive) parallel replication. If the race is triggered, conflicts are not resolved correctly and parallel replication will be blocked until --innodb-lock-wait-timeout. This will be seen in SHOW PROCESSLIST as one worker being in the "killed" state and some other worker stuck in a query.

A user reported a hang of parallel replication due to this, and knielsen spotted the data race: If the target transaction starts a lock wait roughly at the same time as innobase_kill_query() is invoked, then trx->lock.wait_lock could be read as nullptr and the lock wait would not be interrupted. Therefore, we need to acquire lock_sys.wait_mutex before checking if a lock wait needs to be aborted.

Attached mdev32096_testcase.patch is an (ugly) ./mtr testcase that triggers the problem.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mdev32096_testcase.patch
4 kB
2023-09-05 11:19

Issue Links

causes

MDEV-32530 Race condition in lock_wait_rpl_report()

Closed

is caused by

MDEV-24671 Assertion failure in lock_wait_table_reserve_slot()

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2023-09-05 12:20 - edited

knielsen, thank you for mdev32096_testcase.patch. I created a simpler one. It requires a new DEBUG_SYNC point, because right after row_search_rec_loop we would invoke trx_is_interrupted():

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc

index ba2b60b4c5b..b11bd7d4cd1 100644

--- a/storage/innobase/lock/lock0lock.cc

+++ b/storage/innobase/lock/lock0lock.cc

@@ -1555,6 +1555,10 @@ lock_rec_lock(

   ut_ad(~mode & (LOCK_GAP | LOCK_REC_NOT_GAP));

   ut_ad(dict_index_is_clust(index) || !dict_index_is_online_ddl(index));

   DBUG_EXECUTE_IF("innodb_report_deadlock", return DB_DEADLOCK;);

+#ifdef ENABLED_DEBUG_SYNC

+  if (trx->mysql_thd)

+    DEBUG_SYNC_C("lock_rec");

+#endif

   ut_ad((LOCK_MODE_MASK & mode) != LOCK_S ||

         lock_table_has(trx, index->table, LOCK_IS));

Without the fix, the test case hangs where it is expected:

#5  0x000056195aee2bcb in safe_cond_wait (cond=0x7f5281c2e710, mp=0x56195ba3da40 <lock_sys+192>, file=0x561959f5d480 "/mariadb/10.6/storage/innobase/lock/lock0lock.cc", line=1919) at /mariadb/10.6/mysys/thr_mutex.c:492

#6  0x000056195ad5fef4 in lock_wait (thr=thr@entry=0x7f522400d1d8) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1919

Here is the test case:

--source include/have_innodb.inc

--source include/have_debug_sync.inc

CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;

INSERT INTO t VALUES(1);

BEGIN;

SELECT * FROM t FOR UPDATE;

connect (con1, localhost, root,,);

BEGIN;

# infinite timeout

SET innodb_lock_wait_timeout=100000000;

let $ID= `SELECT @id := CONNECTION_ID()`;

SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed';

send SELECT * FROM t FOR UPDATE;

connection default;

let $ignore= `SELECT @id := $ID`;

SET DEBUG_SYNC='now WAIT_FOR locking';

KILL QUERY @id;

SET DEBUG_SYNC='now SIGNAL proceed';

connection con1;

--error ER_QUERY_INTERRUPTED

reap;

disconnect con1;

connection default;

SET DEBUG_SYNC=RESET;

COMMIT;

DROP TABLE t;

Marko Mäkelä added a comment - 2023-09-05 12:20 - edited knielsen , thank you for mdev32096_testcase.patch . I created a simpler one. It requires a new DEBUG_SYNC point, because right after row_search_rec_loop we would invoke trx_is_interrupted() : diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc index ba2b60b4c5b..b11bd7d4cd1 100644 --- a/storage/innobase/lock/lock0lock.cc +++ b/storage/innobase/lock/lock0lock.cc @@ -1555,6 +1555,10 @@ lock_rec_lock( ut_ad(~mode & (LOCK_GAP | LOCK_REC_NOT_GAP)); ut_ad(dict_index_is_clust(index) || !dict_index_is_online_ddl(index)); DBUG_EXECUTE_IF("innodb_report_deadlock", return DB_DEADLOCK;); +#ifdef ENABLED_DEBUG_SYNC + if (trx->mysql_thd) + DEBUG_SYNC_C("lock_rec"); +#endif ut_ad((LOCK_MODE_MASK & mode) != LOCK_S || lock_table_has(trx, index->table, LOCK_IS)); Without the fix, the test case hangs where it is expected: #5 0x000056195aee2bcb in safe_cond_wait (cond=0x7f5281c2e710, mp=0x56195ba3da40 <lock_sys+192>, file=0x561959f5d480 "/mariadb/10.6/storage/innobase/lock/lock0lock.cc", line=1919) at /mariadb/10.6/mysys/thr_mutex.c:492 #6 0x000056195ad5fef4 in lock_wait (thr=thr@entry=0x7f522400d1d8) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1919 Here is the test case: --source include/have_innodb.inc --source include/have_debug_sync.inc CREATE TABLE t(a INT PRIMARY KEY ) ENGINE=InnoDB; INSERT INTO t VALUES (1); BEGIN ; SELECT * FROM t FOR UPDATE ; connect (con1, localhost, root,,); BEGIN ; # infinite timeout SET innodb_lock_wait_timeout=100000000; let $ID= ` SELECT @id := CONNECTION_ID()`; SET DEBUG_SYNC= 'lock_rec SIGNAL locking WAIT_FOR proceed' ; send SELECT * FROM t FOR UPDATE ; connection default ; let $ ignore = ` SELECT @id := $ID`; SET DEBUG_SYNC= 'now WAIT_FOR locking' ; KILL QUERY @id; SET DEBUG_SYNC= 'now SIGNAL proceed' ; connection con1; --error ER_QUERY_INTERRUPTED reap; disconnect con1; connection default ; SET DEBUG_SYNC=RESET; COMMIT ; DROP TABLE t;

Vladislav Lesin added a comment - 2023-09-06 08:07

The fix looks good to me.

I have some proposal of how to avoid unnecessary cursor restoring under some circumstances.

In the case if innobase_kill_query() was finished after lock_rec_enqueue_waiting() call and before the first trx->lock.wait_lock check in lock_wait() in parallel thread(suppose we change lock_rec debug sync point to lock_wait_start in your test), lock_wait() returns DB_SUCCESS as trx->lock.was_chosen_as_deadlock_victim was not set(see the following code).

  lock_t *wait_lock= trx->lock.wait_lock;

  if (!wait_lock)

    /* The lock has already been released or this transaction

    was chosen as a deadlock victim: no need to wait */

    trx->error_state=

      trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS;

    return trx->error_state;

And then row_search_mvcc() jumps to rec_loop label, as row_mysql_handle_errors() returns true, where trx_is_interrupted(trx) is checked.

dberr_t

row_search_mvcc(...)

...

rec_loop:

        DEBUG_SYNC_C("row_search_rec_loop");

        if (trx_is_interrupted(trx)) {

                if (!spatial_search) {

                        btr_pcur_store_position(pcur, &mtr);

                err = DB_INTERRUPTED;

                goto normal_return;

...

        if (row_mysql_handle_errors(&err, trx, thr, NULL)) {

...

                sel_restore_position_for_mysql(...);

...

                goto rec_loop;

        thr->lock_state = QUE_THR_LOCK_NOLOCK;

        goto func_exit;

...

Maybe we could make the path shorter and check if the transaction was interrupted in the above block in lock_wait(), so that lock_wait() would return DB_INTERRUPTED instead of DB_SUCCESS? Such a way we could avoid unnecessary cursor restoring.

Vladislav Lesin added a comment - 2023-09-06 08:07 The fix looks good to me. I have some proposal of how to avoid unnecessary cursor restoring under some circumstances. In the case if innobase_kill_query() was finished after lock_rec_enqueue_waiting() call and before the first trx->lock.wait_lock check in lock_wait() in parallel thread(suppose we change lock_rec debug sync point to lock_wait_start in your test), lock_wait() returns DB_SUCCESS as trx->lock.was_chosen_as_deadlock_victim was not set(see the following code). lock_t *wait_lock= trx->lock.wait_lock; if (!wait_lock) { /* The lock has already been released or this transaction was chosen as a deadlock victim: no need to wait */ trx->error_state= trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS; return trx->error_state; } And then row_search_mvcc() jumps to rec_loop label, as row_mysql_handle_errors() returns true, where trx_is_interrupted(trx) is checked. dberr_t row_search_mvcc(...) { ... rec_loop: DEBUG_SYNC_C( "row_search_rec_loop" ); if (trx_is_interrupted(trx)) { if (!spatial_search) { btr_pcur_store_position(pcur, &mtr); } err = DB_INTERRUPTED; goto normal_return; } ... if (row_mysql_handle_errors(&err, trx, thr, NULL)) { ... sel_restore_position_for_mysql(...); ... goto rec_loop; } thr->lock_state = QUE_THR_LOCK_NOLOCK; goto func_exit; ... } Maybe we could make the path shorter and check if the transaction was interrupted in the above block in lock_wait(), so that lock_wait() would return DB_INTERRUPTED instead of DB_SUCCESS? Such a way we could avoid unnecessary cursor restoring.

Marko Mäkelä added a comment - 2023-09-06 09:48

vlad.lesin, thank you, your suggestion makes sense. The following would implement it:

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc

index 268c094ffde..098113ea971 100644

--- a/storage/innobase/lock/lock0lock.cc

+++ b/storage/innobase/lock/lock0lock.cc

@@ -1853,8 +1853,10 @@ dberr_t lock_wait(que_thr_t *thr)

     /* The lock has already been released or this transaction

     was chosen as a deadlock victim: no need to wait */

-    trx->error_state=

-      trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS;

+    if (trx->lock.was_chosen_as_deadlock_victim)

+      trx->error_state= DB_DEADLOCK;

+    else if (trx->error_state == DB_LOCK_WAIT)

+      trx->error_state= DB_SUCCESS;

     return trx->error_state;

I tested it with the following:

diff --git a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test

index 2bbe2b3b139..5dbeb9b6a12 100644

--- a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test

+++ b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test

@@ -11,7 +11,7 @@ BEGIN;

 # infinite timeout

 SET innodb_lock_wait_timeout=100000000;

 let $ID= `SELECT @id := CONNECTION_ID()`;

-SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed';

+SET DEBUG_SYNC='lock_wait_start SIGNAL locking WAIT_FOR proceed';

 send SELECT * FROM t FOR UPDATE;

 connection default;

Between the execution of lock_wait() getting past the lock_wait_start point and the server shutdown, there was no call to trx_is_interrupted(). The trx->error_state (of DB_INTERRUPTED, set by innobase_kill_query()) is being returned “as is”.

Marko Mäkelä added a comment - 2023-09-06 09:48 vlad.lesin , thank you, your suggestion makes sense. The following would implement it: diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc index 268c094ffde..098113ea971 100644 --- a/storage/innobase/lock/lock0lock.cc +++ b/storage/innobase/lock/lock0lock.cc @@ -1853,8 +1853,10 @@ dberr_t lock_wait(que_thr_t *thr) { /* The lock has already been released or this transaction was chosen as a deadlock victim: no need to wait */ - trx->error_state= - trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS; + if (trx->lock.was_chosen_as_deadlock_victim) + trx->error_state= DB_DEADLOCK; + else if (trx->error_state == DB_LOCK_WAIT) + trx->error_state= DB_SUCCESS; return trx->error_state; } I tested it with the following: diff --git a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test index 2bbe2b3b139..5dbeb9b6a12 100644 --- a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test +++ b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test @@ -11,7 +11,7 @@ BEGIN; # infinite timeout SET innodb_lock_wait_timeout=100000000; let $ID= `SELECT @id := CONNECTION_ID()`; -SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed'; +SET DEBUG_SYNC='lock_wait_start SIGNAL locking WAIT_FOR proceed'; send SELECT * FROM t FOR UPDATE; connection default; Between the execution of lock_wait() getting past the lock_wait_start point and the server shutdown, there was no call to trx_is_interrupted() . The trx->error_state (of DB_INTERRUPTED , set by innobase_kill_query() ) is being returned “as is”.

MariaDB Server

Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration