[MDEV-32096] Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait Created: 2023-09-05  Updated: 2023-11-20  Resolved: 2023-09-11

Status: Closed
Project: MariaDB Server
Component/s: Locking, Replication, Storage Engine - InnoDB
Affects Version/s: 10.6.0, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2
Fix Version/s: 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 1
Labels: performance

Attachments: File mdev32096_testcase.patch    
Issue Links:
Problem/Incident
causes MDEV-32530 Race condition in lock_wait_rpl_report() Closed
is caused by MDEV-24671 Assertion failure in lock_wait_table_... Closed
Relates

 Description   

MDEV-24671 introduced a race condition in the function innobase_kill_query(), which is responsible for interrupting a lock wait for the target of a KILL QUERY or KILL CONNECTION statement.

This can severely affect optimistic (and aggressive) parallel replication. If the race is triggered, conflicts are not resolved correctly and parallel replication will be blocked until --innodb-lock-wait-timeout. This will be seen in SHOW PROCESSLIST as one worker being in the "killed" state and some other worker stuck in a query.

A user reported a hang of parallel replication due to this, and knielsen spotted the data race: If the target transaction starts a lock wait roughly at the same time as innobase_kill_query() is invoked, then trx->lock.wait_lock could be read as nullptr and the lock wait would not be interrupted. Therefore, we need to acquire lock_sys.wait_mutex before checking if a lock wait needs to be aborted.

Attached mdev32096_testcase.patch is an (ugly) ./mtr testcase that triggers the problem.



 Comments   
Comment by Marko Mäkelä [ 2023-09-05 ]

knielsen, thank you for mdev32096_testcase.patch. I created a simpler one. It requires a new DEBUG_SYNC point, because right after row_search_rec_loop we would invoke trx_is_interrupted():

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index ba2b60b4c5b..b11bd7d4cd1 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -1555,6 +1555,10 @@ lock_rec_lock(
   ut_ad(~mode & (LOCK_GAP | LOCK_REC_NOT_GAP));
   ut_ad(dict_index_is_clust(index) || !dict_index_is_online_ddl(index));
   DBUG_EXECUTE_IF("innodb_report_deadlock", return DB_DEADLOCK;);
+#ifdef ENABLED_DEBUG_SYNC
+  if (trx->mysql_thd)
+    DEBUG_SYNC_C("lock_rec");
+#endif
 
   ut_ad((LOCK_MODE_MASK & mode) != LOCK_S ||
         lock_table_has(trx, index->table, LOCK_IS));

Without the fix, the test case hangs where it is expected:

#5  0x000056195aee2bcb in safe_cond_wait (cond=0x7f5281c2e710, mp=0x56195ba3da40 <lock_sys+192>, file=0x561959f5d480 "/mariadb/10.6/storage/innobase/lock/lock0lock.cc", line=1919) at /mariadb/10.6/mysys/thr_mutex.c:492
#6  0x000056195ad5fef4 in lock_wait (thr=thr@entry=0x7f522400d1d8) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1919

Here is the test case:

--source include/have_innodb.inc
--source include/have_debug_sync.inc
 
CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t VALUES(1);
BEGIN;
SELECT * FROM t FOR UPDATE;
 
connect (con1, localhost, root,,);
BEGIN;
# infinite timeout
SET innodb_lock_wait_timeout=100000000;
let $ID= `SELECT @id := CONNECTION_ID()`;
SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed';
send SELECT * FROM t FOR UPDATE;
 
connection default;
let $ignore= `SELECT @id := $ID`;
 
SET DEBUG_SYNC='now WAIT_FOR locking';
KILL QUERY @id;
SET DEBUG_SYNC='now SIGNAL proceed';
 
connection con1;
--error ER_QUERY_INTERRUPTED
reap;
disconnect con1;
 
connection default;
SET DEBUG_SYNC=RESET;
 
COMMIT;
DROP TABLE t;

Comment by Vladislav Lesin [ 2023-09-06 ]

The fix looks good to me.

I have some proposal of how to avoid unnecessary cursor restoring under some circumstances.

In the case if innobase_kill_query() was finished after lock_rec_enqueue_waiting() call and before the first trx->lock.wait_lock check in lock_wait() in parallel thread(suppose we change lock_rec debug sync point to lock_wait_start in your test), lock_wait() returns DB_SUCCESS as trx->lock.was_chosen_as_deadlock_victim was not set(see the following code).

  lock_t *wait_lock= trx->lock.wait_lock;                                       
                                                                                
  if (!wait_lock)                                                               
  {                                                                             
    /* The lock has already been released or this transaction                   
    was chosen as a deadlock victim: no need to wait */                         
    trx->error_state=                                                           
      trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS;       
    return trx->error_state;                                                    
  }

And then row_search_mvcc() jumps to rec_loop label, as row_mysql_handle_errors() returns true, where trx_is_interrupted(trx) is checked.

dberr_t                                                                         
row_search_mvcc(...)                                                            
{                                                                               
...                                                                             
rec_loop:                                                                       
        DEBUG_SYNC_C("row_search_rec_loop");                                    
        if (trx_is_interrupted(trx)) {                                          
                if (!spatial_search) {                                          
                        btr_pcur_store_position(pcur, &mtr);                    
                }                                                               
                err = DB_INTERRUPTED;                                           
                goto normal_return;                                             
        }                                                                       
...                                                                             
        if (row_mysql_handle_errors(&err, trx, thr, NULL)) {                    
...
                sel_restore_position_for_mysql(...);
...                                                                             
                goto rec_loop;                                                  
        }                                                                       
                                                                                
        thr->lock_state = QUE_THR_LOCK_NOLOCK;                                  
                                                                                
        goto func_exit;                                                         
...                                                                             
}

Maybe we could make the path shorter and check if the transaction was interrupted in the above block in lock_wait(), so that lock_wait() would return DB_INTERRUPTED instead of DB_SUCCESS? Such a way we could avoid unnecessary cursor restoring.

Comment by Marko Mäkelä [ 2023-09-06 ]

vlad.lesin, thank you, your suggestion makes sense. The following would implement it:

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
index 268c094ffde..098113ea971 100644
--- a/storage/innobase/lock/lock0lock.cc
+++ b/storage/innobase/lock/lock0lock.cc
@@ -1853,8 +1853,10 @@ dberr_t lock_wait(que_thr_t *thr)
   {
     /* The lock has already been released or this transaction
     was chosen as a deadlock victim: no need to wait */
-    trx->error_state=
-      trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS;
+    if (trx->lock.was_chosen_as_deadlock_victim)
+      trx->error_state= DB_DEADLOCK;
+    else if (trx->error_state == DB_LOCK_WAIT)
+      trx->error_state= DB_SUCCESS;
     return trx->error_state;
   }
 

I tested it with the following:

diff --git a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test
index 2bbe2b3b139..5dbeb9b6a12 100644
--- a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test
+++ b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test
@@ -11,7 +11,7 @@ BEGIN;
 # infinite timeout
 SET innodb_lock_wait_timeout=100000000;
 let $ID= `SELECT @id := CONNECTION_ID()`;
-SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed';
+SET DEBUG_SYNC='lock_wait_start SIGNAL locking WAIT_FOR proceed';
 send SELECT * FROM t FOR UPDATE;
 
 connection default;

Between the execution of lock_wait() getting past the lock_wait_start point and the server shutdown, there was no call to trx_is_interrupted(). The trx->error_state (of DB_INTERRUPTED, set by innobase_kill_query()) is being returned “as is”.

Generated at Thu Feb 08 10:28:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.