Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32096

Parallel replication lags because innobase_kill_query() may fail to interrupt a lock wait

Details

    Description

      MDEV-24671 introduced a race condition in the function innobase_kill_query(), which is responsible for interrupting a lock wait for the target of a KILL QUERY or KILL CONNECTION statement.

      This can severely affect optimistic (and aggressive) parallel replication. If the race is triggered, conflicts are not resolved correctly and parallel replication will be blocked until --innodb-lock-wait-timeout. This will be seen in SHOW PROCESSLIST as one worker being in the "killed" state and some other worker stuck in a query.

      A user reported a hang of parallel replication due to this, and knielsen spotted the data race: If the target transaction starts a lock wait roughly at the same time as innobase_kill_query() is invoked, then trx->lock.wait_lock could be read as nullptr and the lock wait would not be interrupted. Therefore, we need to acquire lock_sys.wait_mutex before checking if a lock wait needs to be aborted.

      Attached mdev32096_testcase.patch is an (ugly) ./mtr testcase that triggers the problem.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä added a comment - - edited

            knielsen, thank you for mdev32096_testcase.patch. I created a simpler one. It requires a new DEBUG_SYNC point, because right after row_search_rec_loop we would invoke trx_is_interrupted():

            diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
            index ba2b60b4c5b..b11bd7d4cd1 100644
            --- a/storage/innobase/lock/lock0lock.cc
            +++ b/storage/innobase/lock/lock0lock.cc
            @@ -1555,6 +1555,10 @@ lock_rec_lock(
               ut_ad(~mode & (LOCK_GAP | LOCK_REC_NOT_GAP));
               ut_ad(dict_index_is_clust(index) || !dict_index_is_online_ddl(index));
               DBUG_EXECUTE_IF("innodb_report_deadlock", return DB_DEADLOCK;);
            +#ifdef ENABLED_DEBUG_SYNC
            +  if (trx->mysql_thd)
            +    DEBUG_SYNC_C("lock_rec");
            +#endif
             
               ut_ad((LOCK_MODE_MASK & mode) != LOCK_S ||
                     lock_table_has(trx, index->table, LOCK_IS));
            

            Without the fix, the test case hangs where it is expected:

            #5  0x000056195aee2bcb in safe_cond_wait (cond=0x7f5281c2e710, mp=0x56195ba3da40 <lock_sys+192>, file=0x561959f5d480 "/mariadb/10.6/storage/innobase/lock/lock0lock.cc", line=1919) at /mariadb/10.6/mysys/thr_mutex.c:492
            #6  0x000056195ad5fef4 in lock_wait (thr=thr@entry=0x7f522400d1d8) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1919
            

            Here is the test case:

            --source include/have_innodb.inc
            --source include/have_debug_sync.inc
             
            CREATE TABLE t(a INT PRIMARY KEY) ENGINE=InnoDB;
            INSERT INTO t VALUES(1);
            BEGIN;
            SELECT * FROM t FOR UPDATE;
             
            connect (con1, localhost, root,,);
            BEGIN;
            # infinite timeout
            SET innodb_lock_wait_timeout=100000000;
            let $ID= `SELECT @id := CONNECTION_ID()`;
            SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed';
            send SELECT * FROM t FOR UPDATE;
             
            connection default;
            let $ignore= `SELECT @id := $ID`;
             
            SET DEBUG_SYNC='now WAIT_FOR locking';
            KILL QUERY @id;
            SET DEBUG_SYNC='now SIGNAL proceed';
             
            connection con1;
            --error ER_QUERY_INTERRUPTED
            reap;
            disconnect con1;
             
            connection default;
            SET DEBUG_SYNC=RESET;
             
            COMMIT;
            DROP TABLE t;
            

            marko Marko Mäkelä added a comment - - edited knielsen , thank you for mdev32096_testcase.patch . I created a simpler one. It requires a new DEBUG_SYNC point, because right after row_search_rec_loop we would invoke trx_is_interrupted() : diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc index ba2b60b4c5b..b11bd7d4cd1 100644 --- a/storage/innobase/lock/lock0lock.cc +++ b/storage/innobase/lock/lock0lock.cc @@ -1555,6 +1555,10 @@ lock_rec_lock( ut_ad(~mode & (LOCK_GAP | LOCK_REC_NOT_GAP)); ut_ad(dict_index_is_clust(index) || !dict_index_is_online_ddl(index)); DBUG_EXECUTE_IF("innodb_report_deadlock", return DB_DEADLOCK;); +#ifdef ENABLED_DEBUG_SYNC + if (trx->mysql_thd) + DEBUG_SYNC_C("lock_rec"); +#endif ut_ad((LOCK_MODE_MASK & mode) != LOCK_S || lock_table_has(trx, index->table, LOCK_IS)); Without the fix, the test case hangs where it is expected: #5 0x000056195aee2bcb in safe_cond_wait (cond=0x7f5281c2e710, mp=0x56195ba3da40 <lock_sys+192>, file=0x561959f5d480 "/mariadb/10.6/storage/innobase/lock/lock0lock.cc", line=1919) at /mariadb/10.6/mysys/thr_mutex.c:492 #6 0x000056195ad5fef4 in lock_wait (thr=thr@entry=0x7f522400d1d8) at /mariadb/10.6/storage/innobase/lock/lock0lock.cc:1919 Here is the test case: --source include/have_innodb.inc --source include/have_debug_sync.inc   CREATE TABLE t(a INT PRIMARY KEY ) ENGINE=InnoDB; INSERT INTO t VALUES (1); BEGIN ; SELECT * FROM t FOR UPDATE ;   connect (con1, localhost, root,,); BEGIN ; # infinite timeout SET innodb_lock_wait_timeout=100000000; let $ID= ` SELECT @id := CONNECTION_ID()`; SET DEBUG_SYNC= 'lock_rec SIGNAL locking WAIT_FOR proceed' ; send SELECT * FROM t FOR UPDATE ;   connection default ; let $ ignore = ` SELECT @id := $ID`;   SET DEBUG_SYNC= 'now WAIT_FOR locking' ; KILL QUERY @id; SET DEBUG_SYNC= 'now SIGNAL proceed' ;   connection con1; --error ER_QUERY_INTERRUPTED reap; disconnect con1;   connection default ; SET DEBUG_SYNC=RESET;   COMMIT ; DROP TABLE t;

            The fix looks good to me.

            I have some proposal of how to avoid unnecessary cursor restoring under some circumstances.

            In the case if innobase_kill_query() was finished after lock_rec_enqueue_waiting() call and before the first trx->lock.wait_lock check in lock_wait() in parallel thread(suppose we change lock_rec debug sync point to lock_wait_start in your test), lock_wait() returns DB_SUCCESS as trx->lock.was_chosen_as_deadlock_victim was not set(see the following code).

              lock_t *wait_lock= trx->lock.wait_lock;                                       
                                                                                            
              if (!wait_lock)                                                               
              {                                                                             
                /* The lock has already been released or this transaction                   
                was chosen as a deadlock victim: no need to wait */                         
                trx->error_state=                                                           
                  trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS;       
                return trx->error_state;                                                    
              }
            

            And then row_search_mvcc() jumps to rec_loop label, as row_mysql_handle_errors() returns true, where trx_is_interrupted(trx) is checked.

            dberr_t                                                                         
            row_search_mvcc(...)                                                            
            {                                                                               
            ...                                                                             
            rec_loop:                                                                       
                    DEBUG_SYNC_C("row_search_rec_loop");                                    
                    if (trx_is_interrupted(trx)) {                                          
                            if (!spatial_search) {                                          
                                    btr_pcur_store_position(pcur, &mtr);                    
                            }                                                               
                            err = DB_INTERRUPTED;                                           
                            goto normal_return;                                             
                    }                                                                       
            ...                                                                             
                    if (row_mysql_handle_errors(&err, trx, thr, NULL)) {                    
            ...
                            sel_restore_position_for_mysql(...);
            ...                                                                             
                            goto rec_loop;                                                  
                    }                                                                       
                                                                                            
                    thr->lock_state = QUE_THR_LOCK_NOLOCK;                                  
                                                                                            
                    goto func_exit;                                                         
            ...                                                                             
            }
            

            Maybe we could make the path shorter and check if the transaction was interrupted in the above block in lock_wait(), so that lock_wait() would return DB_INTERRUPTED instead of DB_SUCCESS? Such a way we could avoid unnecessary cursor restoring.

            vlad.lesin Vladislav Lesin added a comment - The fix looks good to me. I have some proposal of how to avoid unnecessary cursor restoring under some circumstances. In the case if innobase_kill_query() was finished after lock_rec_enqueue_waiting() call and before the first trx->lock.wait_lock check in lock_wait() in parallel thread(suppose we change lock_rec debug sync point to lock_wait_start in your test), lock_wait() returns DB_SUCCESS as trx->lock.was_chosen_as_deadlock_victim was not set(see the following code). lock_t *wait_lock= trx->lock.wait_lock; if (!wait_lock) { /* The lock has already been released or this transaction was chosen as a deadlock victim: no need to wait */ trx->error_state= trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS; return trx->error_state; } And then row_search_mvcc() jumps to rec_loop label, as row_mysql_handle_errors() returns true, where trx_is_interrupted(trx) is checked. dberr_t row_search_mvcc(...) { ... rec_loop: DEBUG_SYNC_C( "row_search_rec_loop" ); if (trx_is_interrupted(trx)) { if (!spatial_search) { btr_pcur_store_position(pcur, &mtr); } err = DB_INTERRUPTED; goto normal_return; } ... if (row_mysql_handle_errors(&err, trx, thr, NULL)) { ... sel_restore_position_for_mysql(...); ... goto rec_loop; } thr->lock_state = QUE_THR_LOCK_NOLOCK; goto func_exit; ... } Maybe we could make the path shorter and check if the transaction was interrupted in the above block in lock_wait(), so that lock_wait() would return DB_INTERRUPTED instead of DB_SUCCESS? Such a way we could avoid unnecessary cursor restoring.

            vlad.lesin, thank you, your suggestion makes sense. The following would implement it:

            diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc
            index 268c094ffde..098113ea971 100644
            --- a/storage/innobase/lock/lock0lock.cc
            +++ b/storage/innobase/lock/lock0lock.cc
            @@ -1853,8 +1853,10 @@ dberr_t lock_wait(que_thr_t *thr)
               {
                 /* The lock has already been released or this transaction
                 was chosen as a deadlock victim: no need to wait */
            -    trx->error_state=
            -      trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS;
            +    if (trx->lock.was_chosen_as_deadlock_victim)
            +      trx->error_state= DB_DEADLOCK;
            +    else if (trx->error_state == DB_LOCK_WAIT)
            +      trx->error_state= DB_SUCCESS;
                 return trx->error_state;
               }
             
            

            I tested it with the following:

            diff --git a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test
            index 2bbe2b3b139..5dbeb9b6a12 100644
            --- a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test
            +++ b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test
            @@ -11,7 +11,7 @@ BEGIN;
             # infinite timeout
             SET innodb_lock_wait_timeout=100000000;
             let $ID= `SELECT @id := CONNECTION_ID()`;
            -SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed';
            +SET DEBUG_SYNC='lock_wait_start SIGNAL locking WAIT_FOR proceed';
             send SELECT * FROM t FOR UPDATE;
             
             connection default;
            

            Between the execution of lock_wait() getting past the lock_wait_start point and the server shutdown, there was no call to trx_is_interrupted(). The trx->error_state (of DB_INTERRUPTED, set by innobase_kill_query()) is being returned “as is”.

            marko Marko Mäkelä added a comment - vlad.lesin , thank you, your suggestion makes sense. The following would implement it: diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc index 268c094ffde..098113ea971 100644 --- a/storage/innobase/lock/lock0lock.cc +++ b/storage/innobase/lock/lock0lock.cc @@ -1853,8 +1853,10 @@ dberr_t lock_wait(que_thr_t *thr) { /* The lock has already been released or this transaction was chosen as a deadlock victim: no need to wait */ - trx->error_state= - trx->lock.was_chosen_as_deadlock_victim ? DB_DEADLOCK : DB_SUCCESS; + if (trx->lock.was_chosen_as_deadlock_victim) + trx->error_state= DB_DEADLOCK; + else if (trx->error_state == DB_LOCK_WAIT) + trx->error_state= DB_SUCCESS; return trx->error_state; } I tested it with the following: diff --git a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test index 2bbe2b3b139..5dbeb9b6a12 100644 --- a/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test +++ b/mysql-test/suite/innodb/t/innodb_lock_wait_timeout.test @@ -11,7 +11,7 @@ BEGIN; # infinite timeout SET innodb_lock_wait_timeout=100000000; let $ID= `SELECT @id := CONNECTION_ID()`; -SET DEBUG_SYNC='lock_rec SIGNAL locking WAIT_FOR proceed'; +SET DEBUG_SYNC='lock_wait_start SIGNAL locking WAIT_FOR proceed'; send SELECT * FROM t FOR UPDATE; connection default; Between the execution of lock_wait() getting past the lock_wait_start point and the server shutdown, there was no call to trx_is_interrupted() . The trx->error_state (of DB_INTERRUPTED , set by innobase_kill_query() ) is being returned “as is”.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.