Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31840

Parallel replication undetected deadlocks with outside transaction

    XMLWordPrintable

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.6.14
    • 10.6
    • Replication
    • None

    Description

      Consider (parallel) replicated transactions T1 and T2 that must commit in this order, along with extra transaction U that is not ordered with T1 and T2 (U can be a user-transaction or replicated in a separate domain_id). Suppose further that row locks cause U to wait on T2 and T1 to wait on U:

      T1 -> U -> T2

      Since T2 must wait_for_prior_commit on T1, this is a deadlock. It will not be caught by the InnoDB deadlock checker as this does not know about wait_for_prior_commit.

      In 10.4 this was resolved by a deadlock kill. This seems to be somewhat accidental. It looks like InnoDB 10.4 would traverse the wait-for graph in the deadlock detector and report to parallel replication all transitive waits in the graph, ie. in this case it will report both direct T1->U and indirect T1->T2. The latter will cause T2 to be deadlock killed and the deadlock resolved.

      In 10.6 these transitive waits seem to be no longer reported, at least from first testing (see below for testcase). This means that replication will hang until we get a lock wait timeout.

      It wasn't really intended in my original design that the storage engine would be required to report all transitive waits in the wait-for graph. It might be expensive to add, though on the other hand a lock wait is already expensive.

      I'm not sure how important this problem is either. Normally, on a slave, we would not expect there to be user (non-replicated) transactions, much less such that conflict with replicated transactions; nor are such conflicts expected between different GTID domain IDs. Still, it's not nice to have deadlocks that are not detected by the server.

      Of course, the root of the problem really is the lack of a server-wide deadlock detector, so that InnoDB and parallel replication each have only part of the picture, and neither can solve the problem 100%. I think there will be other similar cases of lack of deadlock detection eg. with milti-engine transactions.

      I have not decided what should be the resolution of this issue, but for now at least making sure that the issue is documented.

      Test case:

      --source include/have_innodb.inc
      --source include/have_binlog_format_row.inc
      --source include/master-slave.inc
       
      # Try to test the following scenario:
      # T1 and T2 replicated transactions must commit in this order.
      # U external transaction.
      # U ends up waiting for T2 on row lock, then T1 waits on U.
      # T2 will then wait_for_prior_commit on T1, but there's no deadlock inside
      # InnoDB. Will transitive lock wait reporting from Inno cause a deadlock
      # kill of T2, or will we hang until --innodb-lock-wait-timeout?
       
      --connection master
      ALTER TABLE mysql.gtid_slave_pos ENGINE=innodb;
      CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (1,0), (2,0), (3,0), (4,0), (5,0);
      CREATE TABLE t2 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
      INSERT INTO t2 VALUES (1,0), (2,0), (3,0), (4,0), (5,0);
      --sync_slave_with_master
       
      --source include/stop_slave.inc
      SET @old_wait_timeout= @@GLOBAL.innodb_lock_wait_timeout;
      SET GLOBAL innodb_lock_wait_timeout= 3;
      SET @old_parallel= @@GLOBAL.slave_parallel_threads;
      SET GLOBAL slave_parallel_threads=2;
      SET @old_mod= @@GLOBAL.slave_parallel_mode;
      SET GLOBAL slave_parallel_mode= optimistic;
      CHANGE MASTER TO master_use_gtid=slave_pos;
       
      --connection master
       
      # T1.
      BEGIN;
      UPDATE t2 SET b=b+1 WHERE a=1;
      UPDATE t1 SET b=b+1 WHERE a=1;
      COMMIT;
       
      # T2.
      BEGIN;
      UPDATE t2 SET b=b+1 WHERE a=2;
      UPDATE t1 SET b=b+1 WHERE a=3;
      COMMIT;
       
      SELECT * FROM t1 ORDER BY a;
      SELECT * FROM t2 ORDER BY a;
      --source include/save_master_gtid.inc
       
      # Something to block T1
      --let $rpl_connection_name= slave2
      --let $rpl_server_number= 2
      --source include/rpl_connect.inc
      --connection slave2
      BEGIN;
      UPDATE t2 SET b=2 WHERE a=1;
       
      --connection slave
      --source include/start_slave.inc
       
      # Wait for T2 to have locked its row.
      # ToDo: Proper wait.
      --sleep 0.5
       
      # U.
      --let $rpl_connection_name= slave3
      --let $rpl_server_number= 2
      --source include/rpl_connect.inc
      --connection slave3
      BEGIN;
      UPDATE t1 SET b=2 WHERE a=1;
      send UPDATE t1 SET b=2 WHERE a=3;
       
      --connection slave
      # Wait for U to be waiting.
      # ToDo: Proper wait.
      --sleep 0.5
       
      # Release T1 so we get the wait T1 -> U -> T2 -> T1.
      --connection slave2
      ROLLBACK;
       
      --connection slave3
      #--error ER_LOCK_WAIT_TIMEOUT
      reap;
      ROLLBACK;
       
      --connection slave
      --let $slave_timeout= 10
      --echo $master_pos
      --echo $slave_timeout
      eval SELECT master_gtid_wait('$master_pos', $slave_timeout);
      --source include/sync_with_master_gtid.inc
      SELECT * FROM t1 ORDER BY a;
      SELECT * FROM t2 ORDER BY a;
       
      # Cleanup
      --connection master
      DROP TABLE t1,t2;
      --source include/save_master_gtid.inc
       
      --connection slave
      --source include/sync_with_master_gtid.inc
      --source include/stop_slave.inc
      SET GLOBAL innodb_lock_wait_timeout= @old_wait_timeout;
      SET GLOBAL slave_parallel_threads= @old_parallel;
      SET GLOBAL slave_parallel_mode= @old_mod;
      --source include/start_slave.inc
       
      --source include/rpl_end.inc
      

      Attachments

        Issue Links

          Activity

            People

              knielsen Kristian Nielsen
              knielsen Kristian Nielsen
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.