Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-5914

Parallel replication deadlock due to InnoDB lock conflicts

    XMLWordPrintable

Details

    Description

      The fundamental assumption with parallel replication is the following: If two
      transactions T1 and T2 commit in parallel on the master (they are
      group-committed together), then they can be executed in parallel on the slave
      without any risk of conflicting locks. The slave will commit them in the same
      order (eg. T1 first T2 second), so it is critical that T2 will not take any
      locks that would block T1, or a deadlock occurs.

      Unfortunately, this assumption turns out to be invalid.

      Consider this table and two transactions T1, T2:

      CREATE TABLE t1 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
      T1: INSERT INTO t1 VALUES (7, NULL);
      T2: DELETE FROM t1 WHERE b <= 3;

      If T1 runs first and then T2, there is no blocking, and they can group commit
      together. But if T2 runs first, then it takes a gap lock on the index on b
      which blocks the insert of a row with B=NULL.

      Thus, the bug is when they run in T1,T2 order on the master, they group commit,
      slave tries to run them in parallel. T2 happens to take the gap lock first, T1
      waits for T2 to commit, then T2 waits for T1 to commit -> deadlock.

      Another example of this is with an UPDATE and a DELETE:

      UPDATE t1 SET secondary=NULL WHERE primary=1
      DELETE t1 WHERE secondary <= 3

      Two possible solutions are being considered:

      1. Run the slaves in READ COMMITTED mode. This however means that binlog may
      not be serialised correctly if there are multiple multi-source master
      connections and/or users doing direct updates on the slave, which happen to
      run in parallel with conflicting gap locks.

      2. Modify InnoDB locking so that two transactions that run in parallel due to
      group commit on the master will not wait for the gap lock of each other, but
      will still use the gap locks normally with respect to other
      transactions. This however requires a rather risky modification of InnoDB
      locking that needs to be fully assessed for correctness

      This bug is one of the problems reported in MDEV-5863.

      Here is a test case. It may need to be run multiple times to trigger the error:

      --source include/have_innodb.inc
      --source include/have_debug_sync.inc
      --source include/have_debug_sync.inc
      --source include/master-slave.inc
       
      --connection slave
      --source include/stop_slave.inc
       
      --connection master
      CREATE TABLE t1 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
       
      # Create a group commit with INSERT and DELETE, in that order.
      --connect (con1,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
      SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued1 WAIT_FOR master_cont1';
      send INSERT INTO t1 VALUES (7, NULL);
      --connection master
      SET debug_sync='now WAIT_FOR master_queued1';
       
      --connect (con2,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
      SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued2';
      send DELETE FROM t1 WHERE b <= 3;
       
      --connection master
      SET debug_sync='now WAIT_FOR master_queued2';
      SET debug_sync='now SIGNAL master_cont1';
       
      --connection con1
      REAP;
      --connection con2
      REAP;
      SET debug_sync='RESET';
      --save_master_pos
       
      --connection slave
      SET @old_parallel= @@GLOBAL.slave_parallel_threads;
      SET GLOBAL slave_parallel_threads=16;
      --source include/start_slave.inc
      --sync_with_master
       
      SELECT * FROM t1 ORDER BY a;
       
      --source include/stop_slave.inc
      SET GLOBAL slave_parallel_threads=@old_parallel;
      --source include/start_slave.inc
       
      --connection master
      DROP TABLE t1;
      --source include/rpl_end.inc

      Attachments

        Issue Links

          Activity

            People

              knielsen Kristian Nielsen
              knielsen Kristian Nielsen
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.