Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.0.9
-
None
Description
The fundamental assumption with parallel replication is the following: If two
transactions T1 and T2 commit in parallel on the master (they are
group-committed together), then they can be executed in parallel on the slave
without any risk of conflicting locks. The slave will commit them in the same
order (eg. T1 first T2 second), so it is critical that T2 will not take any
locks that would block T1, or a deadlock occurs.
Unfortunately, this assumption turns out to be invalid.
Consider this table and two transactions T1, T2:
CREATE TABLE t1 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
|
INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
|
T1: INSERT INTO t1 VALUES (7, NULL);
|
T2: DELETE FROM t1 WHERE b <= 3;
|
If T1 runs first and then T2, there is no blocking, and they can group commit
together. But if T2 runs first, then it takes a gap lock on the index on b
which blocks the insert of a row with B=NULL.
Thus, the bug is when they run in T1,T2 order on the master, they group commit,
slave tries to run them in parallel. T2 happens to take the gap lock first, T1
waits for T2 to commit, then T2 waits for T1 to commit -> deadlock.
Another example of this is with an UPDATE and a DELETE:
UPDATE t1 SET secondary=NULL WHERE primary=1
|
DELETE t1 WHERE secondary <= 3
|
Two possible solutions are being considered:
1. Run the slaves in READ COMMITTED mode. This however means that binlog may
not be serialised correctly if there are multiple multi-source master
connections and/or users doing direct updates on the slave, which happen to
run in parallel with conflicting gap locks.
2. Modify InnoDB locking so that two transactions that run in parallel due to
group commit on the master will not wait for the gap lock of each other, but
will still use the gap locks normally with respect to other
transactions. This however requires a rather risky modification of InnoDB
locking that needs to be fully assessed for correctness
This bug is one of the problems reported in MDEV-5863.
Here is a test case. It may need to be run multiple times to trigger the error:
--source include/have_innodb.inc
|
--source include/have_debug_sync.inc
|
--source include/have_debug_sync.inc
|
--source include/master-slave.inc
|
|
--connection slave
|
--source include/stop_slave.inc
|
|
--connection master
|
CREATE TABLE t1 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
|
INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
|
|
# Create a group commit with INSERT and DELETE, in that order.
|
--connect (con1,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
|
SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued1 WAIT_FOR master_cont1';
|
send INSERT INTO t1 VALUES (7, NULL);
|
--connection master
|
SET debug_sync='now WAIT_FOR master_queued1';
|
|
--connect (con2,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
|
SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued2';
|
send DELETE FROM t1 WHERE b <= 3;
|
|
--connection master
|
SET debug_sync='now WAIT_FOR master_queued2';
|
SET debug_sync='now SIGNAL master_cont1';
|
|
--connection con1
|
REAP;
|
--connection con2
|
REAP;
|
SET debug_sync='RESET';
|
--save_master_pos
|
|
--connection slave
|
SET @old_parallel= @@GLOBAL.slave_parallel_threads;
|
SET GLOBAL slave_parallel_threads=16;
|
--source include/start_slave.inc
|
--sync_with_master
|
|
SELECT * FROM t1 ORDER BY a;
|
|
--source include/stop_slave.inc
|
SET GLOBAL slave_parallel_threads=@old_parallel;
|
--source include/start_slave.inc
|
|
--connection master
|
DROP TABLE t1;
|
--source include/rpl_end.inc
|
Attachments
Issue Links
- relates to
-
MDEV-6063 overlocking next-record in REPEATABLE-READ with delete from where x <= num
- Closed
-
MDEV-5863 "Lock wait timeout exceeded", lost events, Assertion `!table || !table->in_use || ta ble->in_use == _current_thd()' failure upon parallel replicating
- Closed
-
MDEV-5941 Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on DML with slave parallel threads, SBR
- Closed