The fundamental assumption behind parallel replication is the following: if two
transactions T1 and T2 commit in parallel on the master (that is, they are
group-committed together), then they can be executed in parallel on the slave
without any risk of conflicting locks. The slave will commit them in the same
order (e.g. T1 first, T2 second), so it is critical that T2 does not take any
lock that would block T1; otherwise a deadlock occurs.
Unfortunately, this assumption turns out to be invalid.
Consider this table and two transactions T1, T2:
If T1 runs first and then T2, there is no blocking, and they can group commit
together. But if T2 runs first, it takes a gap lock on the index on b,
which blocks the insert of a row with b=NULL.
Thus, the bug occurs as follows: the transactions run in T1, T2 order on the
master and group commit together, so the slave tries to run them in parallel.
On the slave, T2 happens to take the gap lock first, so T1 waits for T2 to
commit, while T2 waits for T1 to commit first (to preserve the commit order).
The result is a deadlock.
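The cycle can be made concrete with a small wait-for graph. The following is a
minimal Python sketch, not InnoDB or replication code: the names `find_cycle`
and `wait_for` are hypothetical, and the edges simply encode the two waits
described above (T1 waiting on T2's gap lock, T2 waiting on T1's commit).

```python
def find_cycle(wait_for):
    """Return a list of transactions forming a wait cycle, or None.

    wait_for maps each transaction name to the set of transactions
    it is currently waiting on (hypothetical model, not InnoDB internals).
    """
    visiting, done = set(), set()

    def visit(node, path):
        if node in done:
            return None
        if node in visiting:
            # Back edge found: the cycle is the tail of the current path.
            return path[path.index(node):]
        visiting.add(node)
        for nxt in sorted(wait_for.get(node, ())):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        visiting.discard(node)
        done.add(node)
        return None

    for start in sorted(wait_for):
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


# Slave, bad interleaving: T2 took the gap lock first.
slave_waits = {
    "T1": {"T2"},  # T1 waits for T2's gap lock to be released at commit
    "T2": {"T1"},  # T2 may not commit before T1 (in-order commit)
}

# Master order: T1 ran first, so only the commit-order wait exists.
master_order_waits = {"T2": {"T1"}}

print(find_cycle(slave_waits))
print(find_cycle(master_order_waits))
```

In this model the deadlock needs both edges: the in-order commit rule alone is
harmless, and it only becomes a cycle once T1 also has to wait on a lock held
by T2.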
Another example of this is with an UPDATE and a DELETE:
Two possible solutions are being considered:
1. Run the slaves in READ COMMITTED mode. However, this means that the binlog
may not be serialised correctly if there are multiple master connections
(multi-source replication) and/or users doing direct updates on the slave that
happen to run in parallel with conflicting gap locks.
2. Modify InnoDB locking so that two transactions that run in parallel due to
group commit on the master do not wait for each other's gap locks, but
still use gap locks normally with respect to other transactions. However,
this requires a rather risky modification of InnoDB locking that needs to be
fully assessed for correctness.
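The rule proposed in option 2 could be sketched roughly as below. This is an
illustrative Python model under stated assumptions, not a proposal for the
actual InnoDB change: `Trx`, `commit_group` and `must_wait_for_gap_lock` are
hypothetical names, and the sketch covers only gap locks (record locks would
still conflict as usual).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trx:
    name: str
    # Hypothetical identifier of the master group commit this transaction
    # belonged to; None for transactions that did not come from the
    # parallel replication applier (e.g. direct user updates on the slave).
    commit_group: Optional[int] = None


def must_wait_for_gap_lock(requester: Trx, holder: Trx) -> bool:
    """Decide whether requester has to wait on a gap lock held by holder.

    Transactions from the same master group commit skip the wait;
    every other pair keeps the normal gap-lock behaviour.
    """
    same_group = (
        requester.commit_group is not None
        and requester.commit_group == holder.commit_group
    )
    return not same_group


t1 = Trx("T1", commit_group=7)
t2 = Trx("T2", commit_group=7)
later = Trx("T3", commit_group=8)
local_user = Trx("local")

print(must_wait_for_gap_lock(t1, t2))          # same group commit
print(must_wait_for_gap_lock(later, t2))       # different group commit
print(must_wait_for_gap_lock(local_user, t2))  # direct update on the slave
```

The sketch only shows where the exemption would apply; whether skipping the
wait is actually safe is exactly the correctness question raised in option 2.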
This bug is one of the problems reported in
Here is a test case. It may need to be run multiple times to trigger the error: