Parallel replication uses the wait_for_commit facility to ensure that binlog
order on the slave is the same as on the master.
However, this does not work correctly for statements such as GRANT, which are
written directly into the binlog rather than going through the group commit
code.
First, the code that writes directly to the binlog, in MYSQL_BIN_LOG::write(),
was missing a call to wait_for_prior_commit(), so direct writes could happen
completely independently of earlier commits. I have a patch that adds the
missing call, but it does not completely solve the problem.
The group commit works in the following way. The first thread registers itself
as the leader in the group commit queue. Then it wakes up any following
threads that may be waiting for it to commit, even though the commit has not
happened yet. When a later thread reaches the group commit code, it notices
that there is already a leader, so it does not attempt to commit itself;
instead it just adds itself to the queue. So despite being woken up too early,
the later thread will be committed in-order because there is only one leader.
However, the directly written statements bypass this group commit code. This
means that if they are woken up early, they can race the group commit leader
for the LOCK_log mutex, and if they win the race they write to the binlog
ahead of the leader, producing the wrong binlog order.
I think a possible solution is to make the direct write transactions also go
through the group commit code. This would additionally let them benefit from
potentially fewer fsync() calls, though that may matter less because direct
write statements are rare. So the write-to-file part of
MYSQL_BIN_LOG::write() must be pulled out into a separate function, which is
then called in the non-direct case. In the direct case, the thread must then
go through MYSQL_BIN_LOG::write_transaction_to_binlog_events() to either
become the leader and do the group commit itself, or queue up as a
participant. Finally, the MYSQL_BIN_LOG::write_transaction_to_binlog_events()
code must be extended to be able to also handle the direct write case, calling
the pulled-out function instead of flushing the binlog cache to the main
binlog.
This solution will need some work, but sounds like a possibility.
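
A rough sketch of the shape this could take is shown below. It is a
single-threaded model under my own assumptions, not a patch: BinlogModel,
QueueEntry, EntryKind, write_event_to_log(), flush_cache_to_log(),
queue_for_group_commit() and run_group_commit() are all hypothetical names
standing in for the real code paths, and the syncs counter merely stands in
for fsync().

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tag: how the leader should write a given queue entry.
enum class EntryKind { CachedTransaction, DirectEvent };

struct QueueEntry {
    EntryKind kind;
    std::string payload;
};

struct BinlogModel {
    std::vector<QueueEntry> queue;  // group commit queue, in arrival order
    std::vector<std::string> log;   // resulting binlog order
    int syncs = 0;                  // stand-in for the number of fsync() calls

    // Stand-in for the write-to-file code pulled out of MYSQL_BIN_LOG::write().
    void write_event_to_log(const std::string &event) { log.push_back(event); }

    // Stand-in for flushing a transaction's binlog cache to the binlog.
    void flush_cache_to_log(const std::string &trx) { log.push_back(trx); }

    // Both kinds of entries queue up; the first one in becomes the leader.
    bool queue_for_group_commit(QueueEntry entry) {
        bool is_leader = queue.empty();
        queue.push_back(std::move(entry));
        return is_leader;
    }

    // Run by the leader: one pass over the queue, one sync for the group.
    void run_group_commit() {
        for (const QueueEntry &entry : queue) {
            if (entry.kind == EntryKind::DirectEvent)
                write_event_to_log(entry.payload);
            else
                flush_cache_to_log(entry.payload);
        }
        queue.clear();
        ++syncs;
    }
};
```

Because the direct write now sits in the same queue as ordinary transactions,
it can no longer overtake the leader, and it shares the group's single sync.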
The problem occurs only as a rare race, but it can be triggered a few times an
hour with an rqg test like this:
Running with gtid-strict-mode makes the slave fail as soon as an out-of-order
binlog write is about to happen.