[MDEV-6775] Wrong binlog order in parallel replication - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.0.14
Fix Version/s: 10.0.15
Component/s: Replication
Labels:
- parallelslave
- replication

Description

Parallel replication uses the wait_for_commit facility to ensure that binlog
order on the slave is the same as on the master.

However, this mechanism has a bug for statements such as GRANT, which are
written directly into the binlog rather than going through the group commit
code.

First, the code for directly writing to the binlog, in MYSQL_BIN_LOG::write(),
was missing a call to wait_for_prior_commit(), so such writes could happen
completely independently of earlier commits. I have a patch to add this
missing call; however, it does not completely solve the problem.
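As a rough illustration of what the missing call is meant to guarantee, here
is a minimal Python sketch. It is not the server code: the Txn class, the
direct_write() helper and run_reversed() are hypothetical names, and the model
assumes each writer waits until its predecessor has fully committed (it does
not model the early wakeup discussed further down).

```python
import threading


class Txn:
    """Hypothetical stand-in for one replicated transaction on the slave."""

    def __init__(self, seq_no, prior=None):
        self.seq_no = seq_no
        self.prior = prior                  # transaction that must commit first
        self.committed = threading.Event()  # set once this one is in the binlog

    def wait_for_prior_commit(self):
        # Block until the preceding transaction (if any) has committed.
        if self.prior is not None:
            self.prior.committed.wait()


def direct_write(txn, binlog, log_lock):
    """Model of a direct binlog write that includes the wait."""
    txn.wait_for_prior_commit()
    with log_lock:                          # plays the role of LOCK_log
        binlog.append(txn.seq_no)
    txn.committed.set()


def run_reversed(n):
    """Start n writers in reverse order and return the resulting binlog."""
    binlog, log_lock = [], threading.Lock()
    txns, prior = [], None
    for seq_no in range(1, n + 1):
        prior = Txn(seq_no, prior)
        txns.append(prior)
    threads = [
        threading.Thread(target=direct_write, args=(t, binlog, log_lock))
        for t in reversed(txns)
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return binlog
```

Even though the writers are started last-to-first, the wait forces the modelled
binlog into sequence order. Without the call to wait_for_prior_commit() in
direct_write(), the order would depend on thread scheduling.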

The group commit works in the following way. The first thread registers itself
as the leader in the group commit queue. Then it wakes up any following
threads that may be waiting for it to commit, even though the commit has not
happened yet. When a later thread reaches the group commit code, it notices
that there is already a leader, so it does not attempt to commit itself;
instead it just adds itself to the queue. So despite being woken up too early,
the later thread will be committed in order because there is only one leader.
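The leader/participant behaviour described above can be pictured with a small
single-threaded model. This is only a sketch under simplifying assumptions: the
GroupCommitQueue class and its method names are hypothetical, and the real
code deals with locking and wakeups that are left out here.

```python
class GroupCommitQueue:
    """Toy model of a group commit queue with a single leader."""

    def __init__(self):
        self.queue = []    # transactions waiting to be written
        self.binlog = []   # what has been written, in order

    def enqueue(self, name):
        """Add a transaction; return True if it became the leader."""
        is_leader = not self.queue   # first one in an empty queue leads
        self.queue.append(name)
        return is_leader

    def leader_commit(self):
        """Leader writes the whole queue to the binlog in queue order."""
        batch, self.queue = self.queue, []
        self.binlog.extend(batch)
        return batch


q = GroupCommitQueue()
first_is_leader = q.enqueue("T1")    # T1 registers as leader
second_is_leader = q.enqueue("T2")   # T2 was woken early, but only queues up
written = q.leader_commit()          # the leader writes T1, then T2
```

Because only the leader writes, the early wakeup of T2 is harmless in this
model: T2 can do nothing except append itself behind T1.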

However, the directly written statements bypass this group commit code. This
means that if they are woken up early, they can race a group commit leader for
the LOCK_log mutex, and if they win the race they can write to the binlog
ahead of the leader, causing the wrong binlog order.
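The two possible outcomes of that race can be enumerated deterministically.
The sketch below assumes a master that binlogged an UPDATE followed by a
GRANT; the replay() function and the "leader"/"direct" labels are hypothetical
and stand in for the order in which the two threads acquire LOCK_log.

```python
MASTER_ORDER = ["UPDATE", "GRANT"]   # order in which the master binlogged them


def replay(lock_order):
    """Return the slave binlog for a given LOCK_log acquisition order."""
    binlog = []
    for holder in lock_order:
        if holder == "leader":
            binlog.append("UPDATE")   # leader writes its group commit queue
        elif holder == "direct":
            binlog.append("GRANT")    # direct write bypasses the queue
        else:
            raise ValueError(f"unknown lock holder: {holder!r}")
    return binlog


leader_wins = replay(["leader", "direct"])   # same order as the master
direct_wins = replay(["direct", "leader"])   # GRANT lands ahead of UPDATE
```

Only the schedule in which the leader takes the mutex first reproduces the
master's order; the other schedule is the failure described in this report.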

I think a possible solution is to make the direct write transactions also go
through the group commit code. This would additionally let them benefit from
potentially fewer fsync() calls, though that may matter less because direct
write statements are rare. To do this, the write-to-file part of
MYSQL_BIN_LOG::write() must be pulled out into a separate function, which is
then called in the non-direct case. In the direct case, the thread must then
go through MYSQL_BIN_LOG::write_transaction_to_binlog_events() to either
become the leader and do the group commit itself, or queue up as a
participant. Finally, the MYSQL_BIN_LOG::write_transaction_to_binlog_events()
code must be extended to also handle the direct write case, calling the
pulled-out function instead of flushing the binlog cache to the main binlog
file.
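A rough sketch of the shape of this proposal, in Python rather than the
server's C++: every name below (Entry, BinlogModel, write_direct_event,
flush_cache, run_group_commit) is hypothetical and only illustrates the idea
of one queue whose leader picks the write path per entry. It is not the patch
that was eventually pushed.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    name: str
    direct: bool   # True for statements like GRANT that are written directly


@dataclass
class BinlogModel:
    events: list = field(default_factory=list)
    queue: list = field(default_factory=list)
    fsyncs: int = 0

    def write_direct_event(self, entry):
        # Stand-in for the function pulled out of the direct write path.
        self.events.append(("direct", entry.name))

    def flush_cache(self, entry):
        # Stand-in for flushing a transaction's binlog cache to the file.
        self.events.append(("cache", entry.name))

    def enqueue(self, entry):
        """Queue an entry; return True if the caller became the leader."""
        is_leader = not self.queue
        self.queue.append(entry)
        return is_leader

    def run_group_commit(self):
        """Leader writes every queued entry in order, then syncs once."""
        batch, self.queue = self.queue, []
        for entry in batch:
            if entry.direct:
                self.write_direct_event(entry)
            else:
                self.flush_cache(entry)
        self.fsyncs += 1   # one sync for the whole group
        return [entry.name for entry in batch]
```

In this model a GRANT queued behind an UPDATE is written after it, and both
share a single sync, which is the ordering and fsync() benefit the paragraph
above describes.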

This solution will need some work, but sounds like a possibility.

The problem occurs only as a rare race, but it can be triggered a few times an
hour with an RQG test like this:

    perl ./runall-new.pl \
      --grammar=conf/replication/replication-ddl_sql.yy \
      --gendata=conf/replication/replication-ddl_data.zz \
      --redefine=conf/mariadb/general-workarounds.yy \
      --threads=8 --duration=600 --queries=100M --rpl_mode=row \
      --mysqld=--slave-parallel-threads=64 \
      --mysqld=--slave-parallel-mode=domain,transactional \
      --mysqld=--log-bin=mysql-bin \
      --mysqld=--log-slave-updates \
      --mysqld=--binlog-format=row \
      --mysqld=--gtid-strict-mode=1 \
      --engine=InnoDB --use-gtid=slave_pos \
      --basedir=$HOME/my/10.0/work-10.0-mdev6676/bld \
      --vardir=/dev/shm/a

The --gtid-strict-mode=1 option makes the slave fail when an out-of-order
binlog write is about to happen.
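The kind of check this relies on can be sketched as follows. This is a
simplified model, not the server's implementation: it assumes GTIDs of the
form domain-server-sequence and treats a sequence number that does not
increase within its domain as out of order. The function and exception names
are hypothetical.

```python
class OutOfOrderError(Exception):
    """Raised when a GTID would be binlogged out of order (model only)."""


def check_strict_order(gtids):
    """Verify that sequence numbers strictly increase within each domain.

    Returns a dict mapping each domain id to its last sequence number.
    """
    last_seq = {}
    for gtid in gtids:
        domain, _server, seq = (int(part) for part in gtid.split("-"))
        if seq <= last_seq.get(domain, 0):
            raise OutOfOrderError(
                f"{gtid} is not after sequence {last_seq[domain]} "
                f"in domain {domain}"
            )
        last_seq[domain] = seq
    return last_seq
```

For example, check_strict_order(["0-1-1", "0-1-2"]) passes, while
check_strict_order(["0-1-2", "0-1-1"]) raises OutOfOrderError, which is how an
early direct write would show up in the test above.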

Activity

Kristian Nielsen added a comment - 2014-10-06 16:44

Maybe another fix could be to not wake up other transactions until after the
commit is complete?

I found another related problem, actually seen as a very rare race/failure
in test case rpl.rpl_parallel:

    rpl.rpl_parallel 'row,xtradb'            w1 [ fail ]

    CURRENT_TEST: rpl.rpl_parallel
    --- /home/knielsen/my/10.0/work-10.0-mdev6676/mysql-test/suite/rpl/r/rpl_parallel.result	2014-09-05 14:22:34.244677000 +0200
    +++ /home/knielsen/my/10.0/work-10.0-mdev6676/mysql-test/suite/rpl/r/rpl_parallel.reject	2014-10-02 16:55:51.201110361 +0200
    @@ -826,7 +826,7 @@
     3	NULL
     4	4
     5	NULL
    -6	NULL
    +6	6
     SET @last_gtid= 'GTID';
     SELECT IF(@@gtid_slave_pos LIKE CONCAT('%',@last_gtid,'%'), "GTID found ok",
     CONCAT("GTID ", @last_gtid, " not found in gtid_slave_pos=", @@gtid_slave_pos))

Here we have two transactions:

    UPDATE t4 SET b=NULL WHERE a=6;
    DELETE FROM t4 WHERE b <= 1;

The failure suggests that the slave sees the DELETE but not the UPDATE. In
fact, the DELETE does not modify any rows in this case and, I think, is not
binlogged in row mode. So it seems plausible that it could be woken up early
during the group commit of the UPDATE, and that the slave could complete the
DELETE and update the slave position before the UPDATE is binlogged and
committed.

In GTID mode, this would actually be a bug, because after a crash we could
lose the UPDATE.
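To make the crash scenario concrete, here is a deliberately simplified model.
It assumes the UPDATE has sequence number 1 and the DELETE sequence number 2,
that the slave position after a crash is the highest sequence number that
completed, and that replication resumes after that position. The function
name and this numbering are assumptions made for illustration only.

```python
def lost_after_crash(completion_order, crash_after):
    """Return sequence numbers skipped on restart after a crash.

    completion_order: sequence numbers in the order they completed.
    crash_after: how many of them completed before the crash.
    """
    done = completion_order[:crash_after]
    if not done:
        return []
    slave_pos = max(done)   # position recorded by the latest completer
    # Anything at or below the recorded position that never completed
    # will not be re-applied when replication resumes.
    return [seq for seq in range(1, slave_pos + 1) if seq not in done]


in_order = lost_after_crash([1, 2], crash_after=1)       # UPDATE done first
out_of_order = lost_after_crash([2, 1], crash_after=1)   # DELETE done first
```

In the in-order case nothing is lost. When the DELETE completes first and the
crash happens before the UPDATE commits, the model reports sequence number 1,
the UPDATE, as lost.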

Kristian Nielsen added a comment - 2014-10-14 15:41

I now have a patch for this that looks good.

Test case: http://lists.askmonty.org/pipermail/commits/2014-October/006782.html
Patch: http://lists.askmonty.org/pipermail/commits/2014-October/006786.html

Kristian Nielsen added a comment - 2014-11-13 15:04 - edited

Pushed to 10.0.15:

http://lists.askmonty.org/pipermail/commits/2014-November/006976.html

People

Assignee: Kristian Nielsen
Reporter: Kristian Nielsen
Votes: 0
Watchers: 0

Dates

Created: 2014-09-24 15:11
Updated: 2014-11-13 15:12
Resolved: 2014-11-13 15:04
