MariaDB Server / MDEV-6775

Wrong binlog order in parallel replication

Details

    Description

      Parallel replication uses the wait_for_commit facility to ensure that binlog
      order on the slave is the same as on the master.

      However, there is a bug in this for statements such as GRANT, which are
      written directly into the binlog rather than going through the group commit
      code.

      First, the code for directly writing to the binlog, in MYSQL_BIN_LOG::write(), was
      missing a call to wait_for_prior_commit(), so these statements could happen
      completely independently of earlier commits. I have a patch to add this missing
      call; however, it does not completely solve the problem.
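
      For orientation, here is a minimal sketch of where such a call would sit in the
      direct-write path. Only MYSQL_BIN_LOG::write(), wait_for_prior_commit() and
      LOCK_log are names from this report; the surrounding structure is simplified and
      is not the actual server code.

          /*
            Simplified sketch, not actual MariaDB code: the direct-write path with
            the proposed wait_for_prior_commit() call added at the top, so a GRANT
            (or similar) replicated in parallel cannot start writing to the binlog
            before all earlier transactions have committed.
          */
          bool MYSQL_BIN_LOG::write(Log_event *event_info)
          {
            THD *thd= event_info->thd;

            /* Proposed addition: wait for earlier transactions in the parallel
               replication stream before touching the binlog at all. */
            if (thd->wait_for_commit_ptr)
              thd->wait_for_commit_ptr->wait_for_prior_commit(thd);

            mysql_mutex_lock(&LOCK_log);
            /* ... existing code: write the event and sync the binlog ... */
            mysql_mutex_unlock(&LOCK_log);
            return false;
          }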

      The group commit works in the following way. The first thread registers itself
      as the leader in the group commit queue. It then wakes up any following
      threads that may be waiting for it to commit, even though the commit has not
      happened yet. When a later thread reaches the group commit code, it notices
      that there is already a leader, so it does not attempt to commit itself;
      instead it just adds itself to the queue. So despite being woken up too early,
      the later thread will be committed in order, because there is only one leader.

      However, the directly written statements bypass this group commit code. This
      means that if they are woken up early, they can race a group commit leader for
      the LOCK_log mutex, and if they win the race they can write to the binlog
      ahead of the leader, causing the wrong binlog order.
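
      The queue mechanism and the race it protects against can be modelled with a
      small self-contained sketch (plain C++ mutexes; all names below are made up for
      illustration and none of this is the server's actual code):

          #include <cstdio>
          #include <mutex>
          #include <vector>

          // Toy model of binlog group commit; illustrative names only.
          std::mutex LOCK_commit_queue;      // protects queue and leader flag
          std::mutex LOCK_log;               // serializes writes to the binlog
          std::vector<int> commit_queue;     // transactions queued in commit order
          bool have_leader= false;

          // Transactional path: the first thread to arrive becomes leader and later
          // writes the whole queue under LOCK_log; later arrivals only enqueue.
          // Being woken up "too early" is harmless here, because a woken thread can
          // do nothing except add itself behind the leader.
          void queue_for_group_commit(int trx_id)
          {
            bool am_leader;
            {
              std::lock_guard<std::mutex> g(LOCK_commit_queue);
              am_leader= !have_leader;
              have_leader= true;
              commit_queue.push_back(trx_id);
            }
            if (!am_leader)
              return;                        // participant: the leader writes for us
            std::lock_guard<std::mutex> log_guard(LOCK_log);
            std::lock_guard<std::mutex> queue_guard(LOCK_commit_queue);
            for (int id : commit_queue)
              std::printf("binlog: trx %d\n", id);   // written in queue (commit) order
            commit_queue.clear();
            have_leader= false;
          }

          // Direct-write path (GRANT etc.): bypasses the queue. A thread woken up
          // early can grab LOCK_log before the leader does, so its event ends up
          // ahead of earlier transactions in the binlog -- the wrong order.
          void direct_write(int trx_id)
          {
            std::lock_guard<std::mutex> g(LOCK_log);
            std::printf("binlog: trx %d (direct)\n", trx_id);
          }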

      I think a possible solution is to make the direct-write transactions also go
      through the group commit code. This would additionally let them benefit from
      potentially reduced fsync() calls, though that may matter less given the rarity
      of direct-write statements. So the write-to-file part of
      MYSQL_BIN_LOG::write() must be pulled out into a separate function, which is
      then called in the non-direct case. In the direct case, the thread must then
      go through MYSQL_BIN_LOG::write_transaction_to_binlog_events() to either
      become the leader and do the group commit itself, or queue up as a
      participant. Finally, the MYSQL_BIN_LOG::write_transaction_to_binlog_events()
      code must be extended to be able to also handle the direct write case, calling
      the pulled-out function instead of flushing the binlog cache to the main
      binlog file.

      This solution will need some work, but sounds like a possibility.
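
      The shape of that refactoring might look roughly like the following. This is a
      hypothetical sketch only: group_commit_entry, write_transaction_to_binlog_events()
      and MYSQL_BIN_LOG::write() are mentioned above, but the helper name and the field
      layout are invented for illustration.

          /* Hypothetical sketch of the proposed refactoring, not actual server code. */
          struct Log_event;                        // stand-in for the real class

          struct group_commit_entry
          {
            Log_event *direct_event;               // set for directly-written statements
            /* ... otherwise the entry refers to the transaction's binlog cache ... */
          };

          class MYSQL_BIN_LOG
          {
            /* The write-to-file part pulled out of ::write(), so it can be called
               by the group commit leader as well. */
            bool write_direct_event_to_log(Log_event *ev);

            /* Existing leader/participant queueing; the leader, while draining the
               queue, would call write_direct_event_to_log() for direct entries
               instead of flushing a binlog cache into the main binlog file. */
            bool queue_for_group_commit(group_commit_entry *entry);

          public:
            bool write_transaction_to_binlog_events(group_commit_entry *entry)
            {
              return queue_for_group_commit(entry);
            }

            /* Directly-written statements no longer write the file themselves; they
               build an entry and join group commit, so ordering (and fsync()
               batching) is handled in one place. */
            bool write(Log_event *event_info)
            {
              group_commit_entry entry;
              entry.direct_event= event_info;
              return write_transaction_to_binlog_events(&entry);
            }
          };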

      The problem occurs only as a rare race, but it can be triggered a few times an
      hour with an rqg test like this:

          perl ./runall-new.pl --grammar=conf/replication/replication-ddl_sql.yy \
            --gendata=conf/replication/replication-ddl_data.zz \
            --redefine=conf/mariadb/general-workarounds.yy \
            --threads=8 --duration=600 --queries=100M --rpl_mode=row \
            --mysqld=--slave-parallel-threads=64 \
            --mysqld=--slave-parallel-mode=domain,transactional \
            --mysqld=--log-bin=mysql-bin --mysqld=--log-slave-updates \
            --mysqld=--binlog-format=row --mysqld=--gtid-strict-mode=1 \
            --engine=InnoDB --use-gtid=slave_pos \
            --basedir=$HOME/my/10.0/work-10.0-mdev6676/bld --vardir=/dev/shm/a

      The gtid-strict-mode makes the slave fail when an out-of-order binlog write is
      about to happen.

      Attachments

        Activity

          knielsen Kristian Nielsen added a comment -

          Maybe another fix could be to not wake up other transactions until after the
          commit is complete?
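
          A minimal sketch of that idea (illustrative C++ only; these names are made up
          and this is not the server's code): the leader flips a flag and signals
          waiting transactions only after its binlog write has finished, instead of at
          queue-registration time.

              #include <condition_variable>
              #include <mutex>

              // Illustrative only: wake followers after the write, not before it.
              std::mutex LOCK_wakeup;
              std::condition_variable COND_wakeup;
              bool prior_commit_done= false;

              void leader_write_and_wake()
              {
                /* ... write the whole group commit queue to the binlog ... */
                {
                  std::lock_guard<std::mutex> g(LOCK_wakeup);
                  prior_commit_done= true;   // flip the flag only after the write
                }
                COND_wakeup.notify_all();    // followers can no longer race the write
              }

              void follower_wait_for_prior_commit()
              {
                std::unique_lock<std::mutex> g(LOCK_wakeup);
                COND_wakeup.wait(g, []{ return prior_commit_done; });
              }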

          I found another related problem, actually seen as a very rare race/failure
          in test case rpl.rpl_parallel:

          rpl.rpl_parallel 'row,xtradb'            w1 [ fail ]
           
          CURRENT_TEST: rpl.rpl_parallel
          --- /home/knielsen/my/10.0/work-10.0-mdev6676/mysql-test/suite/rpl/r/rpl_parallel.result	2014-09-05 14:22:34.244677000 +0200
          +++ /home/knielsen/my/10.0/work-10.0-mdev6676/mysql-test/suite/rpl/r/rpl_parallel.reject	2014-10-02 16:55:51.201110361 +0200
          @@ -826,7 +826,7 @@
           3	NULL
           4	4
           5	NULL
          -6	NULL
          +6	6
           SET @last_gtid= 'GTID';
           SELECT IF(@@gtid_slave_pos LIKE CONCAT('%',@last_gtid,'%'), "GTID found ok",
           CONCAT("GTID ", @last_gtid, " not found in gtid_slave_pos=", @@gtid_slave_pos))

          Here we have two transactions:

          UPDATE t4 SET b=NULL WHERE a=6;
          DELETE FROM t4 WHERE b <= 1;

          The failure suggests that the slave sees the DELETE but not the UPDATE. In
          fact the DELETE does not modify any rows in this case, and so (I think) is
          not binlogged in row mode. It therefore seems plausible that it could be
          woken up early during group commit of the UPDATE, and the slave could
          complete the DELETE and update the slave position before the UPDATE is
          binlogged and committed.

          In GTID mode this would actually be a bug, since in the case of a crash we
          could lose the UPDATE.

          knielsen Kristian Nielsen added a comment -

          I now have a patch for this that looks good.
          Test case: http://lists.askmonty.org/pipermail/commits/2014-October/006782.html
          Patch: http://lists.askmonty.org/pipermail/commits/2014-October/006786.html

          knielsen Kristian Nielsen added a comment (edited) -

          Pushed to 10.0.15: http://lists.askmonty.org/pipermail/commits/2014-November/006976.html

          People

            Assignee: knielsen Kristian Nielsen
            Reporter: knielsen Kristian Nielsen
            Votes: 0
            Watchers: 0

