MariaDB Server / MDEV-6775

Wrong binlog order in parallel replication




      Parallel replication uses the wait_for_commit facility to ensure that binlog
      order on the slave is the same as on the master.

      However, there is a bug in this for statements such as GRANT, which are
      written directly into the binlog rather than going through the group commit
      code.

      First, the code for directly writing to binlog, in MYSQL_BIN_LOG::write(), was
      missing a call to wait_for_prior_commit(), so such writes could happen
      completely independently of earlier commits. I have a patch to add this
      missing call; however, it does not completely solve the problem.
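
      The ordering that wait_for_prior_commit() is meant to enforce can be
      illustrated with a small toy model in Python. This is only a sketch, not
      MariaDB code: the WaitForCommit class and replay_in_order() are hypothetical,
      and unlike the real group commit the wakeup here happens only after the write.

```python
import threading


class WaitForCommit:
    """Hypothetical stand-in for the wait_for_commit facility."""

    def __init__(self, prior=None):
        self.prior = prior                  # the transaction that must commit first
        self._committed = threading.Event()

    def wait_for_prior_commit(self):
        # Block until the previous transaction has signalled its commit.
        if self.prior is not None:
            self.prior._committed.wait()

    def wakeup_subsequent_commits(self):
        self._committed.set()


def replay_in_order(n):
    """Run n worker threads; each writes its number to a shared 'binlog'."""
    binlog = []
    log_lock = threading.Lock()             # plays the role of LOCK_log
    waiters = []
    prior = None
    for _ in range(n):
        prior = WaitForCommit(prior)
        waiters.append(prior)

    def worker(seq_no):
        wfc = waiters[seq_no]
        wfc.wait_for_prior_commit()         # the call the direct-write path lacked
        with log_lock:
            binlog.append(seq_no)
        wfc.wakeup_subsequent_commits()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in reversed(threads):             # start the latest transaction first
        t.start()
    for t in threads:
        t.join()
    return binlog
```

      Even though the threads are started in reverse order, each one blocks until
      its predecessor has written, so the resulting list is in commit order.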

      The group commit works in the following way. The first thread registers itself
      as the leader in the group commit queue. Then it wakes up any following
      threads that may be waiting for it to commit, even though the commit has not
      happened yet. When a later thread reaches the group commit code, it notices
      that there is already a leader. So it does not attempt to commit itself;
      instead, it just adds itself to the queue. So despite being woken up too early,
      the later thread will be committed in order because there is only one leader.
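
      The leader/participant behaviour described above can be sketched with a
      deliberately simplified, single-threaded Python model. The GroupCommitQueue
      class and its method names are hypothetical and only mirror the description;
      they are not the server's data structures.

```python
class GroupCommitQueue:
    """Toy model of a group commit queue with a single leader."""

    def __init__(self):
        self.queue = []     # transactions waiting to be written, in arrival order
        self.binlog = []    # what has actually been written

    def enqueue(self, trx):
        """Add trx to the queue; return True if it became the leader."""
        self.queue.append(trx)
        return len(self.queue) == 1

    def leader_commit(self):
        """The leader writes every queued transaction, in queue order."""
        group, self.queue = self.queue, []
        self.binlog.extend(group)
        return group


if __name__ == "__main__":
    q = GroupCommitQueue()
    q.enqueue("T1")         # T1 becomes leader and wakes T2 before writing
    q.enqueue("T2")         # T2 sees a leader already exists and only queues up
    q.leader_commit()
    print(q.binlog)
```

      Because only the leader writes, and it writes in queue order, the early
      wakeup of T2 cannot reorder the binlog in this model.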

      However, the directly written statements bypass this group commit code. This
      means that if they are woken up early, they can race a group commit leader for
      the LOCK_log mutex, and if they win the race they can write to the binlog
      ahead of the leader, causing the wrong binlog order.

      I think a possible solution is to make the direct write transactions also go
      through the group commit code. This would additionally let them benefit from
      potentially fewer fsync() calls, though that may be less important given how
      rare direct write statements are. To do this, the write-to-file part of
      MYSQL_BIN_LOG::write() must be pulled out into a separate function, which is
      then called in the non-direct case. In the direct case, the thread must then
      go through MYSQL_BIN_LOG::write_transaction_to_binlog_events() to either
      become the leader and do the group commit itself, or queue up as a
      participant. Finally, the MYSQL_BIN_LOG::write_transaction_to_binlog_events()
      code must be extended to be able to also handle the direct write case, calling
      the pulled-out function instead of flushing the binlog cache to the main
      binlog file.
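
      A rough Python sketch of this idea follows. It is a toy model under stated
      assumptions, not the proposed patch: Entry, write_events_to_file() and
      flush_cache_to_file() are hypothetical names, and only
      write_transaction_to_binlog_events() borrows its name from the server code.

```python
class Entry:
    """One queued unit of work: a transaction or a direct-write statement."""

    def __init__(self, events, direct):
        self.events = events    # list of events to write
        self.direct = direct    # True for statements such as GRANT


class Binlog:
    def __init__(self):
        self.file = []          # contents of the main binlog file
        self.queue = []         # group commit queue

    def write_events_to_file(self, events):
        # Hypothetical name for the function pulled out of write().
        self.file.extend(events)

    def flush_cache_to_file(self, cache):
        # Normal transactions: copy the binlog cache into the main file.
        self.file.extend(cache)

    def write_transaction_to_binlog_events(self, entry):
        """Queue the entry; return True if the caller became the leader."""
        self.queue.append(entry)
        return len(self.queue) == 1

    def leader_commit(self):
        """The leader writes all queued entries in order, choosing the writer."""
        group, self.queue = self.queue, []
        for entry in group:
            if entry.direct:
                self.write_events_to_file(entry.events)
            else:
                self.flush_cache_to_file(entry.events)
```

      With both kinds of entry in the same queue, a direct write can no longer
      overtake the leader, because the leader is the only writer.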

      This solution will need some work, but sounds like a possibility.

      The problem occurs only as a rare race, but it can be triggered a few times an
      hour with an RQG (Random Query Generator) test like this:

          perl ./runall-new.pl \
            --grammar=conf/replication/replication-ddl_sql.yy \
            --gendata=conf/replication/replication-ddl_data.zz \
            --redefine=conf/mariadb/general-workarounds.yy \
            --threads=8 --duration=600 --queries=100M \
            --rpl_mode=row \
            --mysqld=--slave-parallel-threads=64 \
            --mysqld=--slave-parallel-mode=domain,transactional \
            --mysqld=--log-bin=mysql-bin \
            --mysqld=--log-slave-updates \
            --mysqld=--binlog-format=row \
            --mysqld=--gtid-strict-mode=1 \
            --engine=InnoDB --use-gtid=slave_pos \
            --basedir=$HOME/my/10.0/work-10.0-mdev6676/bld \
            --vardir=/dev/shm/a

      With gtid-strict-mode enabled, the slave fails as soon as an out-of-order
      binlog write is about to happen.
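
      The kind of check that makes the failure visible can be sketched as follows.
      This is a simplified Python model, assuming that strict mode rejects a GTID
      whose sequence number is not higher than the last one binlogged in the same
      replication domain; StrictBinlog and OutOfOrderGtid are hypothetical names.

```python
class OutOfOrderGtid(Exception):
    """Raised when a write would put GTIDs out of order within a domain."""


class StrictBinlog:
    def __init__(self):
        self.last_seq_no = {}   # domain_id -> highest seq_no written so far

    def write(self, domain_id, seq_no):
        last = self.last_seq_no.get(domain_id)
        if last is not None and seq_no <= last:
            raise OutOfOrderGtid(
                "domain %d: seq_no %d after %d" % (domain_id, seq_no, last)
            )
        self.last_seq_no[domain_id] = seq_no
```

      In this model a late write of sequence number 11 after 12 in the same domain
      raises an error instead of silently producing a misordered binlog.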




            knielsen Kristian Nielsen