Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33954

--slave-skip-errors=all Incompatible with Optimistic Parallel Replication

    XMLWordPrintable

Details

    Description

      After fixing MDEV-27512, there remains a bug with optimistic parallel replication, and deadlocks in row format.

      From knielsen:

      Now, --slave-skip-errors is obviously a dangerous option, but presumably the
      use case is to make the slave continue at all costs in case an unexpected
      error occurs, rather than stop the slave and break replication.

      But optimistic parallel replication by design can introduce many different
      transient errors due to conflicts, that are then handled by rolling back and
      retrying the offending transactions. These errors are expected, and they
      do not cause slave to stop or replication to break.

      However, it appears that --slave_skip_errors also affects such transient
      errors due to optimistic parallel replication, and will cause any such
      transaction to be silently ignored! This must be very wrong, it will cause
      massive replication divergence.

      It seems to me this is the real bug here. When an error is encountered
      during optimistic apply of a transaction (is_parallel_retry_error() returns
      true, eg. rgi->speculation == SPECULATE_OPTIMISTIC, then this error should
      not be subject to --slave-skip-errors. The transaction should be rolled
      back as normal, and wait_for_prior_commit() done. Then after setting
      rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error,
      --slave-skip-errors can apply.

      I put together a quick test that seems to show this behaviour, included
      below. This tests replicates correctly without replication stopping with
      error. But running it with --mysqld=--slave-skip-errors=all, it replicates
      incorrectly, skipping lots of transactions. The test is somewhat contrieved,
      but I think it shows the real problem, that --slave-skip-errors can randomly
      cause transactions to be skipped or not depending on if optimistic parallel
      replication triggers a matching transient error or not.

      So in summary, it looks like there is a real problem here, that optimistic
      parallel replication is not working correctly with --slave-skip-errors,
      transient errors incorrectly causes conflicts to skip transactions rather
      than retrying them. This will cause replication to diverge even when no real
      errors occur.

      --source include/have_innodb.inc
      --source include/have_binlog_format_row.inc
      --source include/master-slave.inc
       
      --connection master
      ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
      CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
       
      --sync_slave_with_master
       
      --source include/stop_slave.inc
      CHANGE MASTER TO master_use_gtid=slave_pos;
      SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;
      SET GLOBAL innodb_lock_wait_timeout= 5;
      SET @old_parallel= @@GLOBAL.slave_parallel_threads;
      SET @old_mode= @@GLOBAL.slave_parallel_mode;
      SET GLOBAL slave_parallel_mode= aggressive;
      SET GLOBAL slave_parallel_threads= 20;
       
      --connection master
      UPDATE t1 SET b=b+1 WHERE a=6;
       
      --disable_query_log
      let $i= 0;
      while ($i < 40) {
        eval UPDATE t1 SET b=b+1 WHERE a=2;
        inc $i;
      }
      --enable_query_log
       
      SELECT * FROM t1 ORDER BY a;
      --save_master_pos
       
      --connection slave1
      # Block first worker, and recursively pause all following workers that get
      # temporary errors before they can retry.
      BEGIN;
      SELECT * FROM t1 WHERE a=6 FOR UPDATE;
       
      --connection slave
      # Cause initial row not found error.
      SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2;
       
      --source include/start_slave.inc
       
      --sleep 2
      # Now following workers should be waiting for prior commit before retrying.
      # Remove the row not found error.
      SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7;
       
      --connection slave1
      ROLLBACK;
       
      --connection slave
      --sync_with_master
       
      SELECT * FROM t1 ORDER BY a;
       
      # Cleanup
      --connection slave
      --source include/stop_slave.inc
      SET GLOBAL innodb_lock_wait_timeout= @old_timeout;
      SET GLOBAL slave_parallel_threads= @old_parallel;
      SET GLOBAL slave_parallel_mode= @old_mode;
      --source include/start_slave.inc
       
      --connection default
      DROP TABLE t1;
       
      --source include/rpl_end.inc
      

      Attachments

        Issue Links

          Activity

            People

              bnestere Brandon Nesterenko
              bnestere Brandon Nesterenko
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.