Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33954

--slave-skip-errors=all Incompatible with Optimistic Parallel Replication

Details

    Description

      After fixing MDEV-27512, there remains a bug with optimistic parallel replication, and deadlocks in row format.

      From knielsen:

      Now, --slave-skip-errors is obviously a dangerous option, but presumably the
      use case is to make the slave continue at all costs in case an unexpected
      error occurs, rather than stop the slave and break replication.

      But optimistic parallel replication by design can introduce many different
      transient errors due to conflicts, that are then handled by rolling back and
      retrying the offending transactions. These errors are expected, and they
      do not cause slave to stop or replication to break.

      However, it appears that --slave_skip_errors also affects such transient
      errors due to optimistic parallel replication, and will cause any such
      transaction to be silently ignored! This must be very wrong, it will cause
      massive replication divergence.

      It seems to me this is the real bug here. When an error is encountered
      during optimistic apply of a transaction (is_parallel_retry_error() returns
      true, eg. rgi->speculation == SPECULATE_OPTIMISTIC, then this error should
      not be subject to --slave-skip-errors. The transaction should be rolled
      back as normal, and wait_for_prior_commit() done. Then after setting
      rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error,
      --slave-skip-errors can apply.

      I put together a quick test that seems to show this behaviour, included
      below. This tests replicates correctly without replication stopping with
      error. But running it with --mysqld=--slave-skip-errors=all, it replicates
      incorrectly, skipping lots of transactions. The test is somewhat contrieved,
      but I think it shows the real problem, that --slave-skip-errors can randomly
      cause transactions to be skipped or not depending on if optimistic parallel
      replication triggers a matching transient error or not.

      So in summary, it looks like there is a real problem here, that optimistic
      parallel replication is not working correctly with --slave-skip-errors,
      transient errors incorrectly causes conflicts to skip transactions rather
      than retrying them. This will cause replication to diverge even when no real
      errors occur.

      --source include/have_innodb.inc
      --source include/have_binlog_format_row.inc
      --source include/master-slave.inc
       
      --connection master
      ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
      CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
       
      --sync_slave_with_master
       
      --source include/stop_slave.inc
      CHANGE MASTER TO master_use_gtid=slave_pos;
      SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;
      SET GLOBAL innodb_lock_wait_timeout= 5;
      SET @old_parallel= @@GLOBAL.slave_parallel_threads;
      SET @old_mode= @@GLOBAL.slave_parallel_mode;
      SET GLOBAL slave_parallel_mode= aggressive;
      SET GLOBAL slave_parallel_threads= 20;
       
      --connection master
      UPDATE t1 SET b=b+1 WHERE a=6;
       
      --disable_query_log
      let $i= 0;
      while ($i < 40) {
        eval UPDATE t1 SET b=b+1 WHERE a=2;
        inc $i;
      }
      --enable_query_log
       
      SELECT * FROM t1 ORDER BY a;
      --save_master_pos
       
      --connection slave1
      # Block first worker, and recursively pause all following workers that get
      # temporary errors before they can retry.
      BEGIN;
      SELECT * FROM t1 WHERE a=6 FOR UPDATE;
       
      --connection slave
      # Cause initial row not found error.
      SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2;
       
      --source include/start_slave.inc
       
      --sleep 2
      # Now following workers should be waiting for prior commit before retrying.
      # Remove the row not found error.
      SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7;
       
      --connection slave1
      ROLLBACK;
       
      --connection slave
      --sync_with_master
       
      SELECT * FROM t1 ORDER BY a;
       
      # Cleanup
      --connection slave
      --source include/stop_slave.inc
      SET GLOBAL innodb_lock_wait_timeout= @old_timeout;
      SET GLOBAL slave_parallel_threads= @old_parallel;
      SET GLOBAL slave_parallel_mode= @old_mode;
      --source include/start_slave.inc
       
      --connection default
      DROP TABLE t1;
       
      --source include/rpl_end.inc
      

      Attachments

        Issue Links

          Activity

            I think there are some errors that must never be ignored, even with --slave-skip-errors=all.
            In particular 1964 (ER_PRIOR_COMMIT_FAILED). If we stop replication due to an error in transaction T1, we cannot ignore that and continue committing transactions T2, T3, ...
            We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly.

            Also, deadlock kill errors must not be ignored, as they are central to the workings of optimistic parallel replication. Maybe none of the normal temporary errors such as deadlock or lock wait timeout should be ignorable? It seems more correct to handle the error and retry the transaction, resulting in the transaction eventually committing without error. Only if the max number of retries is reached can such errors be ignored from --slave-skip-errors.

            knielsen Kristian Nielsen added a comment - I think there are some errors that must never be ignored, even with --slave-skip-errors=all. In particular 1964 (ER_PRIOR_COMMIT_FAILED). If we stop replication due to an error in transaction T1, we cannot ignore that and continue committing transactions T2, T3, ... We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly. Also, deadlock kill errors must not be ignored, as they are central to the workings of optimistic parallel replication. Maybe none of the normal temporary errors such as deadlock or lock wait timeout should be ignorable? It seems more correct to handle the error and retry the transaction, resulting in the transaction eventually committing without error. Only if the max number of retries is reached can such errors be ignored from --slave-skip-errors.

            Hi knielsen,

            To your concern of

            We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly.

            A bit of a long response here (in short, there is a problem).

            The only event types which check ignored values are Row, Query and Incident log events. The only places where we send wake-up messages are in finish_event_group, trx_group_commit_leader, and write_transaction_to_binlog_events.

            The problems seem to lie exclusively with Query_log_events. Your specific scenario is possible (if I understand correctly) only with non-transactional tables, as they use COMMIT/ROLLBACK Query events to end transactions, which during binlog_group_commit, will signal an error to future threads; and then when back at Query_log_event::do_apply_event(), will then be ignored. However, as non-transactional tables aren’t really valid with parallel replication, this doesn’t seem to be a big concern.

            But DDLs (and probably admin commands, XA, etc) seem to produce different replication-breaking behaviors (I imagine), as these binlog during their execution. So if binlogging the event fails, it can produce a broken binlog (e.g. a standalone GTID event if its proceeding Query_log_event::write() fails). I’m sure there are more issues here as well, I just haven’t explored the breadth here.

            Rows_log_events and Incident_log_events seem safe.

            Then also, Xid_log_event’s do not check against ignored errors, which I wonder if they should? This might already be filed though, I haven't checked.

            bnestere Brandon Nesterenko added a comment - Hi knielsen , To your concern of We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly. A bit of a long response here (in short, there is a problem). The only event types which check ignored values are Row, Query and Incident log events. The only places where we send wake-up messages are in finish_event_group, trx_group_commit_leader, and write_transaction_to_binlog_events. The problems seem to lie exclusively with Query_log_events. Your specific scenario is possible (if I understand correctly) only with non-transactional tables, as they use COMMIT/ROLLBACK Query events to end transactions, which during binlog_group_commit, will signal an error to future threads; and then when back at Query_log_event::do_apply_event(), will then be ignored. However, as non-transactional tables aren’t really valid with parallel replication, this doesn’t seem to be a big concern. But DDLs (and probably admin commands, XA, etc) seem to produce different replication-breaking behaviors (I imagine), as these binlog during their execution. So if binlogging the event fails, it can produce a broken binlog (e.g. a standalone GTID event if its proceeding Query_log_event::write() fails). I’m sure there are more issues here as well, I just haven’t explored the breadth here. Rows_log_events and Incident_log_events seem safe. Then also, Xid_log_event’s do not check against ignored errors, which I wonder if they should? This might already be filed though, I haven't checked.

            knielsen Hi! Any feedback on the above? It would be great to see this issue resolved from a testing perspective. Thank you

            Roel Roel Van de Paar added a comment - knielsen Hi! Any feedback on the above? It would be great to see this issue resolved from a testing perspective. Thank you

            Well, let me repeat the primary feedback: Don't use --slave-skip-errors in your normal automated testing! Seriously. You should never use --slave-skip-errors in testing, except if testing the correctness of that specific option - and there are a lot more important areas of replication to prioritise the efforts on than a marginal feature such as --slave-skip-errors.

            As to Brandon's comments, my point was that the code must be consistent in the handling of an error. If the error is ignored due to --slave-skip-errors, it should not cause a ER_PRIOR_COMMIT_FAILED in a subsequent transaction. And wise versa, if an error is not ignored, then the following transaction must receive ER_PRIOR_COMMIT_FAILED, and must not ignore that error.

            Note that this does apply to non-transactional tables. These are valid in parallel replication, just with restrictions on the parallellism, but they still need consistent handling of ER_PRIOR_COMMIT_FAILED. I think it's reasonable to make --slave-skip-errors not apply to errors in COMMIT/ROLLBACK (and similarly Xid_log_event, as seems to already be the case).

            Yes, if binlogging fails this is a serious and generally non-recoverable error. Again, it could be reasonable that --slave-skip-errors does not apply to these. But it's not something to try to fix, I think, the --slave-skip-errors is documented to be a very special option meant only for testing, it is on the user's discretion to use it in a way that is appropriate.

            The most important thing is to ensure that --slave-skip-errors does not cause the skipping of retry of a transaction to handle a temporary error. This really applies to both parallel and non-parallel replication, but is especially important with optimistic parallel replication.

            In general, some errors cannot be ignored, and it is reasonable to restrict --slave-skip-errors in these cases. But --slave-skip-errors is a very special-purpose and fragile option, and it's not something to spend a lot of effort on, better keep things simple and leave it to the user's discression how it can be used with whatever restrictions and caveats apply.

            knielsen Kristian Nielsen added a comment - Well, let me repeat the primary feedback: Don't use --slave-skip-errors in your normal automated testing! Seriously. You should never use --slave-skip-errors in testing, except if testing the correctness of that specific option - and there are a lot more important areas of replication to prioritise the efforts on than a marginal feature such as --slave-skip-errors. As to Brandon's comments, my point was that the code must be consistent in the handling of an error. If the error is ignored due to --slave-skip-errors, it should not cause a ER_PRIOR_COMMIT_FAILED in a subsequent transaction. And wise versa, if an error is not ignored, then the following transaction must receive ER_PRIOR_COMMIT_FAILED, and must not ignore that error. Note that this does apply to non-transactional tables. These are valid in parallel replication, just with restrictions on the parallellism, but they still need consistent handling of ER_PRIOR_COMMIT_FAILED. I think it's reasonable to make --slave-skip-errors not apply to errors in COMMIT/ROLLBACK (and similarly Xid_log_event, as seems to already be the case). Yes, if binlogging fails this is a serious and generally non-recoverable error. Again, it could be reasonable that --slave-skip-errors does not apply to these. But it's not something to try to fix, I think, the --slave-skip-errors is documented to be a very special option meant only for testing, it is on the user's discretion to use it in a way that is appropriate. The most important thing is to ensure that --slave-skip-errors does not cause the skipping of retry of a transaction to handle a temporary error. This really applies to both parallel and non-parallel replication, but is especially important with optimistic parallel replication. In general, some errors cannot be ignored, and it is reasonable to restrict --slave-skip-errors in these cases. But --slave-skip-errors is a very special-purpose and fragile option, and it's not something to spend a lot of effort on, better keep things simple and leave it to the user's discression how it can be used with whatever restrictions and caveats apply.

            knielsen, I understand and appreciate your advice on slave-skip-errors, but I respectfully disagree on this point.

            When we test replication features, we include exploratory testing by replaying various statements randomly against the master using various configurations. This approach can cause a range of issues on the slave side. Some issues are valid, like testcase shortcomings, while others are due to replication shortcomings and bugs. There are also a few borderline cases, such as the ~2 tickets where we discussed --slave-skip-errors.

            Over the past year, we've developed numerous filters to automatically handle most valid testcase shortcomings. Our exploratory replication testing runs can now quickly identify major issues, whether they involve crashes, assertions, or failures/errors on the slave side.

            If we remove --slave-skip-errors=all from such testing, it would prematurely halt many tests on the slave side, thereby negating the significant progress we've made in locating replication bugs.

            It's also worth noting that we are almost always to remove slave-skip-errors during test trial analysis, after reducing the testcase to a smaller size. i.e. without slave-skip-errors, for many test trials the slave (unlike the master) would not have progressed far enough to have reached the failing SQL statement, while at the same time the option is not required in the final testcase.

            As such it is very beneficial to keep --slave-skip-errors=all a part of our exploratory testing, which is an important part in our overall replication testing. Your efforts to help us address and fix any issues with it are much appreciated. Thank you

            Roel Roel Van de Paar added a comment - knielsen , I understand and appreciate your advice on slave-skip-errors, but I respectfully disagree on this point. When we test replication features, we include exploratory testing by replaying various statements randomly against the master using various configurations. This approach can cause a range of issues on the slave side. Some issues are valid, like testcase shortcomings, while others are due to replication shortcomings and bugs. There are also a few borderline cases, such as the ~2 tickets where we discussed --slave-skip-errors. Over the past year, we've developed numerous filters to automatically handle most valid testcase shortcomings. Our exploratory replication testing runs can now quickly identify major issues, whether they involve crashes, assertions, or failures/errors on the slave side. If we remove --slave-skip-errors=all from such testing, it would prematurely halt many tests on the slave side, thereby negating the significant progress we've made in locating replication bugs. It's also worth noting that we are almost always to remove slave-skip-errors during test trial analysis, after reducing the testcase to a smaller size. i.e. without slave-skip-errors, for many test trials the slave (unlike the master) would not have progressed far enough to have reached the failing SQL statement, while at the same time the option is not required in the final testcase. As such it is very beneficial to keep --slave-skip-errors=all a part of our exploratory testing, which is an important part in our overall replication testing. Your efforts to help us address and fix any issues with it are much appreciated. Thank you

            People

              bnestere Brandon Nesterenko
              bnestere Brandon Nesterenko
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.