[MDEV-33954] --slave-skip-errors=all Incompatible with Optimistic Parallel Replication - Jira

Details

Type: Bug
Status: Confirmed (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.5, 10.6, 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL)
Fix Version/s: 10.5, 10.6, 10.11, 11.4
Component/s: Replication
Labels:

Description

After fixing ~~MDEV-27512~~, there remains a bug with optimistic parallel replication, and deadlocks in row format.

Now, --slave-skip-errors is obviously a dangerous option, but presumably the
use case is to make the slave continue at all costs in case an unexpected
error occurs, rather than stop the slave and break replication.

But optimistic parallel replication by design can introduce many different
transient errors due to conflicts, that are then handled by rolling back and
retrying the offending transactions. These errors are expected, and they
do not cause slave to stop or replication to break.

However, it appears that --slave_skip_errors also affects such transient
errors due to optimistic parallel replication, and will cause any such
transaction to be silently ignored! This must be very wrong, it will cause
massive replication divergence.

It seems to me this is the real bug here. When an error is encountered
during optimistic apply of a transaction (is_parallel_retry_error() returns
true, eg. rgi->speculation == SPECULATE_OPTIMISTIC, then this error should
not be subject to --slave-skip-errors. The transaction should be rolled
back as normal, and wait_for_prior_commit() done. Then after setting
rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error,
--slave-skip-errors can apply.

I put together a quick test that seems to show this behaviour, included
below. This tests replicates correctly without replication stopping with
error. But running it with --mysqld=--slave-skip-errors=all, it replicates
incorrectly, skipping lots of transactions. The test is somewhat contrieved,
but I think it shows the real problem, that --slave-skip-errors can randomly
cause transactions to be skipped or not depending on if optimistic parallel
replication triggers a matching transient error or not.

So in summary, it looks like there is a real problem here, that optimistic
parallel replication is not working correctly with --slave-skip-errors,
transient errors incorrectly causes conflicts to skip transactions rather
than retrying them. This will cause replication to diverge even when no real
errors occur.

--source include/have_innodb.inc

--source include/have_binlog_format_row.inc

--source include/master-slave.inc

--connection master

ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;

CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;

INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);

--sync_slave_with_master

--source include/stop_slave.inc

CHANGE MASTER TO master_use_gtid=slave_pos;

SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;

SET GLOBAL innodb_lock_wait_timeout= 5;

SET @old_parallel= @@GLOBAL.slave_parallel_threads;

SET @old_mode= @@GLOBAL.slave_parallel_mode;

SET GLOBAL slave_parallel_mode= aggressive;

SET GLOBAL slave_parallel_threads= 20;

--connection master

UPDATE t1 SET b=b+1 WHERE a=6;

--disable_query_log

let $i= 0;

while ($i < 40) {

  eval UPDATE t1 SET b=b+1 WHERE a=2;

  inc $i;

--enable_query_log

SELECT * FROM t1 ORDER BY a;

--save_master_pos

--connection slave1

# Block first worker, and recursively pause all following workers that get

# temporary errors before they can retry.

BEGIN;

SELECT * FROM t1 WHERE a=6 FOR UPDATE;

--connection slave

# Cause initial row not found error.

SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2;

--source include/start_slave.inc

--sleep 2

# Now following workers should be waiting for prior commit before retrying.

# Remove the row not found error.

SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7;

--connection slave1

ROLLBACK;

--connection slave

--sync_with_master

SELECT * FROM t1 ORDER BY a;

# Cleanup

--connection slave

--source include/stop_slave.inc

SET GLOBAL innodb_lock_wait_timeout= @old_timeout;

SET GLOBAL slave_parallel_threads= @old_parallel;

SET GLOBAL slave_parallel_mode= @old_mode;

--source include/start_slave.inc

--connection default

DROP TABLE t1;

--source include/rpl_end.inc

Attachments

Issue Links

relates to

MDEV-34010 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid ..., Internal MariaDB error code: 1964

Open

MDEV-27512 Assertion `! thd->transaction_rollback_request' failed in rows_event_stmt_cleanup

Closed

Activity

Ascending order - Click to sort in descending order

Kristian Nielsen added a comment - 2024-04-27 09:55

I think there are some errors that must never be ignored, even with --slave-skip-errors=all.
In particular 1964 (ER_PRIOR_COMMIT_FAILED). If we stop replication due to an error in transaction T1, we cannot ignore that and continue committing transactions T2, T3, ...
We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly.

Also, deadlock kill errors must not be ignored, as they are central to the workings of optimistic parallel replication. Maybe none of the normal temporary errors such as deadlock or lock wait timeout should be ignorable? It seems more correct to handle the error and retry the transaction, resulting in the transaction eventually committing without error. Only if the max number of retries is reached can such errors be ignored from --slave-skip-errors.

Kristian Nielsen added a comment - 2024-04-27 09:55 I think there are some errors that must never be ignored, even with --slave-skip-errors=all. In particular 1964 (ER_PRIOR_COMMIT_FAILED). If we stop replication due to an error in transaction T1, we cannot ignore that and continue committing transactions T2, T3, ... We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly. Also, deadlock kill errors must not be ignored, as they are central to the workings of optimistic parallel replication. Maybe none of the normal temporary errors such as deadlock or lock wait timeout should be ignorable? It seems more correct to handle the error and retry the transaction, resulting in the transaction eventually committing without error. Only if the max number of retries is reached can such errors be ignored from --slave-skip-errors.

Brandon Nesterenko added a comment - 2024-06-24 21:22

Hi knielsen,

To your concern of

We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly.

A bit of a long response here (in short, there is a problem).

The only event types which check ignored values are Row, Query and Incident log events. The only places where we send wake-up messages are in finish_event_group, trx_group_commit_leader, and write_transaction_to_binlog_events.

The problems seem to lie exclusively with Query_log_events. Your specific scenario is possible (if I understand correctly) only with non-transactional tables, as they use COMMIT/ROLLBACK Query events to end transactions, which during binlog_group_commit, will signal an error to future threads; and then when back at Query_log_event::do_apply_event(), will then be ignored. However, as non-transactional tables aren’t really valid with parallel replication, this doesn’t seem to be a big concern.

But DDLs (and probably admin commands, XA, etc) seem to produce different replication-breaking behaviors (I imagine), as these binlog during their execution. So if binlogging the event fails, it can produce a broken binlog (e.g. a standalone GTID event if its proceeding Query_log_event::write() fails). I’m sure there are more issues here as well, I just haven’t explored the breadth here.

Rows_log_events and Incident_log_events seem safe.

Then also, Xid_log_event’s do not check against ignored errors, which I wonder if they should? This might already be filed though, I haven't checked.

Brandon Nesterenko added a comment - 2024-06-24 21:22 Hi knielsen , To your concern of We also need to be sure that there isn't somewhere in the code where an error occurs in T1, which then sends an ER_PRIOR_COMMIT_FAILED on to T2, and later T1 decides to ignore its own error and continue as normal - then the ER_PRIOR_COMMIT_FAILED was sent incorrectly. A bit of a long response here (in short, there is a problem). The only event types which check ignored values are Row, Query and Incident log events. The only places where we send wake-up messages are in finish_event_group, trx_group_commit_leader, and write_transaction_to_binlog_events. The problems seem to lie exclusively with Query_log_events. Your specific scenario is possible (if I understand correctly) only with non-transactional tables, as they use COMMIT/ROLLBACK Query events to end transactions, which during binlog_group_commit, will signal an error to future threads; and then when back at Query_log_event::do_apply_event(), will then be ignored. However, as non-transactional tables aren’t really valid with parallel replication, this doesn’t seem to be a big concern. But DDLs (and probably admin commands, XA, etc) seem to produce different replication-breaking behaviors (I imagine), as these binlog during their execution. So if binlogging the event fails, it can produce a broken binlog (e.g. a standalone GTID event if its proceeding Query_log_event::write() fails). I’m sure there are more issues here as well, I just haven’t explored the breadth here. Rows_log_events and Incident_log_events seem safe. Then also, Xid_log_event’s do not check against ignored errors, which I wonder if they should? This might already be filed though, I haven't checked.

Roel Van de Paar added a comment - 2024-06-28 22:59

knielsen Hi! Any feedback on the above? It would be great to see this issue resolved from a testing perspective. Thank you

Roel Van de Paar added a comment - 2024-06-28 22:59 knielsen Hi! Any feedback on the above? It would be great to see this issue resolved from a testing perspective. Thank you

Kristian Nielsen added a comment - 2024-06-29 11:25

Well, let me repeat the primary feedback: Don't use --slave-skip-errors in your normal automated testing! Seriously. You should never use --slave-skip-errors in testing, except if testing the correctness of that specific option - and there are a lot more important areas of replication to prioritise the efforts on than a marginal feature such as --slave-skip-errors.

As to Brandon's comments, my point was that the code must be consistent in the handling of an error. If the error is ignored due to --slave-skip-errors, it should not cause a ER_PRIOR_COMMIT_FAILED in a subsequent transaction. And wise versa, if an error is not ignored, then the following transaction must receive ER_PRIOR_COMMIT_FAILED, and must not ignore that error.

Note that this does apply to non-transactional tables. These are valid in parallel replication, just with restrictions on the parallellism, but they still need consistent handling of ER_PRIOR_COMMIT_FAILED. I think it's reasonable to make --slave-skip-errors not apply to errors in COMMIT/ROLLBACK (and similarly Xid_log_event, as seems to already be the case).

Yes, if binlogging fails this is a serious and generally non-recoverable error. Again, it could be reasonable that --slave-skip-errors does not apply to these. But it's not something to try to fix, I think, the --slave-skip-errors is documented to be a very special option meant only for testing, it is on the user's discretion to use it in a way that is appropriate.

The most important thing is to ensure that --slave-skip-errors does not cause the skipping of retry of a transaction to handle a temporary error. This really applies to both parallel and non-parallel replication, but is especially important with optimistic parallel replication.

In general, some errors cannot be ignored, and it is reasonable to restrict --slave-skip-errors in these cases. But --slave-skip-errors is a very special-purpose and fragile option, and it's not something to spend a lot of effort on, better keep things simple and leave it to the user's discression how it can be used with whatever restrictions and caveats apply.

Kristian Nielsen added a comment - 2024-06-29 11:25 Well, let me repeat the primary feedback: Don't use --slave-skip-errors in your normal automated testing! Seriously. You should never use --slave-skip-errors in testing, except if testing the correctness of that specific option - and there are a lot more important areas of replication to prioritise the efforts on than a marginal feature such as --slave-skip-errors. As to Brandon's comments, my point was that the code must be consistent in the handling of an error. If the error is ignored due to --slave-skip-errors, it should not cause a ER_PRIOR_COMMIT_FAILED in a subsequent transaction. And wise versa, if an error is not ignored, then the following transaction must receive ER_PRIOR_COMMIT_FAILED, and must not ignore that error. Note that this does apply to non-transactional tables. These are valid in parallel replication, just with restrictions on the parallellism, but they still need consistent handling of ER_PRIOR_COMMIT_FAILED. I think it's reasonable to make --slave-skip-errors not apply to errors in COMMIT/ROLLBACK (and similarly Xid_log_event, as seems to already be the case). Yes, if binlogging fails this is a serious and generally non-recoverable error. Again, it could be reasonable that --slave-skip-errors does not apply to these. But it's not something to try to fix, I think, the --slave-skip-errors is documented to be a very special option meant only for testing, it is on the user's discretion to use it in a way that is appropriate. The most important thing is to ensure that --slave-skip-errors does not cause the skipping of retry of a transaction to handle a temporary error. This really applies to both parallel and non-parallel replication, but is especially important with optimistic parallel replication. In general, some errors cannot be ignored, and it is reasonable to restrict --slave-skip-errors in these cases. But --slave-skip-errors is a very special-purpose and fragile option, and it's not something to spend a lot of effort on, better keep things simple and leave it to the user's discression how it can be used with whatever restrictions and caveats apply.

Roel Van de Paar added a comment - 2024-06-30 23:03

knielsen, I understand and appreciate your advice on slave-skip-errors, but I respectfully disagree on this point.

When we test replication features, we include exploratory testing by replaying various statements randomly against the master using various configurations. This approach can cause a range of issues on the slave side. Some issues are valid, like testcase shortcomings, while others are due to replication shortcomings and bugs. There are also a few borderline cases, such as the ~2 tickets where we discussed --slave-skip-errors.

Over the past year, we've developed numerous filters to automatically handle most valid testcase shortcomings. Our exploratory replication testing runs can now quickly identify major issues, whether they involve crashes, assertions, or failures/errors on the slave side.

If we remove --slave-skip-errors=all from such testing, it would prematurely halt many tests on the slave side, thereby negating the significant progress we've made in locating replication bugs.

It's also worth noting that we are almost always to remove slave-skip-errors during test trial analysis, after reducing the testcase to a smaller size. i.e. without slave-skip-errors, for many test trials the slave (unlike the master) would not have progressed far enough to have reached the failing SQL statement, while at the same time the option is not required in the final testcase.

As such it is very beneficial to keep --slave-skip-errors=all a part of our exploratory testing, which is an important part in our overall replication testing. Your efforts to help us address and fix any issues with it are much appreciated. Thank you

Roel Van de Paar added a comment - 2024-06-30 23:03 knielsen , I understand and appreciate your advice on slave-skip-errors, but I respectfully disagree on this point. When we test replication features, we include exploratory testing by replaying various statements randomly against the master using various configurations. This approach can cause a range of issues on the slave side. Some issues are valid, like testcase shortcomings, while others are due to replication shortcomings and bugs. There are also a few borderline cases, such as the ~2 tickets where we discussed --slave-skip-errors. Over the past year, we've developed numerous filters to automatically handle most valid testcase shortcomings. Our exploratory replication testing runs can now quickly identify major issues, whether they involve crashes, assertions, or failures/errors on the slave side. If we remove --slave-skip-errors=all from such testing, it would prematurely halt many tests on the slave side, thereby negating the significant progress we've made in locating replication bugs. It's also worth noting that we are almost always to remove slave-skip-errors during test trial analysis, after reducing the testcase to a smaller size. i.e. without slave-skip-errors, for many test trials the slave (unlike the master) would not have progressed far enough to have reached the failing SQL statement, while at the same time the option is not required in the final testcase. As such it is very beneficial to keep --slave-skip-errors=all a part of our exploratory testing, which is an important part in our overall replication testing. Your efforts to help us address and fix any issues with it are much appreciated. Thank you

MariaDB Server

--slave-skip-errors=all Incompatible with Optimistic Parallel Replication

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration