[MDEV-33954] --slave-skip-errors=all Incompatible with Optimistic Parallel Replication - Jira

Details

Type: Bug
Status: Confirmed
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.5(EOL), 10.6, 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL)
Fix Version/s: 10.6, 10.11, 11.4
Component/s: Replication
Labels:

Description

After fixing MDEV-27512, a bug remains involving optimistic parallel
replication and deadlocks in row format.

Now, --slave-skip-errors is obviously a dangerous option, but presumably the
use case is to make the slave continue at all costs in case an unexpected
error occurs, rather than stop the slave and break replication.

But optimistic parallel replication by design can introduce many different
transient errors due to conflicts, which are then handled by rolling back and
retrying the offending transactions. These errors are expected, and they
do not cause the slave to stop or replication to break.

However, it appears that --slave-skip-errors also affects such transient
errors due to optimistic parallel replication, and will cause any such
transaction to be silently ignored! This must be very wrong; it will cause
massive replication divergence.

It seems to me this is the real bug here. When an error is encountered
during optimistic apply of a transaction (is_parallel_retry_error() returns
true, e.g. rgi->speculation == SPECULATE_OPTIMISTIC), then this error should
not be subject to --slave-skip-errors. The transaction should be rolled
back as normal, and wait_for_prior_commit() done. Then, after setting
rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error,
--slave-skip-errors can apply.
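
The proposed ordering can be sketched as a small standalone C++ model. This is not MariaDB server code: the Speculation enum only mirrors the SPECULATE_OPTIMISTIC and SPECULATE_WAIT states named above, while Outcome, apply_transaction, try_apply and error_is_skippable are hypothetical names introduced for illustration. It also assumes a single non-speculative retry, which may be simpler than what the server actually does.

```cpp
#include <functional>

// Stand-in for the SPECULATE_OPTIMISTIC / SPECULATE_WAIT states named in
// the report; not the server's own definition.
enum class Speculation { Optimistic, Wait };

// Hypothetical result of trying to replicate one transaction.
enum class Outcome { Applied, Skipped, Stopped };

// try_apply is a hypothetical callback: it attempts the transaction in the
// given mode and returns true on success, false on error.
// error_is_skippable models "--slave-skip-errors covers this error code".
Outcome apply_transaction(const std::function<bool(Speculation)> &try_apply,
                          bool error_is_skippable) {
  // First attempt: applied optimistically, in parallel with earlier
  // transactions.
  if (try_apply(Speculation::Optimistic))
    return Outcome::Applied;

  // An error during optimistic apply may just be a conflict, so the skip
  // list is deliberately NOT consulted here. The server would roll back and
  // call wait_for_prior_commit() at this point (not modeled), then retry
  // without speculation.
  if (try_apply(Speculation::Wait))
    return Outcome::Applied;

  // The error persists without speculation: only now may the skip list
  // decide between skipping the transaction and stopping the slave.
  return error_is_skippable ? Outcome::Skipped : Outcome::Stopped;
}
```

In this model, a transaction that fails only in Optimistic mode is always Applied, regardless of error_is_skippable. That is the property the report asks for.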

I put together a quick test that seems to show this behaviour, included
below. This test replicates correctly, without replication stopping with an
error. But when run with --mysqld=--slave-skip-errors=all, it replicates
incorrectly, skipping lots of transactions. The test is somewhat contrived,
but I think it shows the real problem: --slave-skip-errors can randomly
cause transactions to be skipped or not, depending on whether optimistic
parallel replication triggers a matching transient error.

So in summary, it looks like there is a real problem here: optimistic
parallel replication does not work correctly with --slave-skip-errors.
Transient errors caused by conflicts incorrectly lead to transactions being
skipped rather than retried. This will cause replication to diverge even
when no real errors occur.
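
The divergence can be illustrated with a toy C++ model of a replicated counter, similar in spirit to the repeated `UPDATE t1 SET b=b+1` in the test below. Everything here is an assumption made for illustration: replicate_counter, the conflict vector and the skip_transient flag are hypothetical, and which transactions actually hit a transient conflict depends on timing in a real server.

```cpp
#include <cassert>
#include <vector>

// Toy model: each replicated transaction increments a counter by one.
// conflict[i] says whether transaction i hits a transient conflict on its
// optimistic attempt (hypothetical input; timing-dependent in practice).
// skip_transient models the reported behaviour, where such an error is
// swallowed by --slave-skip-errors instead of being retried.
int replicate_counter(int start, const std::vector<bool> &conflict,
                      bool skip_transient) {
  assert(start >= 0);
  int b = start;
  for (bool hit_conflict : conflict) {
    if (hit_conflict && skip_transient)
      continue;  // transaction silently dropped: the increment is lost
    ++b;         // applied, either directly or after rollback and retry
  }
  return b;
}
```

For example, with start value 2 and conflicts on the second and third of four transactions, retrying yields 6 (matching the master), while skipping yields 4. No real error occurred in either case, yet the slave's data differs.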

--source include/have_innodb.inc
--source include/have_binlog_format_row.inc
--source include/master-slave.inc

--connection master
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);
--sync_slave_with_master
--source include/stop_slave.inc
CHANGE MASTER TO master_use_gtid=slave_pos;
SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;
SET GLOBAL innodb_lock_wait_timeout= 5;
SET @old_parallel= @@GLOBAL.slave_parallel_threads;
SET @old_mode= @@GLOBAL.slave_parallel_mode;
SET GLOBAL slave_parallel_mode= aggressive;
SET GLOBAL slave_parallel_threads= 20;

--connection master
UPDATE t1 SET b=b+1 WHERE a=6;
--disable_query_log
let $i= 0;
while ($i < 40) {
  eval UPDATE t1 SET b=b+1 WHERE a=2;
  inc $i;
}
--enable_query_log
SELECT * FROM t1 ORDER BY a;
--save_master_pos

--connection slave1
# Block first worker, and recursively pause all following workers that get
# temporary errors before they can retry.
BEGIN;
SELECT * FROM t1 WHERE a=6 FOR UPDATE;

--connection slave
# Cause initial row not found error.
SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2;
--source include/start_slave.inc
--sleep 2
# Now following workers should be waiting for prior commit before retrying.
# Remove the row not found error.
SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7;

--connection slave1
ROLLBACK;

--connection slave
--sync_with_master
SELECT * FROM t1 ORDER BY a;

# Cleanup
--connection slave
--source include/stop_slave.inc
SET GLOBAL innodb_lock_wait_timeout= @old_timeout;
SET GLOBAL slave_parallel_threads= @old_parallel;
SET GLOBAL slave_parallel_mode= @old_mode;
--source include/start_slave.inc

--connection default
DROP TABLE t1;
--source include/rpl_end.inc

Issue Links

relates to

MDEV-34010 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid ..., Internal MariaDB error code: 1964 (Open)

MDEV-27512 Assertion `! thd->transaction_rollback_request' failed in rows_event_stmt_cleanup (Closed)

People

Assignee: Brandon Nesterenko

Reporter: Brandon Nesterenko

Votes: 1

Watchers: 4

Dates

Created: 2024-04-19 21:11

Updated: 2025-07-14 21:43

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1d 1h 55m
