Details
- Type: Bug
- Status: Confirmed
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 10.5, 10.6, 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL), 11.3(EOL), 11.4, 11.5(EOL)
Description
After the fix for MDEV-27512, a bug remains involving optimistic parallel replication and deadlocks in row format.
From knielsen:
Now, --slave-skip-errors is obviously a dangerous option, but presumably the
use case is to make the slave continue at all costs in case an unexpected
error occurs, rather than stop the slave and break replication.

But optimistic parallel replication by design can introduce many different
transient errors due to conflicts, which are then handled by rolling back and
retrying the offending transactions. These errors are expected, and they
do not cause the slave to stop or replication to break.

However, it appears that --slave-skip-errors also affects such transient
errors due to optimistic parallel replication, and will cause any such
transaction to be silently ignored! This must be very wrong; it will cause
massive replication divergence.

It seems to me this is the real bug here. When an error is encountered
during optimistic apply of a transaction (is_parallel_retry_error() returns
true, e.g. rgi->speculation == SPECULATE_OPTIMISTIC), then this error should
not be subject to --slave-skip-errors. The transaction should be rolled
back as normal, and wait_for_prior_commit() done. Then after setting
rgi->speculation = SPECULATE_WAIT and retrying, if we still get the error,
--slave-skip-errors can apply.

I put together a quick test that seems to show this behaviour, included
below. This test replicates correctly without replication stopping with an
error. But when run with --mysqld=--slave-skip-errors=all, it replicates
incorrectly, skipping lots of transactions. The test is somewhat contrived,
but I think it shows the real problem: --slave-skip-errors can randomly
cause transactions to be skipped or not, depending on whether optimistic
parallel replication triggers a matching transient error.

So in summary, it looks like there is a real problem here: optimistic
parallel replication does not work correctly with --slave-skip-errors.
Transient errors caused by conflicts incorrectly lead to transactions being
skipped rather than retried. This will cause replication to diverge even
when no real errors occur.
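
For illustration only (this is not part of knielsen's message or of the server source), the proposed ordering can be sketched as a small decision rule in Python. The names `decide`, `Action`, and the `is_retry_error` flag are hypothetical; the flag stands in for the case where is_parallel_retry_error() returns true for a transaction that was applied optimistically. Error code 1032 is used as the example because the test's injected failure is a row-not-found error.

```python
from enum import Enum


class Action(Enum):
    RETRY = "roll back, wait for prior commit, retry with SPECULATE_WAIT"
    SKIP = "ignore the error because --slave-skip-errors lists it"
    STOP = "stop the slave with the error"


def decide(error_code, is_retry_error, skip_errors):
    """Hypothetical model of the proposed handling order.

    is_retry_error: assumed to model is_parallel_retry_error() returning
                    true for an optimistically applied transaction.
    skip_errors:    set of error codes given to --slave-skip-errors.
    """
    # Transient conflict errors are retried first and never consult
    # --slave-skip-errors.
    if is_retry_error:
        return Action.RETRY
    # Only an error that persists after the non-speculative retry may
    # be skipped.
    if error_code in skip_errors:
        return Action.SKIP
    return Action.STOP


ER_KEY_NOT_FOUND = 1032  # "Can't find record" (row not found)
skip_set = {ER_KEY_NOT_FOUND}

first_attempt = decide(ER_KEY_NOT_FOUND, True, skip_set)
after_retry = decide(ER_KEY_NOT_FOUND, False, skip_set)
print(first_attempt.name, after_retry.name)
```

In this model the behaviour described in the report corresponds to checking `skip_errors` before `is_retry_error`, so a transient error would return `SKIP` on the first attempt and the transaction would be lost.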
--source include/have_innodb.inc
--source include/have_binlog_format_row.inc
--source include/master-slave.inc

--connection master
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1 (a INT PRIMARY KEY, b INT) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);

--sync_slave_with_master

--source include/stop_slave.inc
CHANGE MASTER TO master_use_gtid=slave_pos;
SET @old_timeout= @@GLOBAL.innodb_lock_wait_timeout;
SET GLOBAL innodb_lock_wait_timeout= 5;
SET @old_parallel= @@GLOBAL.slave_parallel_threads;
SET @old_mode= @@GLOBAL.slave_parallel_mode;
SET GLOBAL slave_parallel_mode= aggressive;
SET GLOBAL slave_parallel_threads= 20;

--connection master
UPDATE t1 SET b=b+1 WHERE a=6;

--disable_query_log
let $i= 0;
while ($i < 40) {
  eval UPDATE t1 SET b=b+1 WHERE a=2;
  inc $i;
}
--enable_query_log

SELECT * FROM t1 ORDER BY a;
--save_master_pos

--connection slave1
# Block first worker, and recursively pause all following workers that get
# temporary errors before they can retry.
BEGIN;
SELECT * FROM t1 WHERE a=6 FOR UPDATE;

--connection slave
# Cause initial row not found error.
SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=7 WHERE a=2;

--source include/start_slave.inc

--sleep 2
# Now following workers should be waiting for prior commit before retrying.
# Remove the row not found error.
SET STATEMENT sql_log_bin=0 FOR UPDATE t1 SET a=2 WHERE a=7;

--connection slave1
ROLLBACK;

--connection slave
--sync_with_master

SELECT * FROM t1 ORDER BY a;

# Cleanup
--connection slave
--source include/stop_slave.inc
SET GLOBAL innodb_lock_wait_timeout= @old_timeout;
SET GLOBAL slave_parallel_threads= @old_parallel;
SET GLOBAL slave_parallel_mode= @old_mode;
--source include/start_slave.inc

--connection default
DROP TABLE t1;

--source include/rpl_end.inc
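
To make the divergence concrete, here is a minimal Python model of what the test does to the row with a=2: it starts at b=2 and receives 40 `b=b+1` updates, so the master ends at b=42. The function `replay` and the set of failing transaction indexes are hypothetical. The report does not list the actual slave output, and which transactions hit a transient error in a real run is timing-dependent.

```python
def replay(initial_b, n_updates, transient_failures, skip_transient):
    """Apply n_updates increments to b, modelling the slave applier.

    transient_failures: indexes of transactions whose first (optimistic)
                        attempt hits a transient error (assumed values).
    skip_transient:     True models the reported behaviour, where such
                        a transaction is silently dropped; False models
                        rollback and retry, after which it applies.
    """
    b = initial_b
    for i in range(n_updates):
        if i in transient_failures and skip_transient:
            continue  # transaction lost: this increment never happens
        b += 1
    return b


failures = {3, 7, 11}  # made-up indexes, for illustration only

master_b = replay(2, 40, set(), skip_transient=False)
slave_reported = replay(2, 40, failures, skip_transient=True)
slave_retrying = replay(2, 40, failures, skip_transient=False)

print(master_b, slave_reported, slave_retrying)  # prints "42 39 42"
```

Each skipped transaction leaves b one short on the slave, so comparing the two `SELECT * FROM t1 ORDER BY a` results in the test is enough to detect the problem.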
Issue Links
- relates to
  - MDEV-34010 [ERROR] Slave SQL: Commit failed due to failure of an earlier commit on which this one depends, Gtid ..., Internal MariaDB error code: 1964 (Open)
  - MDEV-27512 Assertion `! thd->transaction_rollback_request' failed in rows_event_stmt_cleanup (Closed)