[MDEV-6589] Incorrect relay log start position when restarting SQL thread after error in parallel replication Created: 2014-08-15  Updated: 2015-03-04  Resolved: 2015-03-04

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.13
Fix Version/s: 10.0.18, 10.1.4

Type: Bug Priority: Critical
Reporter: Kristian Nielsen Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: gtid, parallelslave, replication

Issue Links:
Relates
relates to MDEV-4698 With GTID replication, relay logs can... Open
relates to MDEV-6551 Some replication errors are ignored i... Closed

 Description   

Suppose we are using parallel replication in GTID mode, using multiple
replication domains to run independent things in parallel.

Now suppose we get an error in one domain that causes replication to fail and
stop. Suppose further that the domain that fails is somewhat behind another
domain in the relay log. For example, we might have the events D1, D2, and E1;
event D1 in domain D fails while E1 in domain E has already been replicated and
committed.
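
For illustration, a master produces such an interleaving simply by switching
gtid_domain_id between transactions (a minimal sketch assuming a table t1; the
values and domain ids are arbitrary, and the test case below uses the same
technique):

SET gtid_domain_id = 1;        # domain D
INSERT INTO t1 VALUES (10);    # D1
INSERT INTO t1 VALUES (11);    # D2
SET gtid_domain_id = 2;        # domain E
INSERT INTO t1 VALUES (12);    # E1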

In this case, the replication SQL thread will stop with the current GTID
position "D1,E1". The IO thread, however, keeps running.

Now suppose the SQL thread is restarted without first stopping the IO
thread. In this case, the SQL thread needs to continue from the point it had
reached in the relay log (in contrast, if the IO thread was also stopped first, then
the relay log will be deleted, and everything fetched anew from the master
starting at GTID position "D1,E1").
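
Restarting only the SQL thread, keeping the relay log and the running IO
thread, is a single command:

START SLAVE SQL_THREAD;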

In this situation, it is necessary for the restarted SQL thread to start at
the correct position in the relay log corresponding to the GTID position
"D1,E1". Thus, the domain D must start at an earlier point than the domain
E. The SQL driver thread must start fetching at the start of D1, but then skip
E1 as it has already been executed.

Unfortunately, there is currently no code to handle this situation correctly.
What happens instead is that the SQL thread restarts from just after the latest
event that was replicated, losing any transactions in other replication domains
that occur earlier in the relay log and were not yet replicated due to parallel
replication. In the example, the restart begins after E1, losing events D1 and
D2.

So this is a rather serious problem.

A workaround is to stop the IO thread before restarting the SQL thread after
an error. This causes the relay logs to be purged and the events to be fetched
anew from the master - and this code path correctly starts each replication
domain at the right position. The workaround can be applied manually by the
user (as sketched below), or we could implement it as a temporary measure
until the proper fix can be made.
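
Applied manually, the workaround looks like this (a sketch; it assumes the
cause of the original error has already been resolved):

# The SQL thread is already stopped due to the error; stop the IO thread too.
STOP SLAVE IO_THREAD;
# Starting both threads now purges the relay log and re-fetches from the
# master, which starts each replication domain at the correct position.
START SLAVE;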

To fix this, I think we need to implement proper GTID position search in the
relay log. Thus, when the SQL thread starts in GTID mode, it needs to find the
start position in the relay log based on the current GTID position, the same
way as happens on the master when a slave's IO thread connects over the network
and requests events from a particular GTID position.
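
This master-side search is what already runs whenever a slave connects in GTID
mode, for example after setting the position explicitly (a sketch with
illustrative position values; both slave threads must be stopped before
gtid_slave_pos can be set):

SET GLOBAL gtid_slave_pos = '1-1-10,2-1-20';
CHANGE MASTER TO master_use_gtid = slave_pos;
# On START SLAVE, the master scans its binlogs to locate each domain's
# position and streams events from that point onward.
START SLAVE;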

[This will also prepare the way for later preserving the relay log on the
slave across a server restart, so that we do not need to re-fetch events that
were already fetched but not yet executed. That, however, will additionally
require crash recovery for the relay log.]

Here is a test case for the bug:

--source include/have_innodb.inc
--let $rpl_topology=1->2
--source include/rpl_init.inc
 
--connection server_2
--source include/stop_slave.inc
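# Use GTID mode; the bug requires the slave to track its position as a GTID.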
CHANGE MASTER TO master_use_gtid=slave_pos;
--source include/start_slave.inc
 
 
--echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***
 
--connection server_1
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);
--source include/save_master_gtid.inc
 
--connection server_2
--source include/sync_with_master_gtid.inc
SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=10;
 
--connection server_1
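# Create events in two separate replication domains, so that domain 2 can
# run ahead of domain 1 on the slave.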
SET @old_dom= @@gtid_domain_id;
SET gtid_domain_id= 1;
INSERT INTO t1 VALUES (2);
INSERT INTO t1 VALUES (3);
SET gtid_domain_id= 2;
INSERT INTO t1 VALUES (4);
INSERT INTO t1 VALUES (5);
SET gtid_domain_id= 1;
INSERT INTO t1 VALUES (6);
SET gtid_domain_id= @old_dom;
 
--source include/save_master_gtid.inc
 
# Set a local row conflict to cause domain_id=1 to have to wait.
--connect (s2,127.0.0.1,root,,test,$SERVER_MYPORT_2,)
--connection s2
SET sql_log_bin= 0;
BEGIN;
INSERT INTO t1 VALUES (2);
 
--connection server_2
--source include/start_slave.inc
 
# Wait for domain_id=2 to have progressed further into the relay log than
# domain_id=1 has.
--let $wait_condition= SELECT COUNT(*) = 1 FROM t1 WHERE a=5
--source include/wait_condition.inc
 
# Now force a duplicate key error for domain_id=1.
--connection s2
COMMIT;
SET sql_log_bin= 1;
 
--connection server_2
--let $slave_sql_errno= 1062
--source include/wait_for_slave_sql_error.inc
 
# Now resolve the error, and restart the SQL thread.
# The bug was that it would start at the point up to which domain_id=2 was
# replicated, skipping the earlier events from domain_id=1 that were not yet
# replicated due to the duplicate key error.
SET sql_log_bin= 0;
DELETE FROM t1 WHERE a=2;
SET sql_log_bin= 1;
 
--source include/start_slave.inc
--source include/sync_with_master_gtid.inc
 
# Check that everything was replicated, without losing any rows.
SELECT * FROM t1 ORDER BY a;
 
 
# Clean up.
--connection server_2
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=@old_parallel_threads;
--source include/start_slave.inc
 
--connection server_1
DROP TABLE t1;
 
--source include/rpl_end.inc



 Comments   
Comment by Sergei Golubchik [ 2014-10-13 ]

After discussion with Kristian on IRC, we concluded that a proper fix would be to generalize the correct GTID search code (which currently only works when fetching binlogs from the master) to also work when reading from relay logs. But as that sounds like a pretty complex and risky task, it can only be done in 10.1 at the earliest. So in 10.0 we could use a workaround of automatically restarting the IO thread, as explained above.

Comment by Kristian Nielsen [ 2015-02-18 ]

Hm, I managed to come up with a patch that is not too intrusive:

http://lists.askmonty.org/pipermail/commits/2015-February/007436.html

It turns out that most of the complexity of the GTID search code on the master
side is not needed in the SQL thread. All that should be needed is to go back
in the relay logs and skip any GTIDs that were already applied. (We do have a
different kind of slave-side complexity, though, due to the strange code for
handling relay logs.)

I think this patch is suitable for both 10.0 and 10.1. It doesn't really
change things much except in the case where parallel replication is used in
GTID mode, multiple GTID domains are active, and we stop and restart the SQL
thread without also stopping the IO thread. And this is exactly the case that
is currently completely broken in 10.0, so things can only get better.

And it will be nice to have the same code in both 10.0 and 10.1 for this
complex part of parallel replication, so as not to have to support and debug
different kinds of problems in the two versions.

This patch fixes just the bug. It could be extended later (10.2?) to implement
MDEV-4698 (preserve relay logs in GTID mode also when restarting both the IO
and SQL threads), for which there has been some interest. This might clean up
relay log handling somewhat, and this patch provides part of what is needed
(but not all - for example, relay log crash recovery will also be needed).

Comment by Kristian Nielsen [ 2015-02-18 ]

Monty: As discussed on the phone, please review the linked patch.
