[MDEV-4275] GTID: If slave was set up with master_gtid_pos=auto, IO thread restart makes it start from the beginning of the binlog Created: 2013-03-15  Updated: 2013-03-21  Resolved: 2013-03-21

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: replication

Issue Links:
Relates
relates to MDEV-26 Global transaction ID Closed

 Description   

I run CHANGE MASTER ... master_gtid_pos=auto;
start slave;
execute some statements on master;
wait till slave catches up with the master;
stop slave (both threads or IO only);
start slave again

=> the slave attempts to re-execute previous statements.

revision-id: knielsen@knielsen-hq.org-20130311151655-yc1i3z72v6c00pfz
revno: 3468
branch-nick: 10.0-mdev26

Test case:

--source include/master-slave.inc
 
--connection slave
STOP SLAVE;
--source include/wait_for_slave_to_stop.inc
RESET SLAVE ALL;
 
--connection master
RESET MASTER;
 
--connection slave
eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$MASTER_MYPORT, master_user='root', master_gtid_pos=auto;
START SLAVE;
--source include/wait_for_slave_to_start.inc
 
--connection master
CREATE TABLE t1 (i INT);
INSERT INTO t1 VALUES (1);
 
--sync_slave_with_master
STOP SLAVE IO_THREAD;
--source include/wait_for_slave_io_to_stop.inc
START SLAVE IO_THREAD;
--source include/wait_for_slave_io_to_start.inc
--sync_with_master

Result:

=== SHOW SLAVE STATUS ===
---- 1. ----
Slave_IO_State	Waiting for master to send event
Master_Host	127.0.0.1
Master_User	root
Master_Port	16000
Connect_Retry	1
Master_Log_File	master-bin.000001
Read_Master_Log_Pos	311
Relay_Log_File	slave-relay-bin.000002
Relay_Log_Pos	599
Relay_Master_Log_File	master-bin.000001
Slave_IO_Running	Yes
Slave_SQL_Running	No
Replicate_Do_DB	
Replicate_Ignore_DB	
Replicate_Do_Table	
Replicate_Ignore_Table	
Replicate_Wild_Do_Table	
Replicate_Wild_Ignore_Table	
Last_Errno	1050
Last_Error	Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'CREATE TABLE t1 (i INT)'
Skip_Counter	0
Exec_Master_Log_Pos	311
Relay_Log_Space	1863
Until_Condition	None
Until_Log_File	
Until_Log_Pos	0
Master_SSL_Allowed	No
Master_SSL_CA_File	
Master_SSL_CA_Path	
Master_SSL_Cert	
Master_SSL_Cipher	
Master_SSL_Key	
Seconds_Behind_Master	
Master_SSL_Verify_Server_Cert	No
Last_IO_Errno	0
Last_IO_Error	
Last_SQL_Errno	1050
Last_SQL_Error	Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'CREATE TABLE t1 (i INT)'
Replicate_Ignore_Server_Ids	
Master_Server_Id	1
Using_Gtid	1
=========================



 Comments   
Comment by Kristian Nielsen [ 2013-03-18 ]

Right, this is an important issue, thanks for catching.

The underlying issue here is that when the IO thread connects (or re-connects), it needs to request a position
by GTID, and that position is determined by what the SQL thread last executed, not by what the IO thread last fetched.

So there are several ways the slave can fetch again something that the SQL thread is in the middle of executing, and similar races. My current code does not handle this at all. It can be especially tricky because the SQL thread may be running while the IO thread loses the connection to the master and needs to automatically reconnect.

I think I need to make the SQL thread remember what it has executed, so that it can skip events that get duplicate-fetched into the relay logs. This is not too hard; it only needs to be done in memory. Whenever the slave server is restarted or CHANGE MASTER is executed, we can just drop the existing relay logs (which we need to do anyway).

Still, it needs to be done carefully to handle all cases properly.
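
The in-memory skip idea described in this comment can be illustrated with a toy Python sketch. It is not server code: the function name, the (gtid, statement) tuples, the GTID strings, and the extra INSERT of value 2 are all made up for illustration, and a real implementation would also have to deal with the races mentioned above.

```python
def apply_relay_log(relay_log, executed_gtids, run):
    """Apply relay log events, skipping any GTID the SQL thread already executed.

    executed_gtids is a plain in-memory set (hypothetical); it is assumed to be
    discarded together with the relay logs on restart or CHANGE MASTER.
    """
    for gtid, statement in relay_log:
        if gtid in executed_gtids:
            continue  # duplicate-fetched event: already executed, so skip it
        run(statement)
        executed_gtids.add(gtid)


executed = set()
ran = []
# The first two events were fetched twice, e.g. after an IO thread reconnect.
relay_log = [
    ("0-1-1", "CREATE TABLE t1 (i INT)"),
    ("0-1-2", "INSERT INTO t1 VALUES (1)"),
    ("0-1-1", "CREATE TABLE t1 (i INT)"),
    ("0-1-2", "INSERT INTO t1 VALUES (1)"),
    ("0-1-3", "INSERT INTO t1 VALUES (2)"),
]
apply_relay_log(relay_log, executed, ran.append)
```

In this sketch each statement ends up in `ran` exactly once, even though two events appear twice in the relay log.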

Comment by Kristian Nielsen [ 2013-03-21 ]

I found a better approach to fix this.

The first connect of the I/O thread after CHANGE MASTER or restart removes any
old relay logs and connects using GTID. Then subsequent reconnects use the
position of the last event fetched into relay logs rather than GTID. This
avoids any complications with duplicate events in the relay logs, and is
anyway both more correct and more robust.
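
As a rough illustration of the reconnect policy described above, here is a toy Python model. It is only a sketch under stated assumptions: the `ToySlave` class, its fields, the list-index "positions", and the GTID strings are all hypothetical and do not correspond to the actual server code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToySlave:
    relay_log: list = field(default_factory=list)  # (gtid, statement) pairs
    applied: list = field(default_factory=list)    # statements run by the SQL thread
    next_to_apply: int = 0                         # SQL thread's index into relay_log
    last_applied_gtid: Optional[str] = None        # advanced by the SQL thread
    fetched_pos: int = 0                           # IO thread's offset into the master binlog
    first_connect: bool = True                     # True after CHANGE MASTER or restart

    def io_connect(self, binlog):
        """(Re)connect the IO thread and fetch everything currently available."""
        if self.first_connect:
            # First connect: drop old relay logs and ask the master by GTID.
            self.relay_log.clear()
            self.next_to_apply = 0
            gtids = [g for g, _ in binlog]
            if self.last_applied_gtid is None:
                start = 0
            else:
                start = gtids.index(self.last_applied_gtid) + 1
            self.first_connect = False
        else:
            # Reconnect: resume right after the last event already fetched.
            start = self.fetched_pos
        self.relay_log.extend(binlog[start:])
        self.fetched_pos = len(binlog)

    def sql_apply(self, max_events=None):
        """Apply pending relay log events, optionally only the first max_events."""
        pending = self.relay_log[self.next_to_apply:]
        if max_events is not None:
            pending = pending[:max_events]
        for gtid, statement in pending:
            self.applied.append(statement)
            self.last_applied_gtid = gtid
            self.next_to_apply += 1


binlog = [
    ("0-1-1", "CREATE TABLE t1 (i INT)"),
    ("0-1-2", "INSERT INTO t1 VALUES (1)"),
]
slave = ToySlave()
slave.io_connect(binlog)       # first connect after CHANGE MASTER: by GTID
slave.sql_apply(max_events=1)  # SQL thread is lagging behind the IO thread
slave.io_connect(binlog)       # IO thread restart: by fetched position
slave.sql_apply()
```

In this model the second `io_connect` adds nothing to the relay log, because it resumes from the fetched position rather than from the GTID of the last applied event, so each statement is applied once.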
