[MDEV-4275] GTID: If slave was set up with master_gtid_pos=auto, IO thread restart makes it start from the beginning of the binlog

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels:
- replication

Description

I run CHANGE MASTER ... master_gtid_pos=auto;
start slave;
execute some statements on the master;
wait until the slave catches up with the master;
stop slave (both threads, or the IO thread only);
start slave again.

=> The slave attempts to re-execute the statements it has already applied.

revision-id: knielsen@knielsen-hq.org-20130311151655-yc1i3z72v6c00pfz

revno: 3468

branch-nick: 10.0-mdev26

Test case:

--source include/master-slave.inc

# Reset replication on both servers so the slave starts from a clean state.
--connection slave
STOP SLAVE;
--source include/wait_for_slave_to_stop.inc
RESET SLAVE ALL;

--connection master
RESET MASTER;

# Reconfigure the slave to connect by GTID.
--connection slave
eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$MASTER_MYPORT, master_user='root', master_gtid_pos=auto;
START SLAVE;
--source include/wait_for_slave_to_start.inc

# Generate some events on the master and let the slave apply them.
--connection master
CREATE TABLE t1 (i INT);
INSERT INTO t1 VALUES (1);
--sync_slave_with_master

# Restart only the IO thread; there should be nothing new to fetch.
STOP SLAVE IO_THREAD;
--source include/wait_for_slave_io_to_stop.inc
START SLAVE IO_THREAD;
--source include/wait_for_slave_io_to_start.inc
--sync_with_master

Result:

=== SHOW SLAVE STATUS ===

---- 1. ----

Slave_IO_State	Waiting for master to send event

Master_Host	127.0.0.1

Master_User	root

Master_Port	16000

Connect_Retry	1

Master_Log_File	master-bin.000001

Read_Master_Log_Pos	311

Relay_Log_File	slave-relay-bin.000002

Relay_Log_Pos	599

Relay_Master_Log_File	master-bin.000001

Slave_IO_Running	Yes

Slave_SQL_Running	No

Replicate_Do_DB

Replicate_Ignore_DB

Replicate_Do_Table

Replicate_Ignore_Table

Replicate_Wild_Do_Table

Replicate_Wild_Ignore_Table

Last_Errno	1050

Last_Error	Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'CREATE TABLE t1 (i INT)'

Skip_Counter	0

Exec_Master_Log_Pos	311

Relay_Log_Space	1863

Until_Condition	None

Until_Log_File

Until_Log_Pos	0

Master_SSL_Allowed	No

Master_SSL_CA_File

Master_SSL_CA_Path

Master_SSL_Cert

Master_SSL_Cipher

Master_SSL_Key

Seconds_Behind_Master

Master_SSL_Verify_Server_Cert	No

Last_IO_Errno	0

Last_IO_Error

Last_SQL_Errno	1050

Last_SQL_Error	Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'CREATE TABLE t1 (i INT)'

Replicate_Ignore_Server_Ids

Master_Server_Id	1

Using_Gtid	1

=========================

Issue Links

relates to MDEV-26 Global transaction ID (Closed)

Activity

Kristian Nielsen added a comment - 2013-03-18 14:39

Right, this is an important issue, thanks for catching.

The underlying issue here is that when IO thread connects (or re-connects), it needs to request position
by GTID, which is related to what the SQL thread has last executed, not to what the IO thread last fetched.

So there are several possibilities for re-fetching something that the SQL thread is in the middle of executing, and for similar races. My current code does not handle this at all. It is especially tricky because the SQL thread may be running while the IO thread loses the connection to the master and needs to reconnect automatically.

I think I need to make it so that the SQL thread remembers what it executed, so that it can skip stuff that gets duplicate-fetched into relay logs. This is not too hard, it only needs to be done in-memory. Whenever slave server is restarted or CHANGE MASTER is executed, we can just drop existing relay logs (which we need to do anyway).

Still, needs to be done carefully to handle all cases properly.
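
To illustrate the idea of the SQL thread remembering what it has executed, here is a minimal Python toy model. It is only a sketch, not the server code: the names `Gtid`, `SqlThread`, and `replay` are hypothetical, and it assumes that sequence numbers only increase within a replication domain, so that remembering the highest applied sequence number per domain is enough to detect duplicates.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Gtid:
    """Hypothetical GTID triple: domain, originating server, sequence number."""
    domain_id: int
    server_id: int
    seq_no: int


@dataclass
class SqlThread:
    """Toy SQL thread that keeps its executed position in memory only."""
    last_applied: dict = field(default_factory=dict)  # domain_id -> highest seq_no
    applied: list = field(default_factory=list)       # statements actually run

    def apply(self, gtid, statement):
        """Apply an event group unless its GTID was already executed."""
        # Assumption: seq_no grows monotonically within a domain.
        if gtid.seq_no <= self.last_applied.get(gtid.domain_id, 0):
            return False  # duplicate that was fetched again into the relay log
        self.applied.append(statement)
        self.last_applied[gtid.domain_id] = gtid.seq_no
        return True


def replay(sql_thread, relay_log):
    """Feed (gtid, statement) pairs to the SQL thread; report which ones ran."""
    return [sql_thread.apply(gtid, stmt) for gtid, stmt in relay_log]
```

With the two statements from the test case fetched twice into the relay log, the second copy of each would be skipped instead of failing with error 1050. Because the state lives only in memory, it is lost on a server restart, which is why the comment above pairs this approach with dropping the existing relay logs.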

Kristian Nielsen added a comment - 2013-03-21 12:06

I found a better approach to fix this.

The first connect of the I/O thread after CHANGE MASTER or restart removes any
old relay logs and connects using GTID. Then subsequent reconnects use the
position of the last event fetched into relay logs rather than GTID. This
avoids any complications with duplicate events in the relay logs, and is
anyway both more correct and more robust.
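
The connect logic described above can be sketched as a small Python toy model. This is an illustration under stated assumptions, not the actual patch: `IoThread`, `connect_request`, and `on_event` are hypothetical names, and the fallback to GTID when nothing has been fetched yet is an assumption of this sketch rather than something stated in the issue.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class IoThread:
    """Toy IO thread: GTID on the first connect, file/offset on reconnects."""
    slave_gtid_pos: str   # position derived from what the SQL thread executed
    relay_logs: list      # events fetched from the master so far
    last_fetched: Optional[Tuple[str, int]] = None  # (master binlog file, offset)
    first_connect: bool = True  # set again by CHANGE MASTER or a server restart

    def connect_request(self):
        """Return the kind of position to request from the master."""
        if self.first_connect:
            # Old relay logs may overlap with what GTID makes us fetch again,
            # so drop them before connecting by GTID.
            self.relay_logs.clear()
            self.last_fetched = None
            self.first_connect = False
        if self.last_fetched is None:
            # Assumption: nothing fetched yet, so GTID is the only position.
            return ("gtid", self.slave_gtid_pos)
        # Reconnect: continue right after the last event in the relay logs.
        return ("file_pos", self.last_fetched)

    def on_event(self, binlog_file, end_pos, event):
        """Record a fetched event and the master position just after it."""
        self.relay_logs.append(event)
        self.last_fetched = (binlog_file, end_pos)
```

In this model, restarting only the IO thread after it has fetched up to master-bin.000001 offset 311 resumes from that offset, so the events already in the relay log are not fetched a second time.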

People

Assignee: Kristian Nielsen

Reporter: Elena Stepanova

Votes: 0

Watchers: 2

Dates

Created: 2013-03-15 00:45

Updated: 2013-03-21 12:06

Resolved: 2013-03-21 12:06

MariaDB Server