  1. MariaDB Server
  2. MDEV-4275

GTID: If slave was set up with master_gtid_pos=auto, IO thread restart makes it start from the beginning of the binlog

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed

    Description

      I run CHANGE MASTER ... master_gtid_pos=auto;
      start slave;
      execute some statements on master;
      wait till slave catches up with the master;
      stop slave (both threads or IO only);
      start slave again

      => the slave attempts to re-execute previous statements.

      revision-id: knielsen@knielsen-hq.org-20130311151655-yc1i3z72v6c00pfz
      revno: 3468
      branch-nick: 10.0-mdev26

      Test case:

      --source include/master-slave.inc
       
      --connection slave
      STOP SLAVE;
      --source include/wait_for_slave_to_stop.inc
      RESET SLAVE ALL;
       
      --connection master
      RESET MASTER;
       
      --connection slave
      eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$MASTER_MYPORT, master_user='root', master_gtid_pos=auto;
      START SLAVE;
      --source include/wait_for_slave_to_start.inc
       
      --connection master
      CREATE TABLE t1 (i INT);
      INSERT INTO t1 VALUES (1);
       
      --sync_slave_with_master
      STOP SLAVE IO_THREAD;
      --source include/wait_for_slave_io_to_stop.inc
      START SLAVE IO_THREAD;
      --source include/wait_for_slave_io_to_start.inc
      --sync_with_master

      Result:

      === SHOW SLAVE STATUS ===
      ---- 1. ----
      Slave_IO_State	Waiting for master to send event
      Master_Host	127.0.0.1
      Master_User	root
      Master_Port	16000
      Connect_Retry	1
      Master_Log_File	master-bin.000001
      Read_Master_Log_Pos	311
      Relay_Log_File	slave-relay-bin.000002
      Relay_Log_Pos	599
      Relay_Master_Log_File	master-bin.000001
      Slave_IO_Running	Yes
      Slave_SQL_Running	No
      Replicate_Do_DB	
      Replicate_Ignore_DB	
      Replicate_Do_Table	
      Replicate_Ignore_Table	
      Replicate_Wild_Do_Table	
      Replicate_Wild_Ignore_Table	
      Last_Errno	1050
      Last_Error	Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'CREATE TABLE t1 (i INT)'
      Skip_Counter	0
      Exec_Master_Log_Pos	311
      Relay_Log_Space	1863
      Until_Condition	None
      Until_Log_File	
      Until_Log_Pos	0
      Master_SSL_Allowed	No
      Master_SSL_CA_File	
      Master_SSL_CA_Path	
      Master_SSL_Cert	
      Master_SSL_Cipher	
      Master_SSL_Key	
      Seconds_Behind_Master	
      Master_SSL_Verify_Server_Cert	No
      Last_IO_Errno	0
      Last_IO_Error	
      Last_SQL_Errno	1050
      Last_SQL_Error	Error 'Table 't1' already exists' on query. Default database: 'test'. Query: 'CREATE TABLE t1 (i INT)'
      Replicate_Ignore_Server_Ids	
      Master_Server_Id	1
      Using_Gtid	1
      =========================

          Activity

            Right, this is an important issue, thanks for catching.

            The underlying issue here is that when IO thread connects (or re-connects), it needs to request position
            by GTID, which is related to what the SQL thread has last executed, not to what the IO thread last fetched.

            So there are several possibilities for fetching again something that the SQL thread is in the middle of executing, or similar races. My current code does not handle this at all. It can be especially tricky as the SQL thread may be running while the IO thread loses the connection to the master and needs to automatically reconnect.
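The race can be illustrated with a toy model. This is only a sketch, not MariaDB code: the names `binlog`, `relay_log`, and `reconnect_by_executed_gtid` are hypothetical, and GTIDs are reduced to plain increasing integers. It assumes the IO thread resumes right after the last event the SQL thread executed, and shows how events that are already in the relay log get fetched a second time.

```python
# Toy model (not MariaDB code): GTIDs are reduced to increasing integers.
binlog = [1, 2, 3, 4, 5]          # events available on the master


def reconnect_by_executed_gtid(relay_log, last_executed):
    """Re-fetch from the master starting after the last *executed* event.

    Events that were fetched but not yet executed are already in the
    relay log, so they end up appended a second time.
    """
    refetched = [g for g in binlog if g > last_executed]
    return relay_log + refetched


# The IO thread fetched events 1..4, but the SQL thread only executed 1..2.
relay_log = [1, 2, 3, 4]
last_executed = 2

relay_log = reconnect_by_executed_gtid(relay_log, last_executed)
duplicates = sorted({g for g in relay_log if relay_log.count(g) > 1})
print(relay_log)    # [1, 2, 3, 4, 3, 4, 5]
print(duplicates)   # [3, 4]
```

In this model, events 3 and 4 appear twice in the relay log, which is the kind of duplicate the SQL thread would then try to execute again.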

            I think I need to make it so that the SQL thread remembers what it executed, so that it can skip stuff that gets duplicate-fetched into relay logs. This is not too hard; it only needs to be done in memory. Whenever the slave server is restarted or CHANGE MASTER is executed, we can just drop existing relay logs (which we need to do anyway).
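A minimal sketch of that idea, under stated assumptions. It is not the server's implementation: the class `ExecutedGtidFilter` and its methods are hypothetical, a GTID is modelled as a `(domain_id, server_id, seq_no)` tuple, and the sketch assumes `seq_no` starts at 1 and only increases within a domain. The `reset` step is also an assumption about how the in-memory state would be discarded together with the relay logs.

```python
# Hypothetical sketch, not the server's implementation.
# A GTID is modelled as (domain_id, server_id, seq_no); the sketch assumes
# seq_no starts at 1 and only increases within a replication domain.
class ExecutedGtidFilter:
    def __init__(self):
        # In-memory only: last executed seq_no per domain.
        self.last_seq_no = {}

    def should_apply(self, gtid):
        domain_id, _server_id, seq_no = gtid
        if seq_no <= self.last_seq_no.get(domain_id, 0):
            return False            # duplicate-fetched event: skip it
        self.last_seq_no[domain_id] = seq_no
        return True

    def reset(self):
        # Assumption: on server restart or CHANGE MASTER the relay logs are
        # dropped, so the remembered state is dropped as well.
        self.last_seq_no.clear()


# Relay log in which events 1 and 2 were fetched twice.
relay_log = [(0, 1, 1), (0, 1, 2), (0, 1, 1), (0, 1, 2), (0, 1, 3)]
gtid_filter = ExecutedGtidFilter()
applied = [g for g in relay_log if gtid_filter.should_apply(g)]
print(applied)  # [(0, 1, 1), (0, 1, 2), (0, 1, 3)]
```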

            Still, it needs to be done carefully to handle all cases properly.

            knielsen Kristian Nielsen added the comment above.

            I found a better approach to fix this.

            The first connect of the I/O thread after CHANGE MASTER or restart removes any
            old relay logs and connects using GTID. Then subsequent reconnects use the
            position of the last event fetched into relay logs rather than GTID. This
            avoids any complications with duplicate events in the relay logs, and is
            anyway both more correct and more robust.
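The connect logic described above could be sketched roughly as follows. This is an illustrative model only, not the actual fix: the class `IoThread`, its methods, and the `first_connect` flag are hypothetical names, and the GTID value `"0-1-2"` is a made-up example. The fallback for an empty relay log is an assumption the comment does not cover.

```python
# Hypothetical sketch of the connect logic; not server code.
class IoThread:
    def __init__(self, gtid_pos):
        self.gtid_pos = gtid_pos      # slave's GTID position
        self.relay_log = []           # (binlog_file, end_offset) per fetched event
        self.first_connect = True     # set again by CHANGE MASTER or restart

    def connect_request(self):
        """Return the start position to request from the master."""
        if self.first_connect:
            self.relay_log.clear()    # remove any old relay logs
            self.first_connect = False
            return ("gtid", self.gtid_pos)
        if not self.relay_log:
            # Assumption: nothing fetched yet, so GTID is still the start point.
            return ("gtid", self.gtid_pos)
        # Reconnect: continue after the last event fetched into the relay logs.
        return ("file_offset", self.relay_log[-1])

    def fetch(self, binlog_file, end_offset):
        self.relay_log.append((binlog_file, end_offset))

    def change_master(self, gtid_pos):
        self.gtid_pos = gtid_pos
        self.first_connect = True


io = IoThread("0-1-2")
first = io.connect_request()
io.fetch("master-bin.000001", 311)
second = io.connect_request()
print(first)    # ('gtid', '0-1-2')
print(second)   # ('file_offset', ('master-bin.000001', 311))
```

Because a reconnect resumes exactly where fetching stopped, the relay logs in this model never receive the same event twice, so no duplicate filtering is needed on the SQL thread side.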

            knielsen Kristian Nielsen added the comment above.

            People

              knielsen Kristian Nielsen
              elenst Elena Stepanova