Details
-
Bug
-
Status: Confirmed
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
If the slave uses GTID (using_gtid > 0) and we execute STOP SLAVE followed by START SLAVE, the slave I/O thread restarts from the position where the slave SQL thread stopped, not from where the I/O thread itself had read to. On a delayed replica, or when the master is down, the slave therefore loses the binlog events it had already downloaded into its relay logs.
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb1-bin.000048
Read_Master_Log_Pos: 527
Relay_Log_File: mariadb2-relay-bin.000002
Relay_Log_Pos: 558
Relay_Master_Log_File: mariadb1-bin.000046
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 256
Relay_Log_Space: 3451
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 47
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1000
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 1-1000-37
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 300
SQL_Remaining_Delay: 253
Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
Slave_DDL_Groups: 26
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 0
1 row in set (0.002 sec)

MariaDB [(none)]> stop slave ;
Query OK, 0 rows affected (2.027 sec)

MariaDB [(none)]> start slave;
Query OK, 0 rows affected (0.026 sec)

show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Connecting to master
Master_Host: 172.20.0.2
Master_User: repl_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mariadb1-bin.000046
Read_Master_Log_Pos: 256
Relay_Log_File: mariadb2-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File: mariadb1-bin.000046
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 256
Relay_Log_Space: 256
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1000
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: Slave_Pos
Gtid_IO_Pos: 1-1000-35
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: optimistic
SQL_Delay: 300
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Slave_DDL_Groups: 26
Slave_Non_Transactional_Groups: 0
Slave_Transactional_Groups: 0

2024-03-10 6:57:18 290 [Note] Slave: received end packet from server, apparent master shutdown:
2024-03-10 6:57:18 290 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb1-bin.000049' at position 527; GTID position '1-1000-38'
2024-03-10 6:57:18 290 [ERROR] Slave I/O: error reconnecting to master 'repl_user@172.20.0.2:3306' - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on '172.20.0.2' (111 "Connection refused"), Internal MariaDB error code: 2003
2024-03-10 6:57:35 291 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb1-bin.000046' at position 256; GTID position '1-1000-35', master: 172.20.0.2:3306
2024-03-10 6:57:35 290 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
2024-03-10 6:57:35 290 [Note] Slave I/O thread exiting, read up to log 'mariadb1-bin.000049', position 527; GTID position 1-1000-38, master 172.20.0.2:3306
2024-03-10 6:57:37 290 [Note] cannot connect to master to kill slave io_thread's connection
2024-03-10 6:57:58 304 [Note] Slave I/O thread: Start semi-sync replication to master 'repl_user@172.20.0.2:3306' in log 'mariadb1-bin.000046' at position 256
2024-03-10 6:57:58 305 [Note] Slave SQL thread initialized, starting replication in log 'mariadb1-bin.000046' at position 256, relay log './mariadb2-relay-bin.000001' position: 4; GTID position '1-1000-35'
2024-03-10 6:58:16 304 [ERROR] Slave I/O: error connecting to master 'repl_user@172.20.0.2:3306' - retry-time: 60 maximum-retries: 100000 message: Can't connect to server on '172.20.0.2' (113 "No route to host"), Internal MariaDB error code: 2003

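To make the rewind in the output above easier to spot, here is a minimal Python sketch that compares the Gtid_IO_Pos values from two SHOW SLAVE STATUS snapshots. The helper names parse_gtid_pos and rewound_domains are made up for illustration and are not part of MariaDB or any client library. The sketch assumes the usual domain-server-sequence GTID format and treats the sequence number difference only as a rough indicator, since sequence numbers within a domain are not guaranteed to be contiguous or in order.

```python
def parse_gtid_pos(pos: str) -> dict[int, tuple[int, int]]:
    """Parse a GTID position such as '1-1000-37' (or a comma-separated list
    with one GTID per domain) into {domain_id: (server_id, seq_no)}."""
    result = {}
    for gtid in filter(None, (part.strip() for part in pos.split(","))):
        domain_id, server_id, seq_no = (int(x) for x in gtid.split("-"))
        result[domain_id] = (server_id, seq_no)
    return result


def rewound_domains(before: str, after: str) -> dict[int, int]:
    """Return {domain_id: seq_no difference} for every domain whose I/O
    position moved backwards between the two snapshots.
    Hypothetical helper: domains missing from 'after' are ignored."""
    old, new = parse_gtid_pos(before), parse_gtid_pos(after)
    rewound = {}
    for domain_id, (_, old_seq) in old.items():
        if domain_id in new and new[domain_id][1] < old_seq:
            rewound[domain_id] = old_seq - new[domain_id][1]
    return rewound


# Gtid_IO_Pos before and after STOP SLAVE / START SLAVE, copied from the
# two status outputs in this report.
print(rewound_domains("1-1000-37", "1-1000-35"))  # prints {1: 2}
```

With the values from this report, domain 1 moves back from sequence number 37 to 35. The error log shows the I/O thread had actually read up to 1-1000-38 before it exited, so passing "1-1000-38" as the first argument gives a difference of 3.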
This is by design / a known limitation of GTID replication. When both the SQL and I/O threads are restarted, the relay logs are deleted and fetched anew from the master.
It would be good to remove this limitation and preserve the relay logs on the slave when possible.
The implementation will need to very carefully consider and handle the different cases that can arise around reconnect, including multiple domains, out-of-order GTID sequence numbers, configuration changes (e.g. replication filters) during restart, DNS changes causing the reconnect to reach a different server, etc.