  MariaDB Server / MDEV-33645

Stop and Start slave reset the Master_info_file

Details

    • Type: Bug
    • Status: Confirmed
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Affects Version/s: 10.4 (EOL)
    • Component/s: Replication
    • Labels: None

    Description

      If the slave is using GTID replication (Using_Gtid is not No), then after STOP SLAVE followed by START SLAVE, the slave I/O thread restarts from the position where the slave SQL thread stopped, not from the last position it had downloaded. For a delayed replica, or when the master is down, this means the slave loses the binlogs it has already downloaded.
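
      For context, a delayed GTID replica like the one in the status output below can be configured roughly as follows (a sketch: MASTER_USE_GTID and MASTER_DELAY are standard CHANGE MASTER options; the host and user match the log excerpt further down, the password is a placeholder):

        -- On the replica: replicate by GTID with a 300-second apply delay
        -- (SQL_Delay in the status output). The password is a placeholder.
        CHANGE MASTER TO
          MASTER_HOST = '172.20.0.2',
          MASTER_PORT = 3306,
          MASTER_USER = 'repl_user',
          MASTER_PASSWORD = '...',
          MASTER_USE_GTID = slave_pos,
          MASTER_DELAY = 300;
        START SLAVE;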

       
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb1-bin.000048
                 Read_Master_Log_Pos: 527
                      Relay_Log_File: mariadb2-relay-bin.000002
                       Relay_Log_Pos: 558
               Relay_Master_Log_File: mariadb1-bin.000046
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: 
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: 
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 0
                          Last_Error: 
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 256
                     Relay_Log_Space: 3451
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: 47
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 0
                      Last_SQL_Error: 
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 1000
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 1-1000-37
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: optimistic
                           SQL_Delay: 300
                 SQL_Remaining_Delay: 253
             Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
                    Slave_DDL_Groups: 26
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.002 sec)
       
      MariaDB [(none)]> stop slave ;
      Query OK, 0 rows affected (2.027 sec)
       
      MariaDB [(none)]> start slave;
      Query OK, 0 rows affected (0.026 sec)
       
       
      show slave status\G
      *************************** 1. row ***************************
                      Slave_IO_State: Connecting to master
                         Master_Host: 172.20.0.2
                         Master_User: repl_user
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb1-bin.000046
                 Read_Master_Log_Pos: 256
                      Relay_Log_File: mariadb2-relay-bin.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File: mariadb1-bin.000046
                    Slave_IO_Running: Connecting
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: 
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: 
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 0
                          Last_Error: 
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 256
                     Relay_Log_Space: 256
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 0
                      Last_SQL_Error: 
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 1000
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 1-1000-35
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: optimistic
                           SQL_Delay: 300
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
                    Slave_DDL_Groups: 26
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      
      

      2024-03-10  6:57:18 290 [Note] Slave: received end packet from server, apparent master shutdown: 
      2024-03-10  6:57:18 290 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb1-bin.000049' at position 527; GTID position '1-1000-38'
      2024-03-10  6:57:18 290 [ERROR] Slave I/O: error reconnecting to master 'repl_user@172.20.0.2:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '172.20.0.2' (111 "Connection refused"), Internal MariaDB error code: 2003
      2024-03-10  6:57:35 291 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb1-bin.000046' at position 256; GTID position '1-1000-35', master: 172.20.0.2:3306
      2024-03-10  6:57:35 290 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
      2024-03-10  6:57:35 290 [Note] Slave I/O thread exiting, read up to log 'mariadb1-bin.000049', position 527; GTID position 1-1000-38, master 172.20.0.2:3306
      2024-03-10  6:57:37 290 [Note] cannot connect to master to kill slave io_thread's connection
      2024-03-10  6:57:58 304 [Note] Slave I/O thread: Start semi-sync replication to master 'repl_user@172.20.0.2:3306' in log 'mariadb1-bin.000046' at position 256
      2024-03-10  6:57:58 305 [Note] Slave SQL thread initialized, starting replication in log 'mariadb1-bin.000046' at position 256, relay log './mariadb2-relay-bin.000001' position: 4; GTID position '1-1000-35'
      2024-03-10  6:58:16 304 [ERROR] Slave I/O: error connecting to master 'repl_user@172.20.0.2:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '172.20.0.2' (113 "No route to host"), Internal MariaDB error code: 2003
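
      To see the regression concretely, compare Gtid_IO_Pos around the restart (a minimal check against the replica, using the values from the output above):

        -- Before: Gtid_IO_Pos is 1-1000-37 (the relay log holds events up to here)
        SHOW SLAVE STATUS\G
        STOP SLAVE;
        START SLAVE;
        -- After: Gtid_IO_Pos is back at 1-1000-35; events 36-37 were discarded
        -- together with the relay log and must be downloaded from the master
        -- again, which is impossible while the master is down.
        SHOW SLAVE STATUS\G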
      

    Activity

            knielsen Kristian Nielsen added a comment -

            This is by design / a known limitation of GTID replication. When both the SQL and IO thread are restarted, the relay logs are deleted and fetched anew from the master.

            It will be good to remove this limitation and be able to preserve the relaylogs on the slave when possible.

            The implementation will need to very carefully consider and handle the different cases that can arise around reconnect, including multiple domains, out-of-order GTID sequence numbers, configuration changes (eg. replication filters) during restart, DNS-changes causing reconnect to reach a different server, etc...

            Elkin Andrei Elkin added a comment - edited

            > It will be good to remove this limitation and be able to preserve the relaylogs on the slave when possible.
            Indeed, knielsen.

            However, the scope could be much smaller and still quite practical.
            Say the slave is running in GTID mode:

                Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
                Using_Gtid: Slave_Pos
            

            In the following

             1 --connection slave
             2   stop slave sql_thread;
             3 
             4 --connection master
             5   /* create more gtid:s */
             6 
             7 #--connection slave
             8 --sync_slave_io_with_master
             9   stop slave io_thread;
            
            

            the GTIDs created at line 5 are now in the slave's relay log, but the next statement

            10 start slave sql_thread;
            

            removes the relay log. And that is rather cruel, because intuitively, as a user, I would
            expect the events to be processed.

            It looks to me that the relay-log resetting function should be handed over to the IO thread,
            and it would require the SQL thread to be down.

            Could you please give it some thought?


            knielsen Kristian Nielsen added a comment -

            Elkin, I don't really understand what you are asking.

            AFAIK, there is no fundamental reason the relay logs need to be deleted in GTID mode. It's just that the required logic to be able to restart the slave SQL and/or IO threads on existing relay logs is not implemented.

            In your example, both the IO and SQL threads are stopped, and we have something in the relay logs. To be able to start the SQL thread without deleting the relay logs would require the logic to find the right place to start for each domain in the @@gtid_slave_pos. There is logic to handle the corresponding problem on the master, to find the right place to start in the master's binlog for each domain requested by the connecting IO thread. Similar logic would need to be implemented on the slave side for the SQL thread to start correctly in the relay log.
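
            (For illustration: with several replication domains, @@gtid_slave_pos holds one position per domain, so this restart logic would have to locate a resume point in the relay log for each domain independently. The second domain below is made up:)

                SELECT @@gtid_slave_pos;
                -- e.g. '1-1000-35,2-2000-114' (illustrative): domain 1 is
                -- executed up to seqno 35, domain 2 up to seqno 114; each
                -- needs its own correct starting point in the relay log.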

            To preserve the relay log when starting the IO thread would require the IO thread to be able to request the correct starting position from the master - this would not be @@gtid_slave_pos, but some point corresponding to what is at the end of the current relay log.

            It also needs to be considered that the relay log is not crash safe, so the code needs to handle whatever is in the relay log correctly.
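
            (Relay-log durability can be tuned, e.g. with the sync_relay_log and sync_relay_log_info variables, but even syncing after every event does not by itself make the relay log consistent with the applied state after a crash, which is the concern above. Illustrative settings:)

                -- fsync the relay log after every event and the relay-log info
                -- file after every transaction (costly; shown only to
                -- illustrate the knobs, not as a fix for this issue).
                SET GLOBAL sync_relay_log = 1;
                SET GLOBAL sync_relay_log_info = 1;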

            The problem to be solved in this task is not the coding, it is the design, to carefully consider all relevant scenarios and decide how to handle them correctly. Ad-hoc testing will not be able to exhaustively test all required cases and avoid tricky regressions in corner cases.

            Elkin Andrei Elkin added a comment -

            Well, I did not focus in my last comment on the next steps.
            It was clear to me that the start position for the applier (line 10) would be the Relay_Log_File:Relay_Log_Pos pair from relay_log_info, as left there after the stop at line 2.
            knielsen, an applier behaviour where the relay logs remain at/after line 10 would be satisfactory to the user, and arguably natural for them too.
            Maybe we need a new slave applier option for that; with it set ON,

            10 start slave sql_thread preserve_relay_logs
            

            let's start the IO thread

            11 start slave io_thread
            

            I suggest the IO thread would just do as it normally does, which includes continuing to append to the relay logs.

            In effect, the GTID slave mode's relay-log resetting behaviour gets narrowed to slaves
            whose applier is not configured with the new option.


            knielsen Kristian Nielsen added a comment -

            You want to use relay_log_info, but you ignore my comment that this is not crash safe.
            You also ignore my comment about considering relevant scenarios.
            It makes it very hard to sensibly discuss new features :-/

            Let's not push to users yet another option preserve_relay_logs that they need to understand and consider. Let's just implement things properly and correctly and have the server do the right thing.

            The IO thread cannot "do as it normally does", as that is to fetch from the @@gtid_slave_pos on the master and append that to the relay log. But this is wrong if the relay log already contains some of those events because it was not deleted before starting IO thread.
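
            (Using the numbers from the report: the relay log already held events up to 1-1000-37, but a restarted IO thread requests from the executed state, visible on the replica as:)

                -- Executed GTID state; in GTID mode this is what a freshly
                -- started IO thread asks the master for, not the end of the
                -- existing relay log (1-1000-37 in the report).
                SELECT @@gtid_slave_pos;   -- 1-1000-35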

            As I wrote before, I'd like to see a full description of the issues to be considered and how to handle them. Preserving relay logs over slave restart is very desirable, but it needs to be done 100% correctly and always give the correct slave behaviour, in all scenarios. That is the hard part of this task, not removing the code that deletes the relay log.

            Elkin Andrei Elkin added a comment -

            I did not comment on

            > It also needs to be considered that the relay log is not crash safe

            because I thought I did not have to in the context of my 11-line example.
            Perhaps I should have made that clear, e.g. by asking why (which I am doing now, below).
            I promise to do that next time, to avoid any impression of ignoring it for no reason.
            Please don't take it the wrong way, dear Kristian. It can be difficult to discuss anything with me, not least because at work there are always questions for me to answer that nobody lets me ignore.

            So why should we care about crash recovery of the relay log, when the new option in

            start slave sql_thread preserve_relay_logs
            

            just makes the slave operate in the traditional mode? I explored the IO thread's behaviour when the option was OFF and then switched ON; that behaviour makes sense to me.
            I did not consider the opposite case, but it should be straightforward.
            Let's redraw the above sequence to focus on the effects on the IO thread:

             1 --connection slave
             2   STOP SLAVE SQL_THREAD;
             3 
             4 --connection master
             5   /* create more gtid:s */
             6 
             7 #--connection slave
             8 --sync_slave_io_with_master
             9 STOP SLAVE;
            10 START SLAVE IO_THREAD; # => reset relay-logs
            11 START SLAVE SQL_THREAD PRESERVE_RELAY_LOGS; # will not do anything special *now*
            12 STOP SLAVE IO_THREAD;
            13 START SLAVE IO_THREAD; # => won't reset relay-logs because of line 11
            

            The last line shows that a slave configured with master_use_gtid operates in the traditional
            mode when its applier thread is configured with the new option.

            It's a lightweight, though not entirely consistent, solution.
            And the value of the new option is that it resolves a real user issue.


            People

              Assignee: Elkin Andrei Elkin
              Reporter: pandi.gurusamy Pandikrishnan Gurusamy
              Votes: 0
              Watchers: 4

