  MariaDB Server / MDEV-33645

Stop and Start slave reset the Master_info_file

Details

    • Type: Bug
    • Status: Confirmed
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: None
    • Affects Version/s: 10.4 (EOL)
    • Component/s: Replication
    • Labels: None

    Description

      If the slave is using GTID replication (Using_Gtid is not No), then after STOP SLAVE followed by START SLAVE, the slave I/O thread restarts from the position where the slave SQL thread stopped, not from the last position it had downloaded. For a delayed replica, or when the master is down, this means the slave loses the binlogs it has already downloaded.
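
      For context, a delayed GTID replica like the one in the status output below can be configured roughly as follows (a sketch: MASTER_USE_GTID and MASTER_DELAY are standard CHANGE MASTER options; the host and user match the log excerpt further down, the password is a placeholder):

        -- On the replica: replicate by GTID with a 300-second apply delay
        -- (SQL_Delay in the status output). The password is a placeholder.
        CHANGE MASTER TO
          MASTER_HOST = '172.20.0.2',
          MASTER_PORT = 3306,
          MASTER_USER = 'repl_user',
          MASTER_PASSWORD = '...',
          MASTER_USE_GTID = slave_pos,
          MASTER_DELAY = 300;
        START SLAVE;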

       
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb1-bin.000048
                 Read_Master_Log_Pos: 527
                      Relay_Log_File: mariadb2-relay-bin.000002
                       Relay_Log_Pos: 558
               Relay_Master_Log_File: mariadb1-bin.000046
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: 
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: 
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 0
                          Last_Error: 
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 256
                     Relay_Log_Space: 3451
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: 47
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 0
                      Last_SQL_Error: 
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 1000
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 1-1000-37
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: optimistic
                           SQL_Delay: 300
                 SQL_Remaining_Delay: 253
             Slave_SQL_Running_State: Waiting until MASTER_DELAY seconds after master executed event
                    Slave_DDL_Groups: 26
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.002 sec)
       
      MariaDB [(none)]> stop slave ;
      Query OK, 0 rows affected (2.027 sec)
       
      MariaDB [(none)]> start slave;
      Query OK, 0 rows affected (0.026 sec)
       
       
      show slave status\G
      *************************** 1. row ***************************
                      Slave_IO_State: Connecting to master
                         Master_Host: 172.20.0.2
                         Master_User: repl_user
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb1-bin.000046
                 Read_Master_Log_Pos: 256
                      Relay_Log_File: mariadb2-relay-bin.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File: mariadb1-bin.000046
                    Slave_IO_Running: Connecting
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB: 
                 Replicate_Ignore_DB: 
                  Replicate_Do_Table: 
              Replicate_Ignore_Table: 
             Replicate_Wild_Do_Table: 
         Replicate_Wild_Ignore_Table: 
                          Last_Errno: 0
                          Last_Error: 
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 256
                     Relay_Log_Space: 256
                     Until_Condition: None
                      Until_Log_File: 
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File: 
                  Master_SSL_CA_Path: 
                     Master_SSL_Cert: 
                   Master_SSL_Cipher: 
                      Master_SSL_Key: 
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error: 
                      Last_SQL_Errno: 0
                      Last_SQL_Error: 
         Replicate_Ignore_Server_Ids: 
                    Master_Server_Id: 1000
                      Master_SSL_Crl: 
                  Master_SSL_Crlpath: 
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 1-1000-35
             Replicate_Do_Domain_Ids: 
         Replicate_Ignore_Domain_Ids: 
                       Parallel_Mode: optimistic
                           SQL_Delay: 300
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
                    Slave_DDL_Groups: 26
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      
      

      2024-03-10  6:57:18 290 [Note] Slave: received end packet from server, apparent master shutdown: 
      2024-03-10  6:57:18 290 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'mariadb1-bin.000049' at position 527; GTID position '1-1000-38'
      2024-03-10  6:57:18 290 [ERROR] Slave I/O: error reconnecting to master 'repl_user@172.20.0.2:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '172.20.0.2' (111 "Connection refused"), Internal MariaDB error code: 2003
      2024-03-10  6:57:35 291 [Note] Slave SQL thread exiting, replication stopped in log 'mariadb1-bin.000046' at position 256; GTID position '1-1000-35', master: 172.20.0.2:3306
      2024-03-10  6:57:35 290 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
      2024-03-10  6:57:35 290 [Note] Slave I/O thread exiting, read up to log 'mariadb1-bin.000049', position 527; GTID position 1-1000-38, master 172.20.0.2:3306
      2024-03-10  6:57:37 290 [Note] cannot connect to master to kill slave io_thread's connection
      2024-03-10  6:57:58 304 [Note] Slave I/O thread: Start semi-sync replication to master 'repl_user@172.20.0.2:3306' in log 'mariadb1-bin.000046' at position 256
      2024-03-10  6:57:58 305 [Note] Slave SQL thread initialized, starting replication in log 'mariadb1-bin.000046' at position 256, relay log './mariadb2-relay-bin.000001' position: 4; GTID position '1-1000-35'
      2024-03-10  6:58:16 304 [ERROR] Slave I/O: error connecting to master 'repl_user@172.20.0.2:3306' - retry-time: 60  maximum-retries: 100000  message: Can't connect to server on '172.20.0.2' (113 "No route to host"), Internal MariaDB error code: 2003
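
      To see the regression concretely, compare Gtid_IO_Pos around the restart (a minimal check against the replica, using the values from the output above):

        -- Before: Gtid_IO_Pos is 1-1000-37 (the relay log holds events up to here)
        SHOW SLAVE STATUS\G
        STOP SLAVE;
        START SLAVE;
        -- After: Gtid_IO_Pos is back at 1-1000-35; events 36-37 were discarded
        -- together with the relay log and must be downloaded from the master
        -- again, which is impossible while the master is down.
        SHOW SLAVE STATUS\G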
      

    Activity

            knielsen Kristian Nielsen added a comment -

            This is by design / a known limitation of GTID replication. When both the SQL and IO thread are restarted, the relay logs are deleted and fetched anew from the master.

            It will be good to remove this limitation and be able to preserve the relaylogs on the slave when possible.

            The implementation will need to very carefully consider and handle the different cases that can arise around reconnect, including multiple domains, out-of-order GTID sequence numbers, configuration changes (eg. replication filters) during restart, DNS-changes causing reconnect to reach a different server, etc...

            Elkin Andrei Elkin added a comment - edited

            > It will be good to remove this limitation and be able to preserve the relaylogs on the slave when possible.
            Indeed, knielsen.

            However, the scope could be much smaller and still quite practical.
            Say the slave is running in GTID mode:

                Slave_IO_Running: Yes
                Slave_SQL_Running: Yes
                Using_Gtid: Slave_Pos
            

            In the following

             1 --connection slave
             2   stop slave sql_thread;
             3 
             4 --connection master
             5   /* create more gtid:s */
             6 
             7 #--connection slave
             8 --sync_slave_io_with_master
             9   stop slave io_thread;
            
            

            the GTIDs created at line 5 are now in the slave's relay log, but the next statement

            10 start slave sql_thread;
            

            removes the relay log. And that is rather cruel, because intuitively, as a user, I would
            expect the events to be processed.

            It looks to me that the relay-log resetting function should be handed over to the IO thread,
            and it would require the SQL thread to be down.

            Could you please give it some thought?


            knielsen Kristian Nielsen added a comment -

            Elkin, I don't really understand what you are asking.

            AFAIK, there is no fundamental reason the relay logs need to be deleted in GTID mode. It's just that the required logic to be able to restart the slave SQL and/or IO threads on existing relay logs is not implemented.

            In your example, both the IO and SQL threads are stopped, and we have something in the relay logs. To be able to start the SQL thread without deleting the relay logs would require the logic to find the right place to start for each domain in the @@gtid_slave_pos. There is logic to handle the corresponding problem on the master, to find the right place to start in the master's binlog for each domain requested by the connecting IO thread. Similar logic would need to be implemented on the slave side for the SQL thread to start correctly in the relay log.
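
            (For illustration: with several replication domains, @@gtid_slave_pos holds one position per domain, so this restart logic would have to locate a resume point in the relay log for each domain independently. The second domain below is made up:)

                SELECT @@gtid_slave_pos;
                -- e.g. '1-1000-35,2-2000-114' (illustrative): domain 1 is
                -- executed up to seqno 35, domain 2 up to seqno 114; each
                -- needs its own correct starting point in the relay log.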

            To preserve the relay log when starting the IO thread would require the IO thread to be able to request the correct starting position from the master - this would not be @@gtid_slave_pos, but some point corresponding to what is at the end of the current relay log.

            It also needs to be considered that the relay log is not crash safe, so the code needs to handle whatever is in the relay log correctly.
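
            (Relay-log durability can be tuned, e.g. with the sync_relay_log and sync_relay_log_info variables, but even syncing after every event does not by itself make the relay log consistent with the applied state after a crash, which is the concern above. Illustrative settings:)

                -- fsync the relay log after every event and the relay-log info
                -- file after every transaction (costly; shown only to
                -- illustrate the knobs, not as a fix for this issue).
                SET GLOBAL sync_relay_log = 1;
                SET GLOBAL sync_relay_log_info = 1;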

            The problem to be solved in this task is not the coding, it is the design, to carefully consider all relevant scenarios and decide how to handle them correctly. Ad-hoc testing will not be able to exhaustively test all required cases and avoid tricky regressions in corner cases.

            Elkin Andrei Elkin added a comment -

            Well, I did not focus in my last comment on the next steps.
            It was clear to me that the start position for the applier (line 10) would be the Relay_Log_File:Relay_Log_Pos pair from relay_log_info, as left there after the stop at line 2.
            knielsen, an applier behaviour where the relay logs remain at/after line 10 would be satisfactory to the user, and arguably natural for them too.
            Maybe we need a new slave applier option for that; with it set ON,

            10 start slave sql_thread preserve_relay_logs
            

            let's start the IO thread

            11 start slave io_thread
            

            I suggest the IO thread would just do as it normally does, which includes continuing to append to the relay logs.

            In effect, the GTID slave mode's relay-log resetting behaviour gets narrowed to slaves
            whose applier is not configured with the new option.


            knielsen Kristian Nielsen added a comment -

            You want to use relay_log_info, but you ignore my comment that this is not crash safe.
            You also ignore my comment about considering relevant scenarios.
            It makes it very hard to sensibly discuss new features :-/

            Let's not push to users yet another option preserve_relay_logs that they need to understand and consider. Let's just implement things properly and correctly and have the server do the right thing.

            The IO thread cannot "do as it normally does", as that is to fetch from the @@gtid_slave_pos on the master and append that to the relay log. But this is wrong if the relay log already contains some of those events because it was not deleted before starting IO thread.
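
            (Using the numbers from the report: the relay log already held events up to 1-1000-37, but a restarted IO thread requests from the executed state, visible on the replica as:)

                -- Executed GTID state; in GTID mode this is what a freshly
                -- started IO thread asks the master for, not the end of the
                -- existing relay log (1-1000-37 in the report).
                SELECT @@gtid_slave_pos;   -- 1-1000-35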

            As I wrote before, I'd like to see a full description of the issues to be considered and how to handle them. Preserving relay logs over slave restart is very desirable, but it needs to be done 100% correctly and always give the correct slave behaviour, in all scenarios. That is the hard part of this task, not removing the code that deletes the relay log.

            Elkin Andrei Elkin added a comment -

            I did not comment on

            > It also needs to be considered that the relay log is not crash safe

            because I thought I did not have to in the context of my 11-line example.
            Perhaps I should have made that clear, e.g. by asking why (which I am doing now, below).
            I promise to do that next time, to avoid any impression of ignoring it for no reason.
            Please don't take it the wrong way, dear Kristian. It can be difficult to discuss anything with me, not least because at work there are always questions for me to answer that nobody lets me ignore.

            So why should we care about crash recovery of the relay log, when the new option in

            start slave sql_thread preserve_relay_logs
            

            just makes the slave operate in the traditional mode? I explored the IO thread's behaviour when the option was OFF and then switched ON; that behaviour makes sense to me.
            I did not consider the opposite case, but it should be straightforward.
            Let's redraw the above sequence to focus on the effects on the IO thread:

             1 --connection slave
             2   STOP SLAVE SQL_THREAD;
             3 
             4 --connection master
             5   /* create more gtid:s */
             6 
             7 #--connection slave
             8 --sync_slave_io_with_master
             9 STOP SLAVE;
            10 START SLAVE IO_THREAD; # => reset relay-logs
            11 START SLAVE SQL_THREAD PRESERVE_RELAY_LOGS; # will not do anything special *now*
            12 STOP SLAVE IO_THREAD;
            13 START SLAVE IO_THREAD; # => won't reset relay-logs because of line 11
            

            The last line shows that a slave configured with master_use_gtid operates in the traditional
            mode when its applier thread is configured with the new option.

            It's a lightweight, though not entirely consistent, solution.
            And the value of the new option is that it resolves a real user issue.


            People

              Assignee: Elkin Andrei Elkin
              Reporter: pandi.gurusamy Pandikrishnan Gurusamy
              Votes: 0
              Watchers: 4

