Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-20438

Replication fails to start if slave's gtid_slave_pos is empty and all of master's binlogs have non-empty Gtid_list values

    XMLWordPrintable

Details

    Description

      When gtid_slave_pos is set to the empty string, in my opinion, that should mean "start from the beginning of your current binary logs and give me everything that you have".

      In reality, when gtid_slave_pos is set to the empty string, this currently means "start from the binary log position that has a Gtid_list event with the contents []".

      To reproduce this, let's say that we have one server that will be our master, and one server that will be our slave.

      On the server that will be our master, let's do some preparation to ensure that our GTID state is in a specific condition:

      -- Clear current gtid_binlog_pos
      RESET MASTER;
      -- Generate some GTIDs in the binary logs
      SET SESSION gtid_domain_id=1;
      CREATE DATABASE repltest;
      SET SESSION gtid_domain_id=2;
      CREATE TABLE repltest.tab (
         id int primary key,
         str varchar(50)
      );
      SET SESSION gtid_domain_id=3;
      INSERT INTO repltest.tab VALUES (1, 'str1');
      

      Output:

      MariaDB [(none)]> -- Clear current gtid_binlog_pos
      MariaDB [(none)]> RESET MASTER;
      Query OK, 0 rows affected (0.006 sec)
       
      MariaDB [(none)]> -- Generate some GTIDs in the binary logs
      MariaDB [(none)]> SET SESSION gtid_domain_id=1;
      Query OK, 0 rows affected (0.000 sec)
       
      MariaDB [(none)]> CREATE DATABASE repltest;
      Query OK, 1 row affected (0.004 sec)
       
      MariaDB [(none)]> SET SESSION gtid_domain_id=2;
      Query OK, 0 rows affected (0.000 sec)
       
      MariaDB [(none)]> CREATE TABLE repltest.tab (
          ->    id int primary key,
          ->    str varchar(50)
          -> );
      Query OK, 0 rows affected (0.016 sec)
       
      MariaDB [(none)]> SET SESSION gtid_domain_id=3;
      Query OK, 0 rows affected (0.000 sec)
       
      MariaDB [(none)]> INSERT INTO repltest.tab VALUES (1, 'str1');
      Query OK, 1 row affected (0.002 sec)
      

      At this point, the server's only binary log has a Gtid_list event with the contents [], and it has 3 transactions with GTIDs in three different domains:

      MariaDB [(none)]> SHOW BINLOG EVENTS;
      +--------------------+-----+-------------------+-----------+-------------+-------------------------------------------------------------------------+
      | Log_name           | Pos | Event_type        | Server_id | End_log_pos | Info                                                                    |
      +--------------------+-----+-------------------+-----------+-------------+-------------------------------------------------------------------------+
      | mariadb-bin.000001 |   4 | Format_desc       |         2 |         256 | Server ver: 10.3.17-MariaDB-log, Binlog ver: 4                          |
      | mariadb-bin.000001 | 256 | Gtid_list         |         2 |         285 | []                                                                      |
      | mariadb-bin.000001 | 285 | Binlog_checkpoint |         2 |         330 | mariadb-bin.000001                                                      |
      | mariadb-bin.000001 | 330 | Gtid              |         2 |         372 | GTID 1-2-1                                                              |
      | mariadb-bin.000001 | 372 | Query             |         2 |         467 | CREATE DATABASE repltest                                                |
      | mariadb-bin.000001 | 467 | Gtid              |         2 |         509 | GTID 2-2-1                                                              |
      | mariadb-bin.000001 | 509 | Query             |         2 |         643 | CREATE TABLE repltest.tab (
         id int primary key,
         str varchar(50)
      ) |
      | mariadb-bin.000001 | 643 | Gtid              |         2 |         685 | BEGIN GTID 3-2-1                                                        |
      | mariadb-bin.000001 | 685 | Annotate_rows     |         2 |         751 | INSERT INTO repltest.tab VALUES (1, 'str1')                             |
      | mariadb-bin.000001 | 751 | Table_map         |         2 |         804 | table_id: 25 (repltest.tab)                                             |
      | mariadb-bin.000001 | 804 | Write_rows_v1     |         2 |         847 | table_id: 25 flags: STMT_END_F                                          |
      | mariadb-bin.000001 | 847 | Xid               |         2 |         878 | COMMIT /* xid=115 */                                                    |
      +--------------------+-----+-------------------+-----------+-------------+-------------------------------------------------------------------------+
      12 rows in set (0.000 sec)
      

      Now let's purge this binary log:

      FLUSH LOGS;
      PURGE BINARY LOGS TO 'mariadb-bin.000002';
      

      Output:

      MariaDB [(none)]> FLUSH LOGS;
      Query OK, 0 rows affected (0.004 sec)
       
      MariaDB [(none)]> PURGE BINARY LOGS TO 'mariadb-bin.000002';
      Query OK, 0 rows affected (0.003 sec)
      

      At this point, the server's only binary log has a Gtid_list event that contains the previous GTIDs in three different domains:

      MariaDB [(none)]> SHOW BINLOG EVENTS;
      +--------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
      | Log_name           | Pos | Event_type        | Server_id | End_log_pos | Info                                           |
      +--------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
      | mariadb-bin.000002 |   4 | Format_desc       |         2 |         256 | Server ver: 10.3.17-MariaDB-log, Binlog ver: 4 |
      | mariadb-bin.000002 | 256 | Gtid_list         |         2 |         331 | [1-2-1,2-2-1,3-2-1]                            |
      | mariadb-bin.000002 | 331 | Binlog_checkpoint |         2 |         376 | mariadb-bin.000001                             |
      | mariadb-bin.000002 | 376 | Binlog_checkpoint |         2 |         421 | mariadb-bin.000002                             |
      | mariadb-bin.000002 | 421 | Rotate            |         2 |         470 | mariadb-bin.000003;pos=4                       |
      +--------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
      5 rows in set (0.000 sec)
      

      Now let's try to set up our second server as a slave to replicate from this server when it has gtid_slave_pos set to an empty string.

      SET GLOBAL gtid_slave_pos = '';
      CHANGE MASTER TO 
         MASTER_HOST='172.30.0.105', 
         MASTER_USER='maxscale', 
         MASTER_PASSWORD='password', 
         MASTER_USE_GTID=slave_pos;
      START SLAVE;
      

      Output:

      MariaDB [(none)]> SET GLOBAL gtid_slave_pos = '';
      Query OK, 0 rows affected (0.012 sec)
       
      MariaDB [(none)]> CHANGE MASTER TO
          ->    MASTER_HOST='172.30.0.105',
          ->    MASTER_USER='maxscale',
          ->    MASTER_PASSWORD='password',
          ->    MASTER_USE_GTID=slave_pos;
      Query OK, 0 rows affected (0.010 sec)
       
      MariaDB [(none)]> START SLAVE;
      Query OK, 0 rows affected (0.006 sec)
      

      At this point, the slave's I/O thread will fail:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State:
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File:
                 Read_Master_Log_Pos: 4
                      Relay_Log_File: ip-172-30-0-123-relay-bin.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File:
                    Slave_IO_Running: No
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 4
                     Relay_Log_Space: 256
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 1236
                       Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 2
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos:
             Replicate_Do_Domain_Ids: 3
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      In order to get it to start replicating from the start of the master's binary logs, you have to explicitly set gtid_slave_pos to the value in the Gtid_List event:

      STOP SLAVE;
      SET GLOBAL gtid_slave_pos = '1-2-1,2-2-1,3-2-1';
      CHANGE MASTER TO 
         MASTER_HOST='172.30.0.105', 
         MASTER_USER='maxscale', 
         MASTER_PASSWORD='password', 
         MASTER_USE_GTID=slave_pos;
      START SLAVE;
      

      Output:

      MariaDB [(none)]> STOP SLAVE;
      Query OK, 0 rows affected (0.002 sec)
       
      MariaDB [(none)]> SET GLOBAL gtid_slave_pos = '1-2-1,2-2-1,3-2-1';
      Query OK, 0 rows affected (0.015 sec)
       
      MariaDB [(none)]> CHANGE MASTER TO
          ->    MASTER_HOST='172.30.0.105',
          ->    MASTER_USER='maxscale',
          ->    MASTER_PASSWORD='password',
          ->    MASTER_USE_GTID=slave_pos;
      Query OK, 0 rows affected (0.006 sec)
       
      MariaDB [(none)]> START SLAVE;
      Query OK, 0 rows affected (0.006 sec)
      

      At this point, the slave is properly replicating:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb-bin.000003
                 Read_Master_Log_Pos: 421
                      Relay_Log_File: ip-172-30-0-123-relay-bin.000002
                       Relay_Log_Pos: 722
               Relay_Master_Log_File: mariadb-bin.000003
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 421
                     Relay_Log_Space: 1041
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: 0
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error:
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 2
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos: 1-2-1,2-2-1,3-2-1
             Replicate_Do_Domain_Ids: 3
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      The current behavior is not necessarily "wrong". However, it would be useful in some cases if having an empty GTID value had a special meaning that meant "start from the beginning of your current binary logs and give me everything that you have".

      This would be especially useful in environments that use a lot of different gtid_domain_id values, and you know that the server is consistent to start replicating from the first GTID of a specific domain, and the empty gtid_slave_pos value is combined with the DO_DOMAIN_IDS and/or IGNORE_DOMAIN_IDS options.

      For example, let's say that the Gtid_List event in the master's first and only binary log is for the GTID string '1-2-1,2-2-1,3-2-1':

      MariaDB [(none)]> SHOW BINLOG EVENTS;
      +--------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
      | Log_name           | Pos | Event_type        | Server_id | End_log_pos | Info                                           |
      +--------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
      | mariadb-bin.000002 |   4 | Format_desc       |         2 |         256 | Server ver: 10.3.17-MariaDB-log, Binlog ver: 4 |
      | mariadb-bin.000002 | 256 | Gtid_list         |         2 |         331 | [1-2-1,2-2-1,3-2-1]                            |
      | mariadb-bin.000002 | 331 | Binlog_checkpoint |         2 |         376 | mariadb-bin.000001                             |
      | mariadb-bin.000002 | 376 | Binlog_checkpoint |         2 |         421 | mariadb-bin.000002                             |
      | mariadb-bin.000002 | 421 | Rotate            |         2 |         470 | mariadb-bin.000003;pos=4                       |
      +--------------------+-----+-------------------+-----------+-------------+------------------------------------------------+
      5 rows in set (0.000 sec)
      

      And let's say that the slave only wants to replicate transactions with gtid_domain_id=4.

      In this case, it would make sense to set gtid_slave_pos to the empty string, because the slave should start replicating this specific domain from the very beginning, and it doesn't care about any other domains. I would think that the following would work:

      -- Start replicating from the first event with gtid_domain_id=4,
      -- and ignore all other gtid_domain_id values
      SET GLOBAL gtid_slave_pos = '';
      CHANGE MASTER TO 
         MASTER_HOST='172.30.0.105', 
         MASTER_USER='maxscale', 
         MASTER_PASSWORD='password', 
         MASTER_USE_GTID=slave_pos,
         DO_DOMAIN_IDS=(4);
      START SLAVE;
      

      Output:

      MariaDB [(none)]> SET GLOBAL gtid_slave_pos = '';
      Query OK, 0 rows affected (0.012 sec)
       
      MariaDB [(none)]> CHANGE MASTER TO
          ->    MASTER_HOST='172.30.0.105',
          ->    MASTER_USER='maxscale',
          ->    MASTER_PASSWORD='password',
          ->    MASTER_USE_GTID=slave_pos,
          ->    DO_DOMAIN_IDS=(4);
      Query OK, 0 rows affected (0.010 sec)
       
      MariaDB [(none)]> START SLAVE;
      Query OK, 0 rows affected (0.007 sec)
      

      But this does not work either:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State:
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File:
                 Read_Master_Log_Pos: 4
                      Relay_Log_File: ip-172-30-0-123-relay-bin.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File:
                    Slave_IO_Running: No
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 4
                     Relay_Log_Space: 256
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 1236
                       Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.'
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 2
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Slave_Pos
                         Gtid_IO_Pos:
             Replicate_Do_Domain_Ids: 4
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      Attachments

        Activity

          People

            Elkin Andrei Elkin
            GeoffMontee Geoff Montee (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.