Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-20122

With MASTER_USE_GTID=current_pos, slave's I/O thread only checks gtid_current_pos when thread is first started

    XMLWordPrintable

    Details

      Description

      When a slave is configured to replicate with "MASTER_USE_GTID=current_pos", the slave uses its value of gtid_current_pos to replicate from the master.

      https://mariadb.com/kb/en/library/change-master-to/#master_use_gtid

      https://mariadb.com/kb/en/library/gtid/#gtid_current_pos

      The value of gtid_current_pos includes GTIDs from both gtid_slave_pos and gtid_binlog_pos:

      https://mariadb.com/kb/en/library/gtid/#gtid_slave_pos

      https://mariadb.com/kb/en/library/gtid/#gtid_binlog_pos

      Since both gtid_slave_pos and gtid_binlog_pos are used, this means that the position takes into account both local transactions and replicated transactions. This can be somewhat problematic, since it means that executing a single local transaction on the slave can end up breaking replication, due to the fact that the local transaction would cause the slave's GTID position to become inconsistent with the master's GTID position. However, in my opinion, this makes sense, given the design of the GTID functionality. To prevent this specific issue, if a slave is using "MASTER_USE_GTID=current_pos", then it should have read_only=ON set.

      However, the more problematic issue is that MariaDB will not alert users to the inconsistent GTID position until the slave threads are restarted. If the slave is running smoothly, then the slave threads may not be restarted for weeks or months.

      The root cause of this appears to be that the slave's I/O thread only initializes its local value of gtid_current_pos when the thread is first started in start_slave_threads():

      https://github.com/MariaDB/server/blob/mariadb-10.4.6/sql/slave.cc#L1400

      This means that if a local transaction is executed on the slave, then the slave won't notice that its GTID position is inconsistent with the master until the slave threads are restarted.

      For example, let's say that I have a master and a slave.

      The master's GTID position:

      MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
      +------------------------+--------------------+
      | Variable_name          | Value              |
      +------------------------+--------------------+
      | gtid_binlog_pos        | 1-1-95,3-1-1,4-2-1 |
      | gtid_binlog_state      | 1-1-95,3-1-1,4-2-1 |
      | gtid_current_pos       | 1-1-95,3-1-1,4-2-1 |
      | gtid_domain_id         | 3                  |
      | gtid_ignore_duplicates | OFF                |
      | gtid_pos_auto_engines  |                    |
      | gtid_slave_pos         | 1-1-95,3-1-1,4-2-1 |
      | gtid_strict_mode       | OFF                |
      | wsrep_gtid_domain_id   | 0                  |
      | wsrep_gtid_mode        | OFF                |
      +------------------------+--------------------+
      10 rows in set (0.001 sec)
      

      The slave's GTID position:

      MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
      +------------------------+--------------------+
      | Variable_name          | Value              |
      +------------------------+--------------------+
      | gtid_binlog_pos        | 1-1-95,3-1-1,4-2-1 |
      | gtid_binlog_state      | 1-1-95,3-1-1,4-2-1 |
      | gtid_current_pos       | 1-1-95,3-1-1,4-2-1 |
      | gtid_domain_id         | 4                  |
      | gtid_ignore_duplicates | OFF                |
      | gtid_pos_auto_engines  |                    |
      | gtid_slave_pos         | 1-1-95,3-1-1       |
      | gtid_strict_mode       | OFF                |
      | wsrep_gtid_domain_id   | 0                  |
      | wsrep_gtid_mode        | OFF                |
      +------------------------+--------------------+
      10 rows in set (0.001 sec)
      

      And let's say that the slave is configured to use "MASTER_USE_GTID=current_pos":

      MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='172.30.0.105', MASTER_USER='maxscale', MASTER_PASSWORD='password', MASTER_USE_GTID=current_pos;
      Query OK, 0 rows affected (0.009 sec)
       
      MariaDB [(none)]> START SLAVE;
      Query OK, 0 rows affected (0.045 sec)
      

      And the slave is initially replicating normally:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb-bin.000001
                 Read_Master_Log_Pos: 376
                      Relay_Log_File: ip-172-30-0-96-relay-bin.000002
                       Relay_Log_Pos: 717
               Relay_Master_Log_File: mariadb-bin.000001
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 376
                     Relay_Log_Space: 1035
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: 0
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error:
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 1
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Current_Pos
                         Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      But then let's say that we execute a local transaction on the slave. We can see that the slave's gtid_binlog_pos changes:

      MariaDB [(none)]> CREATE DATABASE slave_db;
      Query OK, 1 row affected (0.000 sec)
       
      MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
      +------------------------+--------------------+
      | Variable_name          | Value              |
      +------------------------+--------------------+
      | gtid_binlog_pos        | 1-1-95,3-1-1,4-2-2 |
      | gtid_binlog_state      | 1-1-95,3-1-1,4-2-2 |
      | gtid_current_pos       | 1-1-95,3-1-1,4-2-2 |
      | gtid_domain_id         | 4                  |
      | gtid_ignore_duplicates | OFF                |
      | gtid_pos_auto_engines  |                    |
      | gtid_slave_pos         | 1-1-95,3-1-1       |
      | gtid_strict_mode       | OFF                |
      | wsrep_gtid_domain_id   | 0                  |
      | wsrep_gtid_mode        | OFF                |
      +------------------------+--------------------+
      10 rows in set (0.001 sec)
      

      But at first, the slave doesn't actually notice that its position is inconsistent with the master:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb-bin.000001
                 Read_Master_Log_Pos: 376
                      Relay_Log_File: ip-172-30-0-96-relay-bin.000002
                       Relay_Log_Pos: 717
               Relay_Master_Log_File: mariadb-bin.000001
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 376
                     Relay_Log_Space: 1035
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: 0
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error:
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 1
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Current_Pos
                         Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      The slave only notices when the slave threads are restarted:

      MariaDB [(none)]> STOP SLAVE;
      Query OK, 0 rows affected (0.002 sec)
       
      MariaDB [(none)]> START SLAVE;
      Query OK, 0 rows affected (0.005 sec)
       
      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State:
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb-bin.000001
                 Read_Master_Log_Pos: 376
                      Relay_Log_File: ip-172-30-0-96-relay-bin.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File: mariadb-bin.000001
                    Slave_IO_Running: No
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 376
                     Relay_Log_Space: 296
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 1236
                       Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 4-2-2, which is not in the master's binlog'
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 1
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Current_Pos
                         Gtid_IO_Pos: 1-1-95,4-2-2,3-1-1
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      I think the slave should warn the user about this, so that users can be aware of inconsistent positions, even when the slave threads are not restarted.

      For example, here's one potential fix:

      If a slave has "MASTER_USE_GTID=current_pos" set, then the slave's I/O thread could periodically compare the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos) to the slave's global value of gtid_binlog_pos. If the global value of gtid_binlog_pos contains GTIDs that are greater than the GTIDs in the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos), then the slave could write a warning to the error log. If gtid_strict_mode were enabled, then maybe the warning could be changed to an error.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              Elkin Andrei Elkin
              Reporter:
              GeoffMontee Geoff Montee
              Votes:
              1 Vote for this issue
              Watchers:
              13 Start watching this issue

                Dates

                Created:
                Updated:

                  Git Integration