Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-20122

Deprecate MASTER_USE_GTID=Current_Pos to favor new MASTER_DEMOTE_TO_SLAVE option

    XMLWordPrintable

Details

    Description

      ======================================
      Description update after problem discussion:
      ======================================

      This work deprecates Current_Pos as an option to CHANGE MASTER TO MASTER_USE_GTID while also adding a safe replacement option MASTER_DEMOTE_TO_SLAVE=<bool>. Specifically, the use case of Current_Pos is to transition a master to become a slave; however, this can break replication state due to actively updating gtid_current_pos with gtid_binlog_pos and gtid_slave_pos.

      MASTER_DEMOTE_TO_SLAVE changes this use case by forcing users to set Using_Gtid=Slave_Pos and merging gtid_binlog_pos into gtid_slave_pos once at CHANGE MASTER TO time. Note that if gtid_slave_pos is more recent than gtid_binlog_pos (as in the case of chain replication), the replication state should be preserved.

      Then, MASTER_USE_GTID=Current_Pos is deprecated in favor of using Slave_Pos in combination with MASTER_DEMOTE_TO_SLAVE=1.

      ==========================
      Original Description:
      ==========================

      When a slave is configured to replicate with "MASTER_USE_GTID=current_pos", the slave uses its value of gtid_current_pos to replicate from the master.

      https://mariadb.com/kb/en/library/change-master-to/#master_use_gtid

      https://mariadb.com/kb/en/library/gtid/#gtid_current_pos

      The value of gtid_current_pos includes GTIDs from both gtid_slave_pos and gtid_binlog_pos:

      https://mariadb.com/kb/en/library/gtid/#gtid_slave_pos

      https://mariadb.com/kb/en/library/gtid/#gtid_binlog_pos

      Since both gtid_slave_pos and gtid_binlog_pos are used, this means that the position takes into account both local transactions and replicated transactions. This can be somewhat problematic, since it means that executing a single local transaction on the slave can end up breaking replication, due to the fact that the local transaction would cause the slave's GTID position to become inconsistent with the master's GTID position. However, in my opinion, this makes sense, given the design of the GTID functionality. To prevent this specific issue, if a slave is using "MASTER_USE_GTID=current_pos", then it should have read_only=ON set.

      However, the more problematic issue is that MariaDB will not alert users to the inconsistent GTID position until the slave threads are restarted. If the slave is running smoothly, then the slave threads may not be restarted for weeks or months.

      The root cause of this appears to be that the slave's I/O thread only initializes its local value of gtid_current_pos when the thread is first started in start_slave_threads():

      https://github.com/MariaDB/server/blob/mariadb-10.4.6/sql/slave.cc#L1400

      This means that if a local transaction is executed on the slave, then the slave won't notice that its GTID position is inconsistent with the master until the slave threads are restarted.

      For example, let's say that I have a master and a slave.

      The master's GTID position:

      MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
      +------------------------+--------------------+
      | Variable_name          | Value              |
      +------------------------+--------------------+
      | gtid_binlog_pos        | 1-1-95,3-1-1,4-2-1 |
      | gtid_binlog_state      | 1-1-95,3-1-1,4-2-1 |
      | gtid_current_pos       | 1-1-95,3-1-1,4-2-1 |
      | gtid_domain_id         | 3                  |
      | gtid_ignore_duplicates | OFF                |
      | gtid_pos_auto_engines  |                    |
      | gtid_slave_pos         | 1-1-95,3-1-1,4-2-1 |
      | gtid_strict_mode       | OFF                |
      | wsrep_gtid_domain_id   | 0                  |
      | wsrep_gtid_mode        | OFF                |
      +------------------------+--------------------+
      10 rows in set (0.001 sec)
      

      The slave's GTID position:

      MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
      +------------------------+--------------------+
      | Variable_name          | Value              |
      +------------------------+--------------------+
      | gtid_binlog_pos        | 1-1-95,3-1-1,4-2-1 |
      | gtid_binlog_state      | 1-1-95,3-1-1,4-2-1 |
      | gtid_current_pos       | 1-1-95,3-1-1,4-2-1 |
      | gtid_domain_id         | 4                  |
      | gtid_ignore_duplicates | OFF                |
      | gtid_pos_auto_engines  |                    |
      | gtid_slave_pos         | 1-1-95,3-1-1       |
      | gtid_strict_mode       | OFF                |
      | wsrep_gtid_domain_id   | 0                  |
      | wsrep_gtid_mode        | OFF                |
      +------------------------+--------------------+
      10 rows in set (0.001 sec)
      

      And let's say that the slave is configured to use "MASTER_USE_GTID=current_pos":

      MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='172.30.0.105', MASTER_USER='maxscale', MASTER_PASSWORD='password', MASTER_USE_GTID=current_pos;
      Query OK, 0 rows affected (0.009 sec)
       
      MariaDB [(none)]> START SLAVE;
      Query OK, 0 rows affected (0.045 sec)
      

      And the slave is initially replicating normally:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb-bin.000001
                 Read_Master_Log_Pos: 376
                      Relay_Log_File: ip-172-30-0-96-relay-bin.000002
                       Relay_Log_Pos: 717
               Relay_Master_Log_File: mariadb-bin.000001
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 376
                     Relay_Log_Space: 1035
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: 0
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error:
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 1
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Current_Pos
                         Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      But then let's say that we execute a local transaction on the slave. We can see that the slave's gtid_binlog_pos changes:

      MariaDB [(none)]> CREATE DATABASE slave_db;
      Query OK, 1 row affected (0.000 sec)
       
      MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
      +------------------------+--------------------+
      | Variable_name          | Value              |
      +------------------------+--------------------+
      | gtid_binlog_pos        | 1-1-95,3-1-1,4-2-2 |
      | gtid_binlog_state      | 1-1-95,3-1-1,4-2-2 |
      | gtid_current_pos       | 1-1-95,3-1-1,4-2-2 |
      | gtid_domain_id         | 4                  |
      | gtid_ignore_duplicates | OFF                |
      | gtid_pos_auto_engines  |                    |
      | gtid_slave_pos         | 1-1-95,3-1-1       |
      | gtid_strict_mode       | OFF                |
      | wsrep_gtid_domain_id   | 0                  |
      | wsrep_gtid_mode        | OFF                |
      +------------------------+--------------------+
      10 rows in set (0.001 sec)
      

      But at first, the slave doesn't actually notice that its position is inconsistent with the master:

      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State: Waiting for master to send event
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb-bin.000001
                 Read_Master_Log_Pos: 376
                      Relay_Log_File: ip-172-30-0-96-relay-bin.000002
                       Relay_Log_Pos: 717
               Relay_Master_Log_File: mariadb-bin.000001
                    Slave_IO_Running: Yes
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 376
                     Relay_Log_Space: 1035
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: 0
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 0
                       Last_IO_Error:
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 1
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Current_Pos
                         Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      The slave only notices when the slave threads are restarted:

      MariaDB [(none)]> STOP SLAVE;
      Query OK, 0 rows affected (0.002 sec)
       
      MariaDB [(none)]> START SLAVE;
      Query OK, 0 rows affected (0.005 sec)
       
      MariaDB [(none)]> SHOW SLAVE STATUS\G
      *************************** 1. row ***************************
                      Slave_IO_State:
                         Master_Host: 172.30.0.105
                         Master_User: maxscale
                         Master_Port: 3306
                       Connect_Retry: 60
                     Master_Log_File: mariadb-bin.000001
                 Read_Master_Log_Pos: 376
                      Relay_Log_File: ip-172-30-0-96-relay-bin.000001
                       Relay_Log_Pos: 4
               Relay_Master_Log_File: mariadb-bin.000001
                    Slave_IO_Running: No
                   Slave_SQL_Running: Yes
                     Replicate_Do_DB:
                 Replicate_Ignore_DB:
                  Replicate_Do_Table:
              Replicate_Ignore_Table:
             Replicate_Wild_Do_Table:
         Replicate_Wild_Ignore_Table:
                          Last_Errno: 0
                          Last_Error:
                        Skip_Counter: 0
                 Exec_Master_Log_Pos: 376
                     Relay_Log_Space: 296
                     Until_Condition: None
                      Until_Log_File:
                       Until_Log_Pos: 0
                  Master_SSL_Allowed: No
                  Master_SSL_CA_File:
                  Master_SSL_CA_Path:
                     Master_SSL_Cert:
                   Master_SSL_Cipher:
                      Master_SSL_Key:
               Seconds_Behind_Master: NULL
       Master_SSL_Verify_Server_Cert: No
                       Last_IO_Errno: 1236
                       Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 4-2-2, which is not in the master's binlog'
                      Last_SQL_Errno: 0
                      Last_SQL_Error:
         Replicate_Ignore_Server_Ids:
                    Master_Server_Id: 1
                      Master_SSL_Crl:
                  Master_SSL_Crlpath:
                          Using_Gtid: Current_Pos
                         Gtid_IO_Pos: 1-1-95,4-2-2,3-1-1
             Replicate_Do_Domain_Ids:
         Replicate_Ignore_Domain_Ids:
                       Parallel_Mode: conservative
                           SQL_Delay: 0
                 SQL_Remaining_Delay: NULL
             Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
                    Slave_DDL_Groups: 0
      Slave_Non_Transactional_Groups: 0
          Slave_Transactional_Groups: 0
      1 row in set (0.000 sec)
      

      I think the slave should warn the user about this, so that users can be aware of inconsistent positions, even when the slave threads are not restarted.

      For example, here's one potential fix:

      If a slave has "MASTER_USE_GTID=current_pos" set, then the slave's I/O thread could periodically compare the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos) to the slave's global value of gtid_binlog_pos. If the global value of gtid_binlog_pos contains GTIDs that are greater than the GTIDs in the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos), then the slave could write a warning to the error log. If gtid_strict_mode were enabled, then maybe the warning could be changed to an error.

      Attachments

        Issue Links

          Activity

            People

              bnestere Brandon Nesterenko
              GeoffMontee Geoff Montee (Inactive)
              Votes:
              3 Vote for this issue
              Watchers:
              19 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.