Details
-
Task
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
Description
======================================
Description update after problem discussion:
======================================
This work deprecates Current_Pos as an option to CHANGE MASTER TO MASTER_USE_GTID while also adding a safe replacement option MASTER_DEMOTE_TO_SLAVE=<bool>. Specifically, the use case of Current_Pos is to transition a master to become a slave; however, this can break replication state due to actively updating gtid_current_pos with gtid_binlog_pos and gtid_slave_pos.
MASTER_DEMOTE_TO_SLAVE changes this use case by forcing users to set Using_Gtid=Slave_Pos and merging gtid_binlog_pos into gtid_slave_pos once at CHANGE MASTER TO time. Note that if gtid_slave_pos is more recent than gtid_binlog_pos (as in the case of chain replication), the replication state should be preserved.
Then, MASTER_USE_GTID=Current_Pos is deprecated in favor of using Slave_Pos in combination with MASTER_DEMOTE_TO_SLAVE=1.
==========================
Original Description:
==========================
When a slave is configured to replicate with "MASTER_USE_GTID=current_pos", the slave uses its value of gtid_current_pos to replicate from the master.
https://mariadb.com/kb/en/library/change-master-to/#master_use_gtid
https://mariadb.com/kb/en/library/gtid/#gtid_current_pos
The value of gtid_current_pos includes GTIDs from both gtid_slave_pos and gtid_binlog_pos:
https://mariadb.com/kb/en/library/gtid/#gtid_slave_pos
https://mariadb.com/kb/en/library/gtid/#gtid_binlog_pos
Since both gtid_slave_pos and gtid_binlog_pos are used, this means that the position takes into account both local transactions and replicated transactions. This can be somewhat problematic, since it means that executing a single local transaction on the slave can end up breaking replication, due to the fact that the local transaction would cause the slave's GTID position to become inconsistent with the master's GTID position. However, in my opinion, this makes sense, given the design of the GTID functionality. To prevent this specific issue, if a slave is using "MASTER_USE_GTID=current_pos", then it should have read_only=ON set.
However, the more problematic issue is that MariaDB will not alert users to the inconsistent GTID position until the slave threads are restarted. If the slave is running smoothly, then the slave threads may not be restarted for weeks or months.
The root cause of this appears to be that the slave's I/O thread only initializes its local value of gtid_current_pos when the thread is first started in start_slave_threads():
https://github.com/MariaDB/server/blob/mariadb-10.4.6/sql/slave.cc#L1400
This means that if a local transaction is executed on the slave, then the slave won't notice that its GTID position is inconsistent with the master until the slave threads are restarted.
For example, let's say that I have a master and a slave.
The master's GTID position:
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
|
+------------------------+--------------------+
|
| Variable_name | Value |
|
+------------------------+--------------------+
|
| gtid_binlog_pos | 1-1-95,3-1-1,4-2-1 |
|
| gtid_binlog_state | 1-1-95,3-1-1,4-2-1 |
|
| gtid_current_pos | 1-1-95,3-1-1,4-2-1 |
|
| gtid_domain_id | 3 |
|
| gtid_ignore_duplicates | OFF |
|
| gtid_pos_auto_engines | |
|
| gtid_slave_pos | 1-1-95,3-1-1,4-2-1 |
|
| gtid_strict_mode | OFF |
|
| wsrep_gtid_domain_id | 0 |
|
| wsrep_gtid_mode | OFF |
|
+------------------------+--------------------+
|
10 rows in set (0.001 sec)
|
The slave's GTID position:
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
|
+------------------------+--------------------+
|
| Variable_name | Value |
|
+------------------------+--------------------+
|
| gtid_binlog_pos | 1-1-95,3-1-1,4-2-1 |
|
| gtid_binlog_state | 1-1-95,3-1-1,4-2-1 |
|
| gtid_current_pos | 1-1-95,3-1-1,4-2-1 |
|
| gtid_domain_id | 4 |
|
| gtid_ignore_duplicates | OFF |
|
| gtid_pos_auto_engines | |
|
| gtid_slave_pos | 1-1-95,3-1-1 |
|
| gtid_strict_mode | OFF |
|
| wsrep_gtid_domain_id | 0 |
|
| wsrep_gtid_mode | OFF |
|
+------------------------+--------------------+
|
10 rows in set (0.001 sec)
|
And let's say that the slave is configured to use "MASTER_USE_GTID=current_pos":
MariaDB [(none)]> CHANGE MASTER TO MASTER_HOST='172.30.0.105', MASTER_USER='maxscale', MASTER_PASSWORD='password', MASTER_USE_GTID=current_pos;
|
Query OK, 0 rows affected (0.009 sec)
|
 |
MariaDB [(none)]> START SLAVE;
|
Query OK, 0 rows affected (0.045 sec)
|
And the slave is initially replicating normally:
MariaDB [(none)]> SHOW SLAVE STATUS\G
|
*************************** 1. row ***************************
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 172.30.0.105
|
Master_User: maxscale
|
Master_Port: 3306
|
Connect_Retry: 60
|
Master_Log_File: mariadb-bin.000001
|
Read_Master_Log_Pos: 376
|
Relay_Log_File: ip-172-30-0-96-relay-bin.000002
|
Relay_Log_Pos: 717
|
Relay_Master_Log_File: mariadb-bin.000001
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 376
|
Relay_Log_Space: 1035
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 0
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Current_Pos
|
Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
|
Replicate_Do_Domain_Ids:
|
Replicate_Ignore_Domain_Ids:
|
Parallel_Mode: conservative
|
SQL_Delay: 0
|
SQL_Remaining_Delay: NULL
|
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
|
Slave_DDL_Groups: 0
|
Slave_Non_Transactional_Groups: 0
|
Slave_Transactional_Groups: 0
|
1 row in set (0.000 sec)
|
But then let's say that we execute a local transaction on the slave. We can see that the slave's gtid_binlog_pos changes:
MariaDB [(none)]> CREATE DATABASE slave_db;
|
Query OK, 1 row affected (0.000 sec)
|
 |
MariaDB [(none)]> SHOW GLOBAL VARIABLES LIKE '%gtid%';
|
+------------------------+--------------------+
|
| Variable_name | Value |
|
+------------------------+--------------------+
|
| gtid_binlog_pos | 1-1-95,3-1-1,4-2-2 |
|
| gtid_binlog_state | 1-1-95,3-1-1,4-2-2 |
|
| gtid_current_pos | 1-1-95,3-1-1,4-2-2 |
|
| gtid_domain_id | 4 |
|
| gtid_ignore_duplicates | OFF |
|
| gtid_pos_auto_engines | |
|
| gtid_slave_pos | 1-1-95,3-1-1 |
|
| gtid_strict_mode | OFF |
|
| wsrep_gtid_domain_id | 0 |
|
| wsrep_gtid_mode | OFF |
|
+------------------------+--------------------+
|
10 rows in set (0.001 sec)
|
But at first, the slave doesn't actually notice that its position is inconsistent with the master:
MariaDB [(none)]> SHOW SLAVE STATUS\G
|
*************************** 1. row ***************************
|
Slave_IO_State: Waiting for master to send event
|
Master_Host: 172.30.0.105
|
Master_User: maxscale
|
Master_Port: 3306
|
Connect_Retry: 60
|
Master_Log_File: mariadb-bin.000001
|
Read_Master_Log_Pos: 376
|
Relay_Log_File: ip-172-30-0-96-relay-bin.000002
|
Relay_Log_Pos: 717
|
Relay_Master_Log_File: mariadb-bin.000001
|
Slave_IO_Running: Yes
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 376
|
Relay_Log_Space: 1035
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: 0
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 0
|
Last_IO_Error:
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Current_Pos
|
Gtid_IO_Pos: 1-1-95,4-2-1,3-1-1
|
Replicate_Do_Domain_Ids:
|
Replicate_Ignore_Domain_Ids:
|
Parallel_Mode: conservative
|
SQL_Delay: 0
|
SQL_Remaining_Delay: NULL
|
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
|
Slave_DDL_Groups: 0
|
Slave_Non_Transactional_Groups: 0
|
Slave_Transactional_Groups: 0
|
1 row in set (0.000 sec)
|
The slave only notices when the slave threads are restarted:
MariaDB [(none)]> STOP SLAVE;
|
Query OK, 0 rows affected (0.002 sec)
|
 |
MariaDB [(none)]> START SLAVE;
|
Query OK, 0 rows affected (0.005 sec)
|
 |
MariaDB [(none)]> SHOW SLAVE STATUS\G
|
*************************** 1. row ***************************
|
Slave_IO_State:
|
Master_Host: 172.30.0.105
|
Master_User: maxscale
|
Master_Port: 3306
|
Connect_Retry: 60
|
Master_Log_File: mariadb-bin.000001
|
Read_Master_Log_Pos: 376
|
Relay_Log_File: ip-172-30-0-96-relay-bin.000001
|
Relay_Log_Pos: 4
|
Relay_Master_Log_File: mariadb-bin.000001
|
Slave_IO_Running: No
|
Slave_SQL_Running: Yes
|
Replicate_Do_DB:
|
Replicate_Ignore_DB:
|
Replicate_Do_Table:
|
Replicate_Ignore_Table:
|
Replicate_Wild_Do_Table:
|
Replicate_Wild_Ignore_Table:
|
Last_Errno: 0
|
Last_Error:
|
Skip_Counter: 0
|
Exec_Master_Log_Pos: 376
|
Relay_Log_Space: 296
|
Until_Condition: None
|
Until_Log_File:
|
Until_Log_Pos: 0
|
Master_SSL_Allowed: No
|
Master_SSL_CA_File:
|
Master_SSL_CA_Path:
|
Master_SSL_Cert:
|
Master_SSL_Cipher:
|
Master_SSL_Key:
|
Seconds_Behind_Master: NULL
|
Master_SSL_Verify_Server_Cert: No
|
Last_IO_Errno: 1236
|
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 4-2-2, which is not in the master's binlog'
|
Last_SQL_Errno: 0
|
Last_SQL_Error:
|
Replicate_Ignore_Server_Ids:
|
Master_Server_Id: 1
|
Master_SSL_Crl:
|
Master_SSL_Crlpath:
|
Using_Gtid: Current_Pos
|
Gtid_IO_Pos: 1-1-95,4-2-2,3-1-1
|
Replicate_Do_Domain_Ids:
|
Replicate_Ignore_Domain_Ids:
|
Parallel_Mode: conservative
|
SQL_Delay: 0
|
SQL_Remaining_Delay: NULL
|
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
|
Slave_DDL_Groups: 0
|
Slave_Non_Transactional_Groups: 0
|
Slave_Transactional_Groups: 0
|
1 row in set (0.000 sec)
|
I think the slave should warn the user about this, so that users can be aware of inconsistent positions, even when the slave threads are not restarted.
For example, here's one potential fix:
If a slave has "MASTER_USE_GTID=current_pos" set, then the slave's I/O thread could periodically compare the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos) to the slave's global value of gtid_binlog_pos. If the global value of gtid_binlog_pos contains GTIDs that are greater than the GTIDs in the thread's local value of gtid_current_pos (i.e. mi->gtid_current_pos), then the slave could write a warning to the error log. If gtid_strict_mode were enabled, then maybe the warning could be changed to an error.
Attachments
Issue Links
- causes
-
MDEV-31768 Alias MASTER_DEMOTE_TO_REPLICA for MASTER_DEMOTE_TO_SLAVE
- Closed
-
MDEV-34064 ERROR 12607 (HY000): This xid is not exist and Assertion `!is_set() || (m_status == DA_OK_BULK && is_bulk_op())' failed from change_master()
- Confirmed
- is duplicated by
-
MDEV-16800 Instead of master_use_gtid=current_pos , create gtid_slave_pos_from_master=1;
- Closed
- relates to
-
MDEV-10279 gtid_current_pos is not updated with slave transactions from old master
- Open
-
MDEV-16834 GTID current_pos easily breaks replication
- Closed
-
MDEV-17156 Local transactions on a Slave don't update GTID's gtid_current_pos after RESET MASTER on Slave (master_use_gtid value is not relevant)
- Closed
-
MDEV-28839 Convert master_use_gtid = current_pos to slave_pos in mtr suites
- Closed
-
MDEV-30647 Remove Current_Pos from MASTER_USE_GTID options
- Closed
-
MDEV-31768 Alias MASTER_DEMOTE_TO_REPLICA for MASTER_DEMOTE_TO_SLAVE
- Closed
- split to
-
MDEV-32976 Un-deprecate MASTER_USE_GTID=Current_Pos
- Closed