Details
- Bug
- Status: Open
- Critical
- Resolution: Unresolved
- 10.6, 10.11, 11.4, 11.8, 10.5(EOL), 12.0(EOL)
- None
- Can result in data loss
Description
Transactions will be lost if a master is reverted to a prior snapshot and then made a slave of one of its former slaves. That is, the master is reverted to some past state of itself (while its slaves hold newer transactions that originated from this same server) and is then set up to replicate from one of those slaves. In GTID terms (D-S-N, where D=domain_id, S=server_id, N=seq_no), the master at state D-S-N is reverted to some past state D-S-P where P < N, while S stays the same. --replicate-same-server-id and --log-slave-updates currently can't both be enabled, so we have two options:
- If --replicate-same-server-id is off, all transactions between D-S-P and D-S-N will be dropped when replicated to the newly-demoted slave (formerly the master, server S, which logged D-S-P through D-S-N), because a slave skips events that carry its own server_id.
- If --replicate-same-server-id is on, --log-slave-updates must be off, and then the newly-demoted slave (server S) won't re-binlog these transactions, and its binary logs will have a hole.
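The first case can be illustrated with a toy model. This is only a sketch of the filtering rule described above, not the server's actual implementation; the `Gtid` type, the `applied_by_replica` function, and the example values (server 1 reverted from seq_no 5 to seq_no 2) are all hypothetical.

```python
from typing import List, NamedTuple


class Gtid(NamedTuple):
    """A GTID in D-S-N form (hypothetical helper type for this sketch)."""
    domain_id: int
    server_id: int
    seq_no: int


def applied_by_replica(events: List[Gtid], own_server_id: int,
                       replicate_same_server_id: bool = False) -> List[Gtid]:
    """Toy model of the same-server-id filter: return the events applied."""
    if replicate_same_server_id:
        return list(events)
    # With the option off, events that carry the replica's own server_id
    # are skipped.
    return [e for e in events if e.server_id != own_server_id]


# Assumed scenario: server 1 logged 0-1-1 .. 0-1-5, then was reverted to a
# snapshot taken at 0-1-2 (P=2, N=5) and demoted. The new master (server 2)
# sends everything after 0-1-2, plus one transaction of its own.
stream = [Gtid(0, 1, n) for n in range(3, 6)] + [Gtid(0, 2, 6)]

applied = applied_by_replica(stream, own_server_id=1)
dropped = [e for e in stream if e not in applied]

print("applied:", applied)
print("dropped:", dropped)
```

In this model, 0-1-3 through 0-1-5 are dropped and only 0-2-6 is applied. Calling `applied_by_replica(stream, own_server_id=99)` applies all four events, which is the idea behind temporarily changing the server_id.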
A workaround for this is to temporarily change the server_id of the server being demoted while it replicates the transactions it originally executed after the restored snapshot was taken. The following steps will ensure this:
1. Ensure the newly promoted primary has replicated all transactions from its former primary (which is being demoted to a replica).
2. Run `SELECT @@global.gtid_binlog_state` on the newly promoted primary. This will show a list of GTIDs (possibly several in the same domain). Each GTID in this list has a unique <domain_id, server_id> combination, where the GTID itself shows the last transaction executed by that server_id. Find the GTIDs whose server_id matches the @@global.server_id variable of the server being demoted to a replica (i.e. the server on which the snapshot is being restored). These GTIDs will be referred to later as <last_gtids_from_old_master>.
3. After restoring the snapshot on the former primary (the newly-demoted replica), temporarily change its server_id to some value that is unique within the cluster.
4. Ensure replication is configured to use MASTER_USE_GTID=Slave_pos (rather than Current_pos or No).
5. When ready to start replication on the newly-demoted replica, use the command `START REPLICA UNTIL master_gtid_pos="<last_gtids_from_old_master>"`, where last_gtids_from_old_master comes from step 2. This will catch the server up to the state it was in before the snapshot was restored.
6. Wait for replication to stop automatically (due to the UNTIL condition being satisfied).
7. Restore the server_id of the newly-demoted replica to its original value (i.e. the value it had before step 3).
8. Start replication as normal.
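Picking <last_gtids_from_old_master> out of the gtid_binlog_state value and building the UNTIL statement could be sketched as below. This is a minimal illustration, not a supported tool: the function names and the example state string are hypothetical, and the value is assumed to be the comma-separated list of D-S-N GTIDs returned by `SELECT @@global.gtid_binlog_state`.

```python
from typing import List


def last_gtids_from_old_master(binlog_state: str, old_server_id: int) -> List[str]:
    """Return the GTIDs in binlog_state whose server_id is old_server_id."""
    matches = []
    for gtid in (g.strip() for g in binlog_state.split(",")):
        if not gtid:
            continue
        # Each GTID has the form domain_id-server_id-seq_no.
        _domain_id, server_id, _seq_no = gtid.split("-")
        if int(server_id) == old_server_id:
            matches.append(gtid)
    return matches


def build_until_statement(gtids: List[str]) -> str:
    """Build the START REPLICA UNTIL statement for the given GTIDs."""
    if not gtids:
        raise ValueError("no GTIDs found for the old master's server_id")
    return 'START REPLICA UNTIL master_gtid_pos="{}"'.format(",".join(gtids))


# Hypothetical gtid_binlog_state taken from the newly promoted primary; the
# server being demoted is assumed to have server_id 1.
state = "0-1-100,0-2-120,1-1-7"
gtids = last_gtids_from_old_master(state, old_server_id=1)
statement = build_until_statement(gtids)
print(statement)
```

Because the binlog state holds one entry per <domain_id, server_id> pair, filtering on a single server_id leaves at most one GTID per domain. The generated statement is meant to be run on the newly-demoted replica while its temporary server_id is in effect.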
Attachments
Issue Links
- is duplicated by: MDEV-36973 Replication breaks after snapshot/failover/restore cycle (Closed)