[MDEV-37135] Data Loss if a Primary is Reverted And Made a Slave - Jira

Details

Type: Bug
Status: Open
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.6, 10.11, 11.4, 10.5(EOL), 12.0(EOL), 11.8
Fix Version/s: 10.11, 11.4, 11.8
Component/s: Backup, Replication
Labels: None

Bug Category: Can result in data loss

Description

Transactions will be lost if a master is reverted to a prior snapshot and made a slave. That is, a master is reverted to some past state of itself (where its slaves hold newer transactions that originated from this same server) and is then set up as a slave of one of its former slaves.

In GTID terms, with D=domain_id, S=server_id, N=seq_no, the master at state D-S-N is reverted to some past state D-S-P where P < N, but S is unchanged. --replicate-same-server-id and --log-slave-updates currently can't both be enabled, so we have two options:

If --replicate-same-server-id is off, all transactions between D-S-P and D-S-N will be dropped when replicated to the newly-demoted slave (formerly the master, server S, which logged D-S-P through D-S-N).
If --replicate-same-server-id is on, --log-slave-updates must be off, and then the newly-demoted slave (server S) won't re-binlog these transactions, and its binary logs will have a hole.
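
The two options above can be illustrated with a toy model. This is only a sketch of the filtering rules as described in this report, not of the server's actual implementation; the `Gtid` class, the `replay_on_demoted` function, and the example values (server S = 1, new primary = 2, temporary id 99) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gtid:
    domain_id: int
    server_id: int
    seq_no: int


def replay_on_demoted(events, own_server_id,
                      replicate_same_server_id, log_slave_updates):
    """Toy model: return (applied, binlogged) for events sent to the demoted server."""
    if replicate_same_server_id and log_slave_updates:
        # Per the report, these two options currently cannot both be enabled.
        raise ValueError("options cannot both be enabled")
    applied, binlogged = [], []
    for ev in events:
        if ev.server_id == own_server_id and not replicate_same_server_id:
            continue  # event carries the replica's own server_id: dropped
        applied.append(ev)
        if log_slave_updates:
            binlogged.append(ev)  # re-logged into the replica's own binlog
    return applied, binlogged


# Server 1 was reverted to seq_no 3; the new primary still holds 0-1-4 and
# 0-1-5 from it, plus its own transaction 0-2-6 (all values are made up).
events = [Gtid(0, 1, 4), Gtid(0, 1, 5), Gtid(0, 2, 6)]

option_off = replay_on_demoted(events, 1, False, True)   # first option
option_on = replay_on_demoted(events, 1, True, False)    # second option
temp_id = replay_on_demoted(events, 99, False, True)     # temporary server_id
```

In this model the first option never applies the two transactions from server 1, the second applies them but leaves the binlog empty, and running under a temporary server_id applies and logs all three, which is the idea behind the workaround below.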

A workaround for this is to temporarily change the server_id of the demoting server while it replicates transactions it serviced after the restored snapshot. The following steps will ensure this:

1. Ensure the newly promoted primary has replicated all transactions from its former primary (which is being demoted to a replica).
2. Run `SELECT @@global.gtid_binlog_state` on the newly promoted primary. This shows a list of GTIDs (possibly several in the same domain). Each GTID in the list has a unique <domain_id, server_id> combination and identifies the last transaction executed by that server_id. Find the GTIDs whose server_id matches the @@global.server_id of the server being demoted (i.e. the server on which the snapshot is being restored). These GTIDs are referred to below as <last_gtids_from_old_master>.
3. After restoring the snapshot on the former primary (the newly demoted replica), temporarily change its server_id to a value that is unique within the cluster.
4. Ensure replication is configured with MASTER_USE_GTID=Slave_pos (rather than Current_pos or No).
5. When ready to start replication on the newly demoted replica, run START REPLICA UNTIL master_gtid_pos="<last_gtids_from_old_master>", where <last_gtids_from_old_master> comes from step 2. This catches the server up to the state it was in before the snapshot was restored.
6. Wait for replication to stop automatically (once the UNTIL condition is satisfied).
7. Restore the server_id of the newly demoted replica to its original value (i.e. the value it had before step 3).
8. Start replication as normal.
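
The GTID selection and the statements involved in the workaround can be sketched in Python. This is a minimal sketch under assumptions: the helper names `gtids_from_server` and `workaround_statements`, the sample binlog state, and the temporary server_id 99 are hypothetical. It also assumes gtid_binlog_state is a comma-separated list of domain-server-seq GTIDs, and that the replica's connection parameters are already configured. The helper only builds statement strings; it does not connect to a server or wait for replication to stop.

```python
def gtids_from_server(gtid_binlog_state, server_id):
    """Return the GTIDs in gtid_binlog_state that were logged by server_id."""
    picked = []
    for gtid in gtid_binlog_state.split(","):
        gtid = gtid.strip()
        if not gtid:
            continue
        _domain_id, gtid_server_id, _seq_no = gtid.split("-")
        if int(gtid_server_id) == server_id:
            picked.append(gtid)
    return ",".join(picked)


def workaround_statements(gtid_binlog_state, original_server_id,
                          temporary_server_id):
    """Build the statements to run on the demoted replica, in order."""
    until_pos = gtids_from_server(gtid_binlog_state, original_server_id)
    if not until_pos:
        raise ValueError("no GTIDs from the old master in the binlog state")
    return [
        # Temporary id, assumed unique within the cluster.
        f"SET GLOBAL server_id = {temporary_server_id};",
        "CHANGE MASTER TO MASTER_USE_GTID = slave_pos;",
        f'START REPLICA UNTIL master_gtid_pos = "{until_pos}";',
        "-- wait here until replication stops (UNTIL condition satisfied)",
        f"SET GLOBAL server_id = {original_server_id};",
        "START REPLICA;",
    ]


# Made-up value of @@global.gtid_binlog_state from the newly promoted primary;
# the server being demoted has server_id 1.
state = "0-1-100,0-2-57,1-1-9"
statements = workaround_statements(state, original_server_id=1,
                                   temporary_server_id=99)
```

Printing `statements` one per line gives a script to review by hand before running it; the comment line marks the point where an operator has to confirm that replication has stopped.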

Issue Links

is duplicated by

MDEV-36973 Replication breaks after snapshot/failover/restore cycle (Closed)

relates to

MDEV-10163 mysqldump hard to work and dangerous using multi master (Closed)

MDEV-11268 Start slave causing data loss (Closed)

People

Assignee: Brandon Nesterenko

Reporter: Brandon Nesterenko

Votes: 0

Watchers: 8

Dates

Created: 2025-07-01 17:17

Updated: 2025-11-26 09:10
