[MDEV-10810] GTID out-of-order on ¿concurrent? duplicate transactions Created: 2016-09-14  Updated: 2022-11-10

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.0.25, 10.0.27
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Pablo Guzman Assignee: Angelique Sklavounos (Inactive)
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Ubuntu 14.04.4 LTS (GNU/Linux 3.13.0-86-generic x86_64)


Attachments: errorlog.txt, slaveStatusAinC.txt

 Description   

Setup:
We have a replication topology in which each of three servers replicates from and to both of the others:
A => B, C. A server has server_id = 1, domain_id = 1
B => A, C. B server has server_id = 2, domain_id = 2
C => A, B. C server has server_id = 3, domain_id = 3
In this layout, only server B has read_only=OFF; A and C have it ON.
We have gtid_strict_mode, gtid_ignore_duplicates and log_slave_updates ON on every server.
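
To illustrate why duplicates are expected in this layout, here is a minimal sketch that models the links listed above as a plain dictionary and enumerates the routes a transaction written on B can take to reach another server. The names REPLICATES_TO and delivery_paths are made up for this illustration; this is a toy model, not MariaDB code.

```python
# Toy model of the replication layout described above (not MariaDB code).
# Keys are masters; values are the servers that replicate from them.
REPLICATES_TO = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B"],
}


def delivery_paths(origin, target, max_hops=2):
    """Return every loop-free path of at most max_hops links from origin to target."""
    paths = []
    stack = [[origin]]
    while stack:
        path = stack.pop()
        for nxt in REPLICATES_TO[path[-1]]:
            if nxt in path:
                continue  # skip routes that would revisit a server
            new_path = path + [nxt]
            if nxt == target:
                paths.append(new_path)
            elif len(new_path) - 1 < max_hops:
                stack.append(new_path)
    return sorted(paths)


if __name__ == "__main__":
    print(delivery_paths("B", "A"))  # [['B', 'A'], ['B', 'C', 'A']]
    print(delivery_paths("B", "C"))  # [['B', 'A', 'C'], ['B', 'C']]
```

With log_slave_updates ON, each transaction from B can therefore reach A both directly and through C (and likewise for C), so every event is offered to A and C twice.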

Issue: In this setup B handles the transactions and replicates them to A and C, which then replicate them to each other (A to C and C to A).
Randomly, and often, the slave on both B and C stops with the error:
An attempt was made to binlog GTID 2-1-850464 which would create an out-of-order sequence number with existing GTID 2-1-850464, and gtid strict mode is enabled.
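
A small sketch of how we read this error: a GTID has the form domain_id-server_id-seq_no, and we assume that strict mode only accepts a sequence number strictly greater than the last one already binlogged in the same domain. That rule is our assumption, inferred from the error text; the helper names (Gtid, parse_gtid, strict_mode_allows) are hypothetical and do not come from the server.

```python
from typing import Dict, NamedTuple


class Gtid(NamedTuple):
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Split a 'domain-server-seq' string into its three integer parts."""
    domain, server, seq = (int(part) for part in text.split("-"))
    return Gtid(domain, server, seq)


def strict_mode_allows(last_seq_by_domain: Dict[int, int], gtid: Gtid) -> bool:
    """Assumed rule: seq_no must be strictly greater than the last one in its domain."""
    last = last_seq_by_domain.get(gtid.domain_id)
    return last is None or gtid.seq_no > last


if __name__ == "__main__":
    # The binlog already holds 2-1-850464, as in the error message above.
    binlog_state = {2: 850464}
    print(strict_mode_allows(binlog_state, parse_gtid("2-1-850464")))  # False
    print(strict_mode_allows(binlog_state, parse_gtid("2-1-850465")))  # True
```

Under that assumption, an attempt to binlog a GTID a second time is rejected exactly like a genuinely out-of-order one, which would explain why the error names the same GTID twice.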

Sometimes both stop at the same GTID, sometimes at different GTIDs. (This happens in the replication between A and C; replication to and from B always works fine and is never interrupted.)

In every case the error says it is trying to binlog the same GTID that is already in the binlog.
If we stop the slave and start it again (without using the skip counter or anything else), everything works again until it randomly stops with the same error.

We aren't sure what's going on, but my gut tells me this is some kind of race condition: the same transaction arrives from both servers at roughly the same time, and gtid_ignore_duplicates seems unable to stop this from happening.
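
To make the suspected race concrete, here is a toy replay of two slave connections receiving the same GTID. It assumes each connection does a separate "check" step (is this GTID already applied?) followed by an "apply" step, with nothing tying the two together. That two-step split is purely hypothetical: we have not looked at how gtid_ignore_duplicates is implemented, and the function run and the connection names are invented for this sketch.

```python
def run(schedule, gtid="2-1-850464"):
    """Replay (connection, step) pairs for one shared GTID and collect errors."""
    applied = set()   # GTIDs already written to the binlog
    decided = {}      # connection -> decision taken at its "check" step
    errors = []
    for conn, step in schedule:
        if step == "check":
            # Decide to apply only if the GTID has not been seen yet.
            decided[conn] = gtid not in applied
        elif step == "apply" and decided.get(conn):
            if gtid in applied:
                # The other connection got there first: strict mode complains.
                errors.append(f"{conn}: out-of-order GTID {gtid}")
            else:
                applied.add(gtid)
    return errors


if __name__ == "__main__":
    serial = [("from_B", "check"), ("from_B", "apply"),
              ("from_C", "check"), ("from_C", "apply")]
    interleaved = [("from_B", "check"), ("from_C", "check"),
                   ("from_B", "apply"), ("from_C", "apply")]
    print(run(serial))       # []
    print(run(interleaved))  # ['from_C: out-of-order GTID 2-1-850464']
```

When the copies arrive one after the other, the second is silently ignored; only when both checks happen before either apply does the error appear. That would fit the symptom being random and going away after a slave restart, but it is a guess and not a confirmed cause.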

The latencies between the servers are:
A and B: 0.08 ms
A and C: 2.1 ms
B and C: 2.1 ms



 Comments   
Comment by Pablo Guzman [ 2016-09-16 ]

We have now removed one of the links as a test (C no longer replicates from A), and the issue no longer happens on C. However, A (which replicates from B and C) still hits this issue randomly and often.

All the slaves are configured with MASTER_USE_GTID=current_pos
