Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Incomplete
-
6.4.3
-
None
Description
I've been chasing a strange problem (possibly a bug) that occurs intermittently, roughly once every 3-6 months. The sequence looks like this:
- An SQL dump of a local stand-alone MariaDB database (taken with HeidiSQL) is loaded onto a master/slave cluster via MaxScale (presumably with the same tool).
- MaxScale starts feeding the dump to the master (as expected).
- At some point, MaxScale makes a connection to the slave and tries to execute something there.
- The replica runs with GTID strict mode enabled and stops the replication:
An attempt was made to binlog GTID 0-11-170271865 which would create an out-of-order sequence number with existing GTID 0-12-170271865, and gtid strict mode is enabled
The binlog on the master and the relay log on the replica are identical, as expected. The replica's own binlog, however, shows a GTID created with the local server ID (but containing no data), after which replication stops.
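To make the error above concrete, here is a minimal Python sketch of the ordering rule that GTID strict mode enforces: within one replication domain, a new binlog event must have a sequence number strictly greater than the last one already in the binlog. MariaDB GTIDs have the form domain-server-sequence. The names `Gtid`, `parse_gtid`, and `violates_strict_mode` are hypothetical helpers written for illustration and are not part of MariaDB or MaxScale. It is an assumption here that server ID 12 is the replica and 11 is the master.

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    """A MariaDB GTID: domain ID, server ID, sequence number."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a 'domain-server-sequence' string such as '0-11-170271865'."""
    domain, server, seq = (int(part) for part in text.strip().split("-"))
    return Gtid(domain, server, seq)


def violates_strict_mode(existing: Gtid, incoming: Gtid) -> bool:
    """Return True if binlogging `incoming` after `existing` is out of order.

    Simplified model: within the same domain, the incoming sequence
    number must be strictly greater than the existing one.
    """
    return (
        incoming.domain_id == existing.domain_id
        and incoming.seq_no <= existing.seq_no
    )


# Values taken from the error message in this report.
existing = parse_gtid("0-12-170271865")  # already in the replica's binlog (assumed local)
incoming = parse_gtid("0-11-170271865")  # event replicated from the master (assumed)

print(violates_strict_mode(existing, incoming))
```

In this model the two GTIDs share domain 0 and the same sequence number, so the replicated event cannot be binlogged after the locally generated one. This is consistent with something having written to the replica directly before the master's event arrived.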
This has happened often enough to rule out any accidental cause, Mercury retrograde etc. Both the master and the slave are firewalled off, so there is no way any connection could have been made to them directly; it must have come from MaxScale. This happens on both loaded and test systems, so load does not seem to play a role.
The MaxScale config is pretty straightforward, with a simple read-write split. Causal reads tracking is now set to "global", but the same issue was present before this option existed; all in all, this has happened on every MaxScale release series from 2.3 to 6. We do have "use_sql_variables_in=master", but I don't see how this would cause the observed effect.
The most frustrating part is that this happens rarely, and the vast majority of data loads complete without it. However, it only ever happens during dump loads, so it must be related to them somehow.
I'm attaching the relevant parts (stripped of the repetitive INSERT statements) of the master log, relay log and slave log from a breakage that happened a few days ago. I'm also attaching the header of a HeidiSQL dump that caused a similar breakage a few months ago. That time it broke literally after the first CREATE TABLE statement, on the USE statement shown; this time it broke halfway through a dump, which I'm still waiting to receive.
I understand this is hardly enough to go on, and I cannot offer a way to reproduce the problem reliably, but maybe somebody has seen or heard of something similar? Does this ring a bell?