[MXS-4411] Loading a SQL dump sends queries to the replica, breaking it (GTID under strict mode) Created: 2022-11-22  Updated: 2023-02-08  Resolved: 2023-02-08

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 6.4.3
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Assen Totin (Inactive) Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None

Attachments: File HeidiSQL- dump.sql     File master.log     File relay.log     File slave.log    

 Description   

I've been chasing a weird thing (bug?) which happens from time to time (once every 3-6 months) and looks like this:

  • SQL dump from a local MariaDB stand-alone DB (taken by HeidiSQL) is then loaded onto a master/slave cluster via MaxScale (presumably by the same tool).
  • MaxScale starts feeding the dump to the master (as expected).
  • At some point, MaxScale makes a connection to the slave and tries to execute something there.
  • The replica runs with GTID strict mode enabled and stops the replication:

An attempt was made to binlog GTID 0-11-170271865 which would create an out-of-order sequence number with existing GTID 0-12-170271865, and gtid strict mode is enabled

The binlog on the master and the relay log are identical, as would be expected; the binlog on the replica shows a GTID created with the local server ID (but no data), and then replication stops.
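To make the error message above easier to read: a MariaDB GTID has the form domain-server_id-sequence, so the two GTIDs in the message share the domain (0) and the sequence number (170271865) and differ only in the server ID (11 vs. 12). Below is a minimal Python sketch of that comparison. The names Gtid, parse_gtid and violates_strict_order are made up for illustration, and the check is a simplified model of the strict-mode rule (a new sequence number in a domain must be greater than the last one already binlogged), not the server's actual implementation.

```python
from typing import NamedTuple


class Gtid(NamedTuple):
    """Hypothetical helper type: one MariaDB GTID split into its three parts."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid(text: str) -> Gtid:
    """Parse a GTID string of the form 'domain-server_id-sequence'."""
    domain, server, seq = (int(part) for part in text.strip().split("-"))
    return Gtid(domain, server, seq)


def violates_strict_order(existing: Gtid, attempted: Gtid) -> bool:
    """Simplified model of the strict-mode check (an assumption, not server code):
    within the same domain, the attempted sequence number must be strictly
    greater than the one already present in the binlog."""
    return (
        attempted.domain_id == existing.domain_id
        and attempted.seq_no <= existing.seq_no
    )


# The two GTIDs quoted in the error message.
existing = parse_gtid("0-12-170271865")
attempted = parse_gtid("0-11-170271865")

print(existing.server_id, attempted.server_id)
print(violates_strict_order(existing, attempted))
```

Under this model the attempted GTID is rejected because its sequence number is not greater than the existing one in domain 0. The differing server ID is what suggests the write originated on the replica itself rather than arriving through replication.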

This has happened enough times to rule out any accidental cause, Mercury retrograde etc.; both the master and the slave are firewalled off, so there is no way any connection could have been made directly into them, which means this must have come from MaxScale. It happens on both loaded systems and test ones, so load does not seem to play any role.

The MaxScale config is pretty straightforward, with a simple read-write split. It does have causal reads tracking now set to "global", but the same issue was present when this option did not exist; all in all, this has happened on all MaxScale mainlines from 2.3 to 6. We do have "use_sql_variables_in=master" but I don't see how this would cause the observed effect.

The most frustrating thing is that this happens rarely, and far from every data load ends like this; however, it only happens on dump loads, so it must be somehow related.

I'm attaching the relevant parts (stripped of the repetitive INSERT statements) of the master log, relay log and slave log from a breakage that happened a few days ago, as well as the header of a dump from HeidiSQL which caused a similar breakage a few months ago. That time it broke literally after the first CREATE TABLE statement, on the USE statement shown; this time it broke halfway through the dump, which I'm still waiting to receive.

I understand these are hardly enough and I cannot give a way to reliably reproduce this, but maybe somebody has seen or heard of something similar? Does this ring any bell?



 Comments   
Comment by markus makela [ 2022-11-25 ]

Is it possible for you to try and reproduce this with log_info enabled for MaxScale? Without MaxScale logs and with no way to consistently reproduce it, it's quite hard to say what might be happening.

The only way MaxScale would send a write to a slave server would be if a bug like MXS-4269 or MXS-4411 caused something to be classified as a read-only command that modifies the session state when in reality it would be an update.

Comment by Johan Wikman [ 2022-11-25 ]

MXS-4413, which will be fixed in 6.4.4, is also one that causes an update to be sent to a slave.

Comment by markus makela [ 2022-12-01 ]

In theory this could also be caused by MXS-4421 if HeidiSQL loads the dump by writing the data without waiting for individual responses.

Comment by markus makela [ 2023-02-08 ]

Closing this as Incomplete since there's a possibility that this was caused by MXS-4421.
