[MXS-4542] Duplicate GTID on replicas Created: 2023-03-05 Updated: 2023-03-07 |
|
| Status: | Open |
| Project: | MariaDB MaxScale |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | Assen Totin | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Description |
|
This is a continuation of the story from https://jira.mariadb.org/browse/MXS-4421 : replicas under GTID strict mode stop with a duplicate transaction ID. This time the error happened during nighttime operations, with no human action involved, so at least we can rule out loading a dump or using a particular SQL client (the clients were regular PHP web sites). Just like before, this killed multiple replicas at the same time with the same transaction number error.

The odd thing this time is the error message on the replica:

Last_Error: An attempt was made to binlog GTID 0-11-3379168833 which would create an out-of-order sequence number with existing GTID 0-11-3379168833, and gtid strict mode is enabled

Note that both the "existing" and the "new" transaction have the SAME origin server (ID=11). At first we thought this was a bug of some kind in the logging. Then we inspected the binlog on the master, the relay log on the replica (as expected, found to be identical to the binlog of the master) and the binlog on the replica. To our amazement, we found GTID 0-11-3379168833 to have completely different content in the relay log and in the binlog of the replica! The relay log entry is a long transaction, starting like this:
and the local binlog has this:
These two seem to originate from different SQL clients, as they refer to different databases (there is a strict policy of one SQL account per DB, and there are no cross-schema queries or clients that have access to both db_XXX and db_YYY). It is completely unclear how the entry in the replica's binlog came to be, because it carries the server ID of the master (11), while the local server IDs are different (12 and 13 on the affected replicas). The next puzzling thing is that the replicas continued to receive DML for a short time even after replication had stopped, and logged two more statements, this time under their own server IDs:
Again, we have two other databases involved, meaning two more SQL clients: db_AAA and db_BBB. This is the end of the log, since with replication stopped, MaxScale removed the replicas from routing. We don't have many more clues to offer, but at least we know that this mysterious issue is still around. |
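For anyone trying to repeat the comparison described above, a minimal sketch of how the same GTID can be inspected on a replica, both in the relay log and in the local binlog (the log file names and LIMIT values are hypothetical; substitute your own files as listed by SHOW BINARY LOGS):

```sql
-- Which GTID positions does the replica believe it is at?
-- (MariaDB-specific system variables)
SELECT @@gtid_slave_pos, @@gtid_binlog_pos, @@gtid_current_pos;

-- Confirm that strict mode is actually enabled
SELECT @@gtid_strict_mode;

-- Inspect events around the suspect GTID in the relay log...
SHOW RELAYLOG EVENTS IN 'relay-bin.000123' LIMIT 50;

-- ...and in the replica's own binlog, to compare their content
SHOW BINLOG EVENTS IN 'mariadb-bin.000456' LIMIT 50;
```

On disk, the same comparison can be made by decoding both files with mysqlbinlog (mariadb-binlog) and diffing the output around the suspect GTID event.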
| Comments |
| Comment by Assen Totin [ 2023-03-05 ] |
|
Hmmm, it just happened again. And again, a DELETE statement got sent to all the replicas, making the next replicated transaction fail under strict mode. Not only were these again DELETE statements, they were on the same databases and tables! This seemed suspicious to me, so I ran SHOW CREATE TABLE on them:

CREATE TABLE `wp_wfls_role_counts` (

Now, MEMORY is a bit unusual here... could it be that MaxScale routes the query this way because of the engine type? It seems to be a stand-alone query with auto-commit (the query comes from some WordPress plugin, so it should not be anything fancy by design). Does MaxScale make any decision based on the engine type? Optimistic transactions are set to false in the read-write split router, so at least it is not their fault. I'll convert these tables to Aria to see if this changes anything. |
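As a side note, the engine check and the conversion proposed in this comment boil down to something like the following (the table name is taken from the comment; running it against all suspect tables in each affected schema is an assumption):

```sql
-- Check which storage engine the suspect table uses
SELECT TABLE_SCHEMA, TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_NAME = 'wp_wfls_role_counts';

-- Convert from MEMORY to Aria, as proposed above
ALTER TABLE wp_wfls_role_counts ENGINE=Aria;
```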
| Comment by Assen Totin [ 2023-03-05 ] |
|
And the DELETE query itself is a bad one: it is "DELETE FROM db_name.table_name", with no WHERE clause... again, could this be confusing MaxScale somehow, so that it routes it to the slaves? |
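One quick way to check where a read-write split router sends a given statement is to ask the backend for its identity. A hedged sketch, assuming the connection goes through MaxScale (note that some routers may handle @@-variable reads specially, so this is a diagnostic hint, not proof):

```sql
-- Each backend has a distinct server_id (11 on the master,
-- 12 and 13 on the replicas in this report), so the result
-- reveals which server actually executed the statement.
SELECT @@server_id, @@hostname;
```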
| Comment by Johan Wikman [ 2023-03-07 ] |
|
Totin, a long shot, but do the statements contain embedded comments? |
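For reference, "embedded comments" presumably means statements of the following shape; whether the real WordPress queries look like this is an assumption, these are only illustrations:

```sql
-- Plain comment embedded inside the statement
DELETE /* cleanup */ FROM db_name.table_name;

-- MySQL/MariaDB "executable" comment syntax, which a query
-- classifier must parse rather than simply strip
DELETE /*!40000 LOW_PRIORITY */ FROM db_name.table_name;
```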