Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 24.02.2
Description
When using MaxScale CDC service (avrorouter) with gtid_start_pos=newest, the service fails to recover automatically when the MariaDB server's GTID position becomes unavailable (error 1236). This commonly occurs when a MariaDB node joins a Galera cluster with a different GTID history, causing binlog files to be purged.
Environment
MaxScale Version: 24.02.2
MariaDB Version: 11.7
Configuration:
CDC service using avrorouter in direct replication mode
[cdc-service]
type=service
router=avrorouter
servers=dbserver
server_id=01
user=maxscale_user
password=somepwd
group_rows=20
gtid_start_pos=newest
[cdc-listener]
type=listener
service=cdc-service
protocol=CDC
port=4001
Steps to Reproduce
- Start MaxScale CDC service with gtid_start_pos=newest
- MaxScale begins replicating from MariaDB server at GTID 0-1-64151,1-1-1
- MaxScale saves this GTID to current_gtid.txt
- MariaDB server joins a new Galera cluster (or performs SST/IST)
- Server's GTID jumps to 0-1-245100,1-1-1 (cluster's current state)
- Old binlog files containing GTID 0-1-64151 are purged
- MaxScale reconnects to the server
Expected Behavior
When gtid_start_pos=newest is configured and the saved GTID position is no longer available on the server (error 1236), MaxScale should automatically recover and continue replication from the server's current GTID position, similar to how it behaves on initial startup when no current_gtid.txt exists.
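The expected behavior can be sketched roughly as follows. This is only an illustration of the requested fallback, not MaxScale's actual code: `ReplicationError`, `start_replication_with_recovery`, `start_stream` and `query_current_gtid` are hypothetical names introduced for this example.

```python
from typing import Callable, Optional

BINLOG_READ_ERROR = 1236  # error number reported when the requested GTID is unavailable


class ReplicationError(Exception):
    """Hypothetical exception carrying the server's error number."""

    def __init__(self, errno: int, message: str = ""):
        super().__init__(message)
        self.errno = errno


def start_replication_with_recovery(
    saved_gtid: Optional[str],
    gtid_start_pos: str,
    start_stream: Callable[[str], None],
    query_current_gtid: Callable[[], str],
) -> str:
    """Start replication and return the GTID position that was actually used."""
    if saved_gtid:
        start = saved_gtid  # saved state (current_gtid.txt) takes priority
    elif gtid_start_pos == "newest":
        start = query_current_gtid()  # first startup: use the server's position
    else:
        start = gtid_start_pos

    try:
        start_stream(start)
        return start
    except ReplicationError as err:
        # Only fall back when the position is gone AND the user asked for "newest".
        if err.errno != BINLOG_READ_ERROR or gtid_start_pos != "newest":
            raise
        fresh = query_current_gtid()
        start_stream(fresh)
        return fresh
```

With the GTIDs from this report, a saved position of 0-1-64151,1-1-1 that fails with error 1236 would be replaced by the server's current position 0-1-245100,1-1-1 instead of being retried indefinitely. Events between the two positions are skipped, which matches what gtid_start_pos=newest already implies on a fresh start.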
Actual Behavior
MaxScale enters an infinite retry loop, continuously attempting to replicate from the old, unavailable GTID position.
The CDC service remains broken until MaxScale is manually restarted with current_gtid.txt deleted.
Observations:
- MaxScale monitor shows correct GTID: Running maxctrl list servers shows that MaxScale correctly detects the server's current GTID as 0-1-245100,1-1-1.
- current_gtid.txt contains stale GTID: The file contains the old GTID 0-1-64151,1-1-1 and is never updated despite the error.
- MariaDB server logs show the discrepancy: One connection starts successfully with the new GTID, while MaxScale's connection fails because it requests the old, purged GTID.
- gtid_start_pos=newest only works on first startup: The gtid_start_pos=newest parameter appears to take effect only when there is no saved state in current_gtid.txt. Once a GTID is saved, it is always used, regardless of whether it is still valid on the server.
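To make the gap between the saved and the current position concrete, here is a small sketch that compares two GTID lists per replication domain. It assumes current_gtid.txt holds a comma-separated list of domain-server-sequence triplets, as the values quoted in this report suggest. `parse_gtid_list` and `domains_behind` are helper names made up for this example.

```python
def parse_gtid_list(text: str) -> dict[int, tuple[int, int]]:
    """Map domain id -> (server id, sequence number) for a GTID list string."""
    positions = {}
    for item in text.strip().split(","):
        if not item.strip():
            continue
        domain, server, seq = (int(part) for part in item.strip().split("-"))
        positions[domain] = (server, seq)
    return positions


def domains_behind(saved: str, current: str) -> dict[int, int]:
    """Return, per domain, how far the saved sequence number lags the current one."""
    saved_pos = parse_gtid_list(saved)
    current_pos = parse_gtid_list(current)
    lag = {}
    for domain, (_, current_seq) in current_pos.items():
        saved_seq = saved_pos.get(domain, (0, 0))[1]
        if current_seq > saved_seq:
            lag[domain] = current_seq - saved_seq
    return lag


# Saved position from current_gtid.txt vs. the position shown by the monitor.
print(domains_behind("0-1-64151,1-1-1", "0-1-245100,1-1-1"))  # {0: 180949}
```

A lag alone does not prove that the binlogs were purged; only the server's error 1236 confirms that. The comparison merely shows that the saved state was left behind after the cluster join.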
Current Workaround
The only way to recover is to:
- Stop MaxScale
- Delete current_gtid.txt
- Start MaxScale again
This forces MaxScale to behave as if it is starting fresh and to query the server's current GTID.
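Until this is fixed, restarting MaxScale with current_gtid.txt deleted could be scripted along these lines. This is a sketch under assumptions: MaxScale is taken to run as a systemd unit named `maxscale`, and the directory holding current_gtid.txt must be passed in by the caller because it depends on the local setup. `reset_cdc_state` is a name invented for this example.

```python
import subprocess
from pathlib import Path
from typing import Callable, Sequence


def _run_checked(cmd: Sequence[str]) -> None:
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(list(cmd), check=True)


def reset_cdc_state(
    state_dir: Path,
    run: Callable[[Sequence[str]], object] = _run_checked,
    service: str = "maxscale",  # assumed systemd unit name
) -> bool:
    """Stop MaxScale, delete current_gtid.txt, start MaxScale again.

    Returns True if a saved GTID file existed and was removed.
    """
    gtid_file = state_dir / "current_gtid.txt"
    run(["systemctl", "stop", service])
    existed = gtid_file.exists()
    try:
        gtid_file.unlink(missing_ok=True)
    finally:
        # Always try to bring the service back, even if the delete failed.
        run(["systemctl", "start", service])
    return existed
```

The `run` parameter exists so the command runner can be swapped out, for example for a dry run that only records the commands. As with the manual workaround, any events between the stale GTID and the server's current position are not replicated.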