  MariaDB MaxScale / MXS-6030

CDC/Avrorouter Fails to Recover from Error 1236 When gtid_start_pos=newest


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 24.02.2
    • Fix Version/s: None
    • Component/s: avrorouter
    • Labels: None

    Description

      When using the MaxScale CDC service (avrorouter) with gtid_start_pos=newest, the service fails to recover automatically when its saved GTID position is no longer available on the MariaDB server (error 1236). This commonly occurs when a MariaDB node joins a Galera cluster with a different GTID history, causing the binlog files that contain the saved position to be purged.

      Environment
      MaxScale Version: 24.02.2
      MariaDB Version: 11.7

      Configuration:
      CDC service using avrorouter in direct replication mode

      [cdc-service]
      type=service
      router=avrorouter
      servers=dbserver
      server_id=01
      user=maxscale_user
      password=somepwd
      group_rows=20
      gtid_start_pos=newest

      [cdc-listener]
      type=listener
      service=cdc-service
      protocol=CDC
      port=4001

      Steps to Reproduce

      1. Start MaxScale CDC service with gtid_start_pos=newest
      2. MaxScale begins replicating from MariaDB server at GTID 0-1-64151,1-1-1
      3. MaxScale saves this GTID to current_gtid.txt
      4. MariaDB server joins a new Galera cluster (or performs SST/IST)
      5. Server's GTID jumps to 0-1-245100,1-1-1 (cluster's current state)
      6. Old binlog files containing GTID 0-1-64151 are purged (a way to simulate the purge is sketched after this list)
      7. MaxScale reconnects to the server

      Expected Behavior
      When gtid_start_pos=newest is configured and the saved GTID position is no longer available on the server (error 1236), MaxScale should automatically recover and continue replication from the server's current GTID position, similar to how it behaves on initial startup when no current_gtid.txt exists.

      Actual Behavior
      MaxScale enters an infinite retry loop, continuously attempting to replicate from the old, unavailable GTID position. The CDC service remains broken until MaxScale is manually restarted with current_gtid.txt deleted.

      Observations:

      • MaxScale monitor shows the correct GTID: running maxctrl list servers shows that MaxScale correctly detects the server's current GTID as 0-1-245100,1-1-1.
      • current_gtid.txt contains a stale GTID: the file still contains the old GTID 0-1-64151,1-1-1 and is never updated despite the error.
      • MariaDB server logs show the discrepancy: the first connection successfully starts with the new GTID, while MaxScale's connection fails because it requests the old, purged GTID.
      • gtid_start_pos=newest only works on first startup: the parameter appears to take effect only when there is no saved state in current_gtid.txt. Once a GTID is saved, it is always used regardless of whether it is still valid on the server.
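
      The mismatch can be seen side by side with the commands below. The current_gtid.txt path is an assumption based on the default avrodir of /var/lib/maxscale; adjust it if avrodir points elsewhere:

      # GTID position the monitor reports for the server (the new, correct one)
      maxctrl list servers

      # GTID position the avrorouter will request on its next reconnect (the stale one)
      cat /var/lib/maxscale/current_gtid.txt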

      Current Workaround
      The only way to recover is to:

      • Stop MaxScale
      • Delete the saved current_gtid.txt
      • Start MaxScale again

      With the saved state removed, MaxScale behaves as if it is starting fresh and queries the server's current GTID.
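
      A minimal sketch of the manual recovery, assuming a systemd-managed maxscale service and the default avrodir of /var/lib/maxscale (both are assumptions; adjust the unit name and path to the actual deployment):

      # Stop MaxScale, discard the stale replication position, then start it again.
      # With no current_gtid.txt present, gtid_start_pos=newest takes effect and
      # the avrorouter resumes from the server's current GTID.
      systemctl stop maxscale
      rm /var/lib/maxscale/current_gtid.txt   # path is an assumption (default avrodir)
      systemctl start maxscale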

          People

            Assignee: Unassigned
            Reporter: Sahai Har Gagan
