  MariaDB MaxScale / MXS-6030

CDC/Avrorouter Fails to Recover from Error 1236 When gtid_start_pos=newest


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 24.02.2
    • Fix Version/s: None
    • Component/s: avrorouter
    • Labels: None

    Description

      When using the MaxScale CDC service (avrorouter) with gtid_start_pos=newest, the service fails to recover automatically when its saved GTID position is no longer available on the MariaDB server (error 1236). This commonly occurs when a MariaDB node joins a Galera cluster with a different GTID history, causing the binlog files that contain the saved position to be purged.

      Environment
      MaxScale Version: 24.02.2
      MariaDB Version: 11.7

      Configuration:
      CDC service using avrorouter in direct replication mode

      [cdc-service]
      type=service
      router=avrorouter
      servers=dbserver
      server_id=01
      user=maxscale_user
      password=somepwd
      group_rows=20
      gtid_start_pos=newest

      [cdc-listener]
      type=listener
      service=cdc-service
      protocol=CDC
      port=4001

      Steps to Reproduce

      1. Start MaxScale CDC service with gtid_start_pos=newest
      2. MaxScale begins replicating from MariaDB server at GTID 0-1-64151,1-1-1
      3. MaxScale saves this GTID to current_gtid.txt
      4. MariaDB server joins a new Galera cluster (or performs SST/IST)
      5. Server's GTID jumps to 0-1-245100,1-1-1 (cluster's current state)
      6. Old binlog files containing GTID 0-1-64151 are purged (a way to simulate the purge is sketched after this list)
      7. MaxScale reconnects to the server

      Expected Behavior
      When gtid_start_pos=newest is configured and the saved GTID position is no longer available on the server (error 1236), MaxScale should automatically recover and continue replication from the server's current GTID position, similar to how it behaves on initial startup when no current_gtid.txt exists.

      Actual Behavior
      MaxScale enters an infinite retry loop, continuously attempting to replicate from the old, unavailable GTID position. The CDC service remains broken until MaxScale is manually restarted with current_gtid.txt deleted.

      Observations:

      • MaxScale monitor shows the correct GTID: running maxctrl list servers shows that MaxScale correctly detects the server's current GTID as 0-1-245100,1-1-1.
      • current_gtid.txt contains a stale GTID: the file still contains the old GTID 0-1-64151,1-1-1 and is never updated despite the error.
      • MariaDB server logs show the discrepancy: the first connection successfully starts with the new GTID, while MaxScale's connection fails because it requests the old, purged GTID.
      • gtid_start_pos=newest only works on first startup: the parameter appears to take effect only when there is no saved state in current_gtid.txt. Once a GTID is saved, it is always used regardless of whether it is still valid on the server.
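
      The mismatch can be seen side by side with the commands below. The current_gtid.txt path is an assumption based on the default avrodir of /var/lib/maxscale; adjust it if avrodir points elsewhere:

      # GTID position the monitor reports for the server (the new, correct one)
      maxctrl list servers

      # GTID position the avrorouter will request on its next reconnect (the stale one)
      cat /var/lib/maxscale/current_gtid.txt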

      Current Workaround
      The only way to recover is to:

      • Stop MaxScale
      • Delete the saved current_gtid.txt
      • Start MaxScale again

      With the saved state removed, MaxScale behaves as if it is starting fresh and queries the server's current GTID.
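
      A minimal sketch of the manual recovery, assuming a systemd-managed maxscale service and the default avrodir of /var/lib/maxscale (both are assumptions; adjust the unit name and path to the actual deployment):

      # Stop MaxScale, discard the stale replication position, then start it again.
      # With no current_gtid.txt present, gtid_start_pos=newest takes effect and
      # the avrorouter resumes from the server's current GTID.
      systemctl stop maxscale
      rm /var/lib/maxscale/current_gtid.txt   # path is an assumption (default avrodir)
      systemctl start maxscale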

          People

            Assignee: Unassigned
            Reporter: Sahai Har Gagan
