  1. MariaDB Server
  2. MDEV-38684

Handling Internal XA Recovery and --tc-heuristic-recover in Automated Snapshot Restores

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved

    Description

      Background: MariaDB uses an internal two-phase commit (2PC) protocol to keep the InnoDB storage engine consistent with the binary log, which acts as the transaction coordinator (TC). During a snapshot-based backup that uses BACKUP STAGE BLOCK_COMMIT, transactions can be left in the PREPARED state in InnoDB with no corresponding commit record in the binary log.
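
      To make the in-doubt case concrete, the following is a simplified model of how prepared transactions are normally resolved against the coordinator log. It is a sketch, not the server's code: the names resolve_prepared and Outcome are hypothetical, and transaction ids are reduced to plain integers.

```python
from __future__ import annotations

from enum import Enum
from typing import Iterable, Optional


class Outcome(Enum):
    COMMIT = "commit"
    ROLLBACK = "rollback"
    IN_DOUBT = "in doubt"


def resolve_prepared(prepared_xids: Iterable[int],
                     binlog_xids: Optional[set[int]]) -> dict[int, Outcome]:
    """Classify each engine-side PREPARED transaction.

    binlog_xids is the set of transaction ids that the coordinator log
    records as committed, or None when that log is missing entirely.
    """
    result: dict[int, Outcome] = {}
    for xid in prepared_xids:
        if binlog_xids is None:
            # No coordinator log: the intended outcome cannot be known.
            result[xid] = Outcome.IN_DOUBT
        elif xid in binlog_xids:
            # The coordinator recorded the commit, so the engine follows it.
            result[xid] = Outcome.COMMIT
        else:
            # Prepared in the engine but never logged: undo it.
            result[xid] = Outcome.ROLLBACK
    return result
```

      With the log present, resolve_prepared([1, 2], {1}) commits transaction 1 and rolls back transaction 2. With binlog_xids=None, every prepared transaction is in doubt, which is the situation this issue describes.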

      Current Behavior: When such a snapshot is restored without its binary log, the server finds the in-doubt (PREPARED) transactions in InnoDB at startup. Because it cannot tell whether the coordinator intended to commit them, it applies a "safety brake" to prevent divergence between the storage engine and the coordinator:

      1. It logs: [ERROR] Found X prepared transactions! ... You have to start server with --tc-heuristic-recover switch.
      2. The process aborts. In Kubernetes and other automated environments, the restart policy then produces a CrashLoopBackOff that persists until someone intervenes manually.
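
      Until the server handles this itself, automation has to recognize the condition from the error log. A minimal sketch of that detection follows. The function name count_in_doubt is hypothetical, and the pattern covers only the part of the message quoted above; the elided middle of the message is not matched.

```python
import re
from typing import Optional

# Built from the message quoted in step 1. Only the fixed prefix is matched.
PREPARED_RE = re.compile(r"\[ERROR\].*Found (\d+) prepared transactions!")


def count_in_doubt(log_text: str) -> Optional[int]:
    """Return the count from the most recent matching log line, or None."""
    matches = PREPARED_RE.findall(log_text)
    return int(matches[-1]) if matches else None
```

      A container entrypoint or operator could call this on the tail of the error log and, if it returns a number, start the server once with the existing --tc-heuristic-recover option. That is the kind of external workaround this request aims to make unnecessary.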

      Requirement: The server should be able to recover automatically ("self-heal") when it detects that the transaction coordinator log (the binary log) is unavailable or incomplete.

      Proposed Logic Change: Introduce a new configuration variable (e.g., --tc-auto-heuristic-recover=ROLLBACK|COMMIT|OFF), or modify the existing startup logic, so that startup proceeds as follows:

      1. Detection: The server finds internal PREPARED transactions in InnoDB, and the configured TC log (the binary log) is missing or cannot be initialized.
      2. Action: Instead of aborting, the server applies a pre-configured heuristic decision, defaulting to ROLLBACK so that the data stays consistent with the missing logs.
      3. Execution: The server resolves the transactions, logs a [WARNING] instead of an [ERROR], and continues startup until it is ready to accept connections.
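
      The three steps above can be sketched as a small decision function. This is an illustration of the proposed behavior only: AutoRecover, StartupDecision and decide_startup are hypothetical names, and the mode values mirror the proposed (not yet existing) --tc-auto-heuristic-recover variable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AutoRecover(Enum):
    OFF = "OFF"
    ROLLBACK = "ROLLBACK"
    COMMIT = "COMMIT"


@dataclass(frozen=True)
class StartupDecision:
    proceed: bool              # False means the server aborts startup
    action: Optional[str]      # what is done with the prepared transactions
    log_level: Optional[str]   # level of the message written, if any


def decide_startup(prepared_count: int,
                   tc_log_usable: bool,
                   mode: AutoRecover = AutoRecover.ROLLBACK) -> StartupDecision:
    """Model the proposed handling of PREPARED transactions at startup."""
    if prepared_count == 0:
        # Nothing to resolve.
        return StartupDecision(True, None, None)
    if tc_log_usable:
        # Step 1 does not apply: normal recovery against the TC log.
        return StartupDecision(True, "RECOVER_FROM_TC_LOG", None)
    if mode is AutoRecover.OFF:
        # Today's behavior: refuse to guess and abort.
        return StartupDecision(False, None, "ERROR")
    # Steps 2 and 3: apply the configured heuristic, warn, and continue.
    return StartupDecision(True, mode.value, "WARNING")
```

      Keeping OFF as an explicit mode preserves the current abort behavior for deployments that prefer a manual decision.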

      Business Justification: In cloud-native environments (Kubernetes/OpenShift), having to supply startup flags by hand is a significant obstacle to High Availability (HA) and Disaster Recovery (DR). Automating this recovery lets standby nodes recover from filesystem-level snapshots without administrator involvement.

            People

              Assignee: Unassigned
              Reporter: suresh ramagiri (suresh.ramagiri@mariadb.com)
              Votes: 0
              Watchers: 4
