[MDEV-38684] Handling Internal XA Recovery and --tc-heuristic-recover in Automated Snapshot Restores

Details

Type: New Feature
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: None
Labels: None

Description

Background: MariaDB uses a two-phase commit (2PC) protocol to keep the InnoDB storage engine and the binary log, which acts as the transaction coordinator (TC), in sync. During a snapshot-based backup, if BACKUP STAGE BLOCK_COMMIT is used, transactions can be "trapped" in the PREPARED state in InnoDB without a corresponding commit record in the binary log.
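The state described above can be modeled with a small Python sketch. It is only an illustration under simplified assumptions: InnoDB is reduced to a set of prepared transaction IDs, the binary log to the set of IDs it recorded as committed, and the function and key names are invented here rather than taken from the MariaDB source.

```python
from typing import Dict, Optional, Set


def classify_prepared(prepared_xids: Set[int],
                      binlog_xids: Optional[Set[int]]) -> Dict[str, Set[int]]:
    """Split prepared transactions into commit, rollback and in-doubt sets.

    binlog_xids is None when the binary log (the transaction coordinator)
    is missing, which models a snapshot restored without its binary log.
    All names are hypothetical and exist only for this illustration.
    """
    if binlog_xids is None:
        # No coordinator record survives, so no transaction can be decided.
        return {"commit": set(), "rollback": set(),
                "in_doubt": set(prepared_xids)}
    return {
        # Prepared in the engine and recorded by the coordinator.
        "commit": prepared_xids & binlog_xids,
        # Prepared in the engine but never recorded by the coordinator.
        "rollback": prepared_xids - binlog_xids,
        "in_doubt": set(),
    }
```

With the coordinator present every prepared transaction gets a decision; with `None` all of them stay undecided, which is the situation this issue is about.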

Current Behavior: When such a snapshot is restored and the binary log is missing, the server detects the in-doubt transactions in InnoDB at startup. To prevent data drift between the storage engine and the coordinator, the server applies a "safety brake":

It logs: [ERROR] Found X prepared transactions! ... You have to start server with --tc-heuristic-recover switch.
The process aborts. In Kubernetes or other automated environments, this leads to an endless CrashLoopBackOff that only manual intervention can resolve.
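As an illustration of how a supervisor could tell this failure apart from other crash causes, the following Python sketch scans error-log text for the fragment of the message quoted above. The function name is hypothetical, and the pattern matches only the quoted fragment because the full message text is elided in this report.

```python
import re
from typing import Optional

# Matches only the part of the message quoted in this issue.
_PREPARED_RE = re.compile(r"\[ERROR\].*?Found (\d+) prepared transactions!")


def prepared_count_from_log(log_text: str) -> Optional[int]:
    """Return the number of prepared transactions reported in the log.

    Returns None when no matching [ERROR] line is present.
    """
    for line in log_text.splitlines():
        match = _PREPARED_RE.search(line)
        if match:
            return int(match.group(1))
    return None
```

A wrapper script could call this on the captured error log after a failed start and alert an operator, rather than letting the pod restart blindly.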

Requirement: The server should be capable of an "Automated Self-Heal" when it detects that the Transaction Coordinator (Binary Log) is unavailable or incomplete.

Proposed Logic Change: Introduce a new configuration variable (e.g., --tc-auto-heuristic-recover=ROLLBACK|COMMIT|OFF) or modify the existing startup logic to handle the following:

Detection: The server finds internal PREPARED transactions in InnoDB, but the configured TC log (binary log) is missing or cannot be initialized.
Action: Instead of aborting, the server should apply a pre-configured heuristic decision (defaulting to ROLLBACK to ensure consistency with the missing logs).
Execution: The server should resolve the transactions, log a [WARNING] instead of an [ERROR], and proceed to a full READY state for connections.
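The three steps above can be sketched as a single decision function. This is a Python model of the requested behavior, not server code: the enum mirrors the values suggested for the proposed (not yet existing) --tc-auto-heuristic-recover variable, the returned action strings are made up for this example, and treating OFF as "keep today's abort" is an assumption.

```python
from enum import Enum


class AutoRecover(Enum):
    """Values suggested in this issue for the proposed variable."""
    OFF = "OFF"
    ROLLBACK = "ROLLBACK"
    COMMIT = "COMMIT"


def startup_decision(prepared_count: int,
                     tc_log_usable: bool,
                     policy: AutoRecover = AutoRecover.ROLLBACK) -> str:
    """Return the action a server following the proposal would take."""
    if prepared_count == 0:
        # Nothing to resolve.
        return "start"
    if tc_log_usable:
        # The coordinator can decide, so no heuristic is needed.
        return "recover_from_tc_log"
    if policy is AutoRecover.OFF:
        # Assumed meaning of OFF: keep the current abort behavior.
        return "abort"
    if policy is AutoRecover.ROLLBACK:
        return "heuristic_rollback_then_start"
    return "heuristic_commit_then_start"
```

The heuristic branch is reached only when the TC log is unusable, so a healthy restart with an intact binary log is unaffected by the policy.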

Business Justification: In modern cloud-native environments (Kubernetes/OpenShift), manual intervention to provide startup flags is a significant blocker for High Availability (HA) and Disaster Recovery (DR). Automating this recovery ensures that standby nodes can recover from filesystem-level snapshots without administrative overhead.

Issue Links

relates to

MDEV-34705 Improving performance of binary logging by removing the need of syncing it (Closed)
MDEV-36025 backup taken from a replica with optimistic parallel replication fails to restore most of the time (Confirmed)

People

Assignee: Unassigned
Reporter: suresh ramagiri
Votes: 0
Watchers: 7

Dates

Created: 2026-01-28 06:08
Updated: 2026-02-10 02:24
