[MDEV-33168] XA crash-recovery base on engines prepare first rule
Created: 2024-01-03  Updated: 2024-02-06

Status: In Progress
Project: MariaDB Server
Component/s: Replication, Server, XA
Affects Version/s: 10.4, 10.5, 10.6, 10.11, 11.1, 11.2, 11.3
Fix Version/s: 10.5, 10.6, 10.11, 11.1, 11.2, 11.3

Type: Bug
Priority: Critical
Reporter: Andrei Elkin
Assignee: Andrei Elkin
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-31949 slow parallel replication of user xa In Review

 Description   

This ticket covers binlog-based XA crash recovery, which builds on and complements the MDEV-32830/MDEV-31949 patch. MDEV-32830 refines the binlogging of XA PREPARE so that the XA branches are prepared in the engines first.
The recovery decision largely follows the flow of the normal transaction case:

  • when, at server recovery, an xid exists in the engines and the binlog has recorded an XA completion operation for it, the XA transaction gets completed;
  • when both the engines and the binlog contain the xid in the prepared state, nothing is done;
  • when an xid exists only in the engines' persistent storage, the XA transaction is rolled back.
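
The three rules above can be sketched as a small decision function. This is only an illustration and not the server's code: the names `BinlogState`, `Action` and `decide_recovery` are hypothetical, and the sketch assumes that "completed" means applying whichever completion (XA COMMIT or XA ROLLBACK) the binlog recorded.

```python
from enum import Enum


class BinlogState(Enum):
    """What the binlog knows about an xid (hypothetical model)."""
    ABSENT = "absent"            # no XA event for this xid in the binlog
    PREPARED = "prepared"        # XA PREPARE logged, no completion logged
    COMMITTED = "committed"      # XA COMMIT logged
    ROLLED_BACK = "rolled_back"  # XA ROLLBACK logged


class Action(Enum):
    COMMIT = "commit"
    ROLLBACK = "rollback"
    KEEP_PREPARED = "keep_prepared"


def decide_recovery(binlog_state: BinlogState) -> Action:
    """Decide what to do with an xid that is found prepared in the engine(s)."""
    # Rule 1: the binlog recorded a completion, so the engine follows it
    # (assumption: "completed" means replaying the logged completion).
    if binlog_state is BinlogState.COMMITTED:
        return Action.COMMIT
    if binlog_state is BinlogState.ROLLED_BACK:
        return Action.ROLLBACK
    # Rule 2: prepared on both sides, nothing is done.
    if binlog_state is BinlogState.PREPARED:
        return Action.KEEP_PREPARED
    # Rule 3: engine-only xid, the XA transaction is rolled back.
    return Action.ROLLBACK


if __name__ == "__main__":
    for state in BinlogState:
        print(state.value, "->", decide_recovery(state).value)
```

Modelling the binlog side as a single state per xid keeps the sketch short; a real scan would have to derive that state from the sequence of XA events it reads.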

To resolve the dilemma of whether such an "orphan" (engine-only) XID really missed binlogging on the eve of the crash, or was prepared some time ago (maybe in a previous server incarnation), a Xid_list_log_event is introduced. It contains the xids of the user XA transactions that are in the prepared state at the time of binlog rotation (including a rotation caused by RESET MASTER).
When detected, any such "veteran", still uncommitted XID must remain in the prepared state, despite not being present as an XA event in the binlogs at recovery.
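
A minimal sketch of how the "orphan" versus "veteran" distinction could be applied to engine-only xids. The function name `resolve_orphans` and its parameters are hypothetical; the sketch assumes the veteran xids have already been collected from the XID list event(s) found in the binlogs scanned at recovery.

```python
def resolve_orphans(engine_prepared, binlog_xids, veteran_xids):
    """Split engine-only xids into (to_roll_back, to_keep_prepared).

    engine_prepared: xids found prepared in the engine(s)
    binlog_xids:     xids that appear in XA events of the scanned binlogs
    veteran_xids:    xids listed in the XID list event (assumed pre-collected)
    """
    # Engine-only xids: prepared in the engine, no XA event in the binlogs.
    engine_only = set(engine_prepared) - set(binlog_xids)
    # Veterans were already prepared at binlog rotation, so they stay prepared.
    keep_prepared = engine_only & set(veteran_xids)
    # True orphans missed binlogging before the crash and are rolled back.
    roll_back = engine_only - keep_prepared
    return roll_back, keep_prepared


if __name__ == "__main__":
    roll_back, keep = resolve_orphans(
        engine_prepared={"x1", "x2", "x3"},
        binlog_xids={"x1"},
        veteran_xids={"x2"},
    )
    print("roll back:", sorted(roll_back))
    print("keep prepared:", sorted(keep))
```

Here `x1` is left to the normal binlog-driven decision, `x2` is a veteran that stays prepared, and `x3` is a true orphan.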

This algorithm must comply with the MDEV-21117 semisync slave recovery option.



 Comments   
Comment by Andrei Elkin [ 2024-01-09 ]

The status update is here.

Comment by Andrei Elkin [ 2024-01-10 ]

The recovery-related part IV of the branch is updated:

 a71f13489d0...ec36ddb8a4b HEAD -> bb-10.6-MDEV-31949 (forced update)

The Xid_list_log_event integration is still missing; it is scheduled for tomorrow.

Comment by Andrei Elkin [ 2024-01-12 ]

The recovery-related part IV of the branch is updated, though it is not yet ready for review:

 ec36ddb8a4b...1bdc9bc8077 HEAD -> bb-10.6-MDEV-31949 (forced update)

It extends the semisync slave XA recovery test base, fixes the !HAVE_REPLICATION compilation, et al.
The Xlle integration is scheduled for tomorrow.

Generated at Thu Feb 08 10:36:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.