Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
10.5
None
Description
In 10.5 and later, the XA implementation in replication first completes the full XA PREPARE step, and only afterwards updates mysql.gtid_slave_pos in a separate transaction.
If a deadlock kill arrives with timing such that the XA PREPARE succeeds anyway, the kill can instead hit the subsequent update of mysql.gtid_slave_pos, causing replication to break with the error:
2024-04-30 17:35:43 21 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1927: Connection was killed, Error_code: 1927; the event's master log master-bin.000001, end_log_pos 164525, Gtid 2-1-219, Internal MariaDB error code: 1942
Suggested fix is to clear any pending deadlock kill after completing the prepare, and before updating the GTID position:
diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
index 003774c24aa..7aa43a14b4d 100644
--- a/sql/log_event_server.cc
+++ b/sql/log_event_server.cc
@@ -4547,6 +4547,19 @@ int XA_prepare_log_event::do_commit()
   else
     res= trans_xa_commit(thd);
 
+  if (thd->rgi_slave->is_parallel_exec)
+  {
+    /*
+      Since the transaction is prepared/committed without updating the GTID pos
+      (MDEV-32020...), we need here to clear any pending deadlock kill.
+      Otherwise if the kill happened after the prepare/commit completed, it
+      might end up killing the subsequent GTID position update, causing the
+      slave to fail with error.
+    */
+    wait_for_pending_deadlock_kill(thd, thd->rgi_slave);
+    thd->reset_killed();
+  }
+
   return res;
 }
 #endif // HAVE_REPLICATION
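To make the ordering problem concrete, here is a toy model of the sequence described above. It is only a sketch and not MariaDB code: ToyThd, toy_wait_for_pending_kill, toy_update_gtid_pos and toy_apply_xa_prepare are hypothetical names invented for illustration, and the model assumes the worst-case timing where the kill is requested while the prepare is completing and is delivered afterwards.

```cpp
// Toy stand-in for the applier thread state. This is NOT MariaDB's THD;
// both fields are assumptions made for illustration.
struct ToyThd {
  bool killed = false;          // a deadlock kill has been delivered
  bool kill_in_flight = false;  // a kill was requested but not yet delivered
};

// Models waiting until an already-requested kill has been delivered,
// so that it cannot arrive at some later, unexpected point.
void toy_wait_for_pending_kill(ToyThd &thd) {
  if (thd.kill_in_flight) {
    thd.killed = true;
    thd.kill_in_flight = false;
  }
}

// Models the separate transaction that records the GTID position.
// Worst case: an in-flight kill lands during this update and fails it.
bool toy_update_gtid_pos(ToyThd &thd) {
  toy_wait_for_pending_kill(thd);
  return !thd.killed;
}

// Models applying one XA PREPARE event in a parallel worker.
// Returns true if the GTID position update succeeds.
bool toy_apply_xa_prepare(ToyThd &thd, bool clear_kill_after_prepare) {
  // Step 1: the prepare succeeds even though a deadlock kill was
  // requested while it was completing.
  thd.kill_in_flight = true;

  // Step 2: the idea of the suggested fix. Let the pending kill land,
  // then discard it, because the work it targeted is already done.
  if (clear_kill_after_prepare) {
    toy_wait_for_pending_kill(thd);
    thd.killed = false;
  }

  // Step 3: update the GTID position as a separate transaction.
  return toy_update_gtid_pos(thd);
}
```

In this model, calling toy_apply_xa_prepare with clear_kill_after_prepare set to false makes the GTID position update fail, which corresponds to the replication error shown earlier. With the flag set to true the update succeeds. Waiting before resetting matters: resetting the flag alone would not help if the kill were delivered after the reset.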
This bug causes a sporadic failure in the test rpl.rpl_parallel_multi_domain_xa. The failure became much easier to trigger after the patch for MDEV-33798 in bb-10.11-MDEV-33798-knielsen-pkgtest.