Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6, 10.11
Fix Version/s: None
Description
After MDEV-34042, there is still one rare sporadic failure left in
rpl.rpl_parallel_multi_domain_xa, which is caused by another bug in XA
replication.
The scenario is the following, with transactions T and U:
1. U does an XA PREPARE, completes the prepare step, but has not yet called
slave_applier_reset_xa_trans(), which clears trx->mysql_thd in InnoDB through
ha_close_connection().
2. T gets a lock conflict against U, calls thd_rpl_deadlock_check(),
inspects the THD::rgi_slave of U, and decides to deadlock kill U, but has not
yet called slave_background_kill_request().
3. U wakes up, clears trx->mysql_thd inside InnoDB, starts do_record_gtid().
4. T wakes up, completes the deadlock kill of U.
5. U gets deadlock killed during do_record_gtid() and replication breaks (see the sketch after this list).
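To make the race window concrete, here is a minimal standalone sketch of the
interleaving. This is not MariaDB code: applier_U(), checker_T(), trx_owner and
kill_requested are made-up names standing in for U's applier thread, T's
deadlock check, trx->mysql_thd and the deferred slave_background_kill_request(),
and the sleeps merely force the problematic ordering for illustration.

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<const char *> trx_owner{"U"};  // plays the role of trx->mysql_thd
std::atomic<bool> kill_requested{false};   // the pending deadlock kill of U

// U: completes the XA PREPARE, detaches its THD, then goes on to update
// mysql.gtid_slave_pos (do_record_gtid()).
void applier_U()
{
  std::this_thread::sleep_for(std::chrono::milliseconds(10));
  trx_owner.store(nullptr);                // step 3: clear trx->mysql_thd
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  if (kill_requested.load())               // step 5: the kill lands inside the GTID update
    std::cout << "U killed during GTID update -> replication breaks\n";
}

// T: hits a lock conflict with U and decides to deadlock kill it.
void checker_T()
{
  if (trx_owner.load())                    // step 2: U still looks like the lock owner
    std::cout << "T decides to deadlock kill U\n";
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
  kill_requested.store(true);              // step 4: the kill is delivered too late
}

int main()
{
  std::thread t(checker_T);
  std::thread u(applier_U);
  t.join();
  u.join();
  return 0;
}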
This is occasionally seen as a failure in rpl.rpl_parallel_multi_domain_xa
with this in the server error log:
2024-05-03 20:06:30 12 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1927: Connection was killed, Error_code: 1927; the event's master log master-bin.000001, end_log_pos 1620, Gtid 0-1-4, Internal MariaDB error code: 1942
This failure is hard to reproduce, but I have pushed to branch
bb-10.6-knielsen a testcase that triggers it more reliably (it uses my_sleep()
and probably needs improvement before being pushed to the main tree).
The root cause is MDEV-32020: replication of XA PREPARE leaves InnoDB
row locks behind after completing the transaction. This breaks
thd_rpl_deadlock_check(), which needs to know that the other_thd will not
complete its transaction while the deadlock kill proceeds. That is not
supposed to be possible, since the other_thd cannot release its locks while
thd_rpl_deadlock_check() is running; but with the leftover XA PREPARE locks,
the applier thread completes and moves on anyway.
One possible fix is this, which is also pushed to branch bb-10.6-knielsen:
diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
index e6068f51e04..5ba6df9ac57 100644
--- a/storage/innobase/trx/trx0trx.cc
+++ b/storage/innobase/trx/trx0trx.cc
@@ -554,7 +554,16 @@ void trx_disconnect_prepared(trx_t *trx)
   trx->read_view.close();
   trx_sys.trx_list.freeze();
   trx->is_recovered= true;
+  /*
+    Take the lock_sys.wait_mutex around releasing the THD associated with the
+    prepared transaction. This way, lock_wait() will see a consistent picture
+    while processing locking state: either it runs before clearing the
+    trx->mysql_thd, and sees the lock owned by a specific THD; or it runs after
+    and sees a lock not associated with THD (but not some mixture).
+  */
+  mysql_mutex_lock(&lock_sys.wait_mutex);
   trx->mysql_thd= NULL;
+  mysql_mutex_unlock(&lock_sys.wait_mutex);
   trx_sys.trx_list.unfreeze();
   /* todo/fixme: suggest to do it at innodb prepare */
   trx->will_lock= false;
I personally do not like this patch; it is just a random hack inside InnoDB
to fix what is fundamentally a design problem in XA and replication. At
least it only affects XA transactions, if I understand the InnoDB code
correctly.
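For what it is worth, the synchronization guarantee the patch relies on can be
shown in isolation. Below is a minimal standalone sketch, not InnoDB code:
owner_mutex, Owner, detach() and inspect() are made-up names standing in for
lock_sys.wait_mutex, the THD pointed to by trx->mysql_thd, the change in
trx_disconnect_prepared() and the lock_wait() reader. Because the writer clears
the owner pointer only while holding the same mutex the reader holds while
inspecting it, the reader either sees the owner and can safely act on it, or
sees no owner at all, never a mixture.

#include <iostream>
#include <mutex>
#include <string>
#include <thread>

struct Owner { std::string name; };

std::mutex owner_mutex;            // plays the role of lock_sys.wait_mutex
Owner owner_obj{"applier THD"};
Owner *owner= &owner_obj;          // plays the role of trx->mysql_thd

// Writer side: detach the owner only while holding the mutex, like the patch
// does around "trx->mysql_thd= NULL".
void detach()
{
  std::lock_guard<std::mutex> g(owner_mutex);
  owner= nullptr;
}

// Reader side: under the mutex, either the owner is still there and cannot be
// detached concurrently, or it is already gone -- never a half-way state.
void inspect()
{
  std::lock_guard<std::mutex> g(owner_mutex);
  if (owner)
    std::cout << "lock still owned by: " << owner->name << "\n";
  else
    std::cout << "lock no longer associated with a THD\n";
}

int main()
{
  std::thread r(inspect);
  std::thread w(detach);
  r.join();
  w.join();
  return 0;
}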
But since I had to do this analysis anyway to be able to push my unrelated
changes, due to test failures in rpl.rpl_parallel_multi_domain_xa, I'm putting
the writeup and the testcase/RFC patch here; something will eventually need to
be done about this to fix the sporadic test failures in
rpl.rpl_parallel_multi_domain_xa.