MDEV-34042 (MariaDB Server)

Deadlock kill of XA PREPARE can break replication / rpl.rpl_parallel_multi_domain_xa sporadic failure


Details

    Description

      The XA implementation in 10.5+ first performs the full XA PREPARE step in replication, and only afterwards updates mysql.gtid_slave_pos in a separate transaction.

      If a deadlock kill occurs but the timing is such that the XA PREPARE succeeds anyway, the kill can instead hit the subsequent update of gtid_slave_pos, causing replication to break with the error:

      2024-04-30 17:35:43 21 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1927:  Connection was killed, Error_code: 1927; the event's master log master-bin.000001, end_log_pos 164525, Gtid 2-1-219, Internal MariaDB error code: 1942
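
The ordering problem described above can be illustrated with a small, deterministic toy model. This is only a sketch: `ApplierState`, `xa_prepare`, `update_gtid_slave_pos` and `apply_event_group` are hypothetical names invented for illustration and do not correspond to MariaDB's internal types or functions. It assumes, as is usual for parallel replication, that a kill arriving before the prepare leads to a retry rather than a hard error.

```cpp
#include <cassert>
#include <string>

// Toy stand-in for the applier thread's state; NOT MariaDB's THD.
struct ApplierState {
  bool killed = false;            // set by a (hypothetical) deadlock kill
  bool xa_prepared = false;       // step 1 completed
  bool gtid_pos_updated = false;  // step 2 completed
};

// Step 1: the XA PREPARE. In this model it sees the kill flag only on entry.
bool xa_prepare(ApplierState &s) {
  if (s.killed) return false;  // killed before the prepare: can be retried
  s.xa_prepared = true;
  return true;
}

// Step 2: the separate transaction that records the GTID position.
bool update_gtid_slave_pos(ApplierState &s) {
  if (s.killed) return false;  // a stale kill aborts the position update
  s.gtid_pos_updated = true;
  return true;
}

// Applies one event group and reports how it ended.
std::string apply_event_group(ApplierState &s, bool kill_after_prepare) {
  if (!xa_prepare(s)) return "retry";
  if (kill_after_prepare) s.killed = true;  // kill lands between the two steps
  if (!update_gtid_slave_pos(s)) return "replication broken";
  return "ok";
}
```

In this model, calling `apply_event_group` with `kill_after_prepare` set leaves the transaction prepared but the position unrecorded, which corresponds to the state reported in the error above.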
      

      The suggested fix is to clear any pending deadlock kill after completing the prepare and before updating the GTID position:

      diff --git a/sql/log_event_server.cc b/sql/log_event_server.cc
      index 003774c24aa..7aa43a14b4d 100644
      --- a/sql/log_event_server.cc
      +++ b/sql/log_event_server.cc
      @@ -4547,6 +4547,19 @@ int XA_prepare_log_event::do_commit()
         else
           res= trans_xa_commit(thd);
       
      +  if (thd->rgi_slave->is_parallel_exec)
      +  {
      +    /*
      +      Since the transaction is prepared/committed without updating the GTID pos
      +      (MDEV-32020...), we need here to clear any pending deadlock kill.
      +      Otherwise if the kill happened after the prepare/commit completed, it
      +      might end up killing the subsequent GTID position update, causing the
      +      slave to fail with error.
      +    */
      +    wait_for_pending_deadlock_kill(thd, thd->rgi_slave);
      +    thd->reset_killed();
      +  }
      +
         return res;
       }
       #endif // HAVE_REPLICATION
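
Why the patch waits before resetting the kill flag can be sketched with a second toy model. It rests on an assumption the issue does not spell out: that wait_for_pending_deadlock_kill blocks until a kill that has been scheduled but not yet delivered has actually landed. Without that wait, the flag could be cleared first and the late kill would still hit the position update. All names below (`KillState`, `deliver_kill`, `wait_and_clear_kill`, `gtid_update_allowed`, `commit_with_fix`) are hypothetical and are not MariaDB code.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Toy model only; field names are hypothetical, not MariaDB's.
struct KillState {
  std::mutex mtx;
  std::condition_variable cv;
  bool kill_pending = false;  // a kill has been scheduled but not delivered
  bool killed = false;        // the kill has been delivered
};

// Killer side: deliver the scheduled kill and wake any waiter.
void deliver_kill(KillState &k) {
  std::lock_guard<std::mutex> lock(k.mtx);
  k.killed = true;
  k.kill_pending = false;
  k.cv.notify_all();
}

// Worker side: block until no kill is in flight, then drop the kill flag.
// This models the pair of calls added by the patch.
void wait_and_clear_kill(KillState &k) {
  std::unique_lock<std::mutex> lock(k.mtx);
  k.cv.wait(lock, [&k] { return !k.kill_pending; });
  k.killed = false;
}

// The position update succeeds only if no kill flag is set.
bool gtid_update_allowed(KillState &k) {
  std::lock_guard<std::mutex> lock(k.mtx);
  return !k.killed;
}

// Simulates: prepare done, a kill is in flight, then the position update.
bool commit_with_fix() {
  KillState k;
  k.kill_pending = true;  // kill was scheduled while the prepare ran
  std::thread killer(deliver_kill, std::ref(k));
  wait_and_clear_kill(k);  // returns only after the kill has been delivered
  killer.join();
  return gtid_update_allowed(k);
}
```

Because the worker clears the flag only after `kill_pending` has dropped, `commit_with_fix()` returns true regardless of how the two threads are scheduled.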
      

      This bug causes sporadic failures of the test rpl.rpl_parallel_multi_domain_xa. The failure became much easier to trigger after the patch for MDEV-33798 in bb-10.11-MDEV-33798-knielsen-pkgtest.

          People

            knielsen Kristian Nielsen