MariaDB Server / MDEV-34104

rpl.rpl_parallel_multi_domain_xa sporadic failure due to leaked deadlock kill

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6, 10.11
    • Fix Version/s: 10.6, 10.11
    • Component/s: XA
    • Labels: None

    Description

      After MDEV-34042, there is still one rare sporadic failure left in
      rpl.rpl_parallel_multi_domain_xa, which is caused by another bug in XA
      replication.

      The scenario is the following, with transactions T and U:

      1. U does an XA PREPARE, completes the prepare step, but has not yet called
      slave_applier_reset_xa_trans() which clears trx->mysql_thd in InnoDB through
      ha_close_connection().

      2. T gets a lock conflict against U, calls thd_rpl_deadlock_check(),
      inspects the THD::rgi_slave of U, decides to deadlock kill U, but has not
      yet called slave_background_kill_request().

      3. U wakes up, clears trx->mysql_thd inside InnoDB, starts do_record_gtid().

      4. T wakes up, completes the deadlock kill of U.

      5. U gets deadlock killed during do_record_gtid() and replication breaks.
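
      The interleaving above can be replayed sequentially in a toy model. This
      is only a sketch: ToyThd, ToyTrx and replay_race() are hypothetical
      stand-ins invented for illustration, not the server's THD, trx_t or any
      real replication function. It shows only how a victim pointer read in
      step 2 goes stale by the time the kill is delivered in step 4.

```cpp
#include <cassert>

// Hypothetical stand-in for a replication worker's THD (not the real class).
struct ToyThd {
  bool killed = false;          // set when a deadlock kill is delivered
  bool in_record_gtid = false;  // true while the toy "GTID update" is running
};

// Hypothetical stand-in for an InnoDB transaction object (not the real trx_t).
struct ToyTrx {
  ToyThd *mysql_thd = nullptr;  // owner as seen by the lock system
  bool holds_row_locks = true;  // XA PREPARE keeps its row locks
};

// Replays steps 1-5 in order. Returns true when the kill chosen in step 2
// is delivered after U has already detached and started its GTID update.
bool replay_race() {
  ToyThd u;
  ToyTrx u_trx;

  // Step 1: U has prepared; its THD is still attached to the transaction.
  u_trx.mysql_thd = &u;

  // Step 2: T hits a lock conflict, inspects U and picks it as the victim,
  // but has not delivered the kill yet.
  ToyThd *victim = u_trx.mysql_thd;
  assert(victim != nullptr);

  // Step 3: U detaches from the transaction and starts the GTID update.
  u_trx.mysql_thd = nullptr;
  u.in_record_gtid = true;

  // Step 4: T completes the kill using the pointer it read in step 2.
  victim->killed = true;

  // Step 5: the kill lands on U in the middle of the GTID update, while the
  // prepared transaction still holds its row locks.
  return u.killed && u.in_record_gtid && u_trx.holds_row_locks;
}
```

      Calling replay_race() from a small main() returns true in this model,
      which corresponds to the "Connection was killed" error during the update
      of mysql.gtid_slave_pos.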

      This is occasionally seen as a failure in rpl.rpl_parallel_multi_domain_xa
      with this in the server error log:

      2024-05-03 20:06:30 12 [ERROR] Slave SQL: Error during XID COMMIT: failed to update GTID state in mysql.gtid_slave_pos: 1927:  Connection was killed, Error_code: 1927; the event's master log master-bin.000001, end_log_pos 1620, Gtid 0-1-4, Internal MariaDB error code: 1942
      

      This failure is hard to reproduce, but I have pushed to branch
      bb-10.6-knielsen a testcase that more reliably triggers it (uses my_sleep()
      and probably needs improvement before pushing to main tree).

      The root cause is MDEV-32020: replication of XA PREPARE leaves InnoDB
      row locks held after completing the transaction. This breaks
      thd_rpl_deadlock_check(), which needs to know that the other_thd will not
      complete its transaction while the deadlock kill proceeds. Such a
      completion is not supposed to be possible, since the other_thd cannot
      release its locks while thd_rpl_deadlock_check() is running.

      One possible fix is this, which is also pushed to branch bb-10.6-knielsen:

      diff --git a/storage/innobase/trx/trx0trx.cc b/storage/innobase/trx/trx0trx.cc
      index e6068f51e04..5ba6df9ac57 100644
      --- a/storage/innobase/trx/trx0trx.cc
      +++ b/storage/innobase/trx/trx0trx.cc
      @@ -554,7 +554,16 @@ void trx_disconnect_prepared(trx_t *trx)
         trx->read_view.close();
         trx_sys.trx_list.freeze();
         trx->is_recovered= true;
      +  /*
      +    Take the lock_sys.wait_mutex around releasing the THD associated with the
      +    prepared transaction. This way, lock_wait() will see a consistent picture
      +    while processing locking state: either it runs before clearing the
      +    trx->mysql_thd, and sees the lock owned by a specific THD; or it runs after
      +    and sees a lock not associated with THD (but not some mixture).
      +  */
      +  mysql_mutex_lock(&lock_sys.wait_mutex);
         trx->mysql_thd= NULL;
      +  mysql_mutex_unlock(&lock_sys.wait_mutex);
         trx_sys.trx_list.unfreeze();
         /* todo/fixme: suggest to do it at innodb prepare */
         trx->will_lock= false;
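
      The consistency property described in the patch comment can be sketched
      outside the server as follows. All names here (ToyLockSys, ToyTrx,
      ToyThd, check_and_request_kill(), disconnect_prepared(),
      kill_requested_when()) are hypothetical, and std::mutex merely stands in
      for lock_sys.wait_mutex. The assumption is that the waiter inspects the
      owner and requests the kill inside one critical section, and that the
      holder clears the owner under the same mutex.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical stand-ins; none of these are the server's real types.
struct ToyThd { bool kill_requested = false; };
struct ToyTrx { ToyThd *mysql_thd = nullptr; };
struct ToyLockSys { std::mutex wait_mutex; };  // models lock_sys.wait_mutex

// Waiter side: look at the lock owner and request the kill while holding
// the mutex, so the owner cannot change between the check and the request.
bool check_and_request_kill(ToyLockSys &lock_sys, ToyTrx &holder) {
  std::lock_guard<std::mutex> guard(lock_sys.wait_mutex);
  if (holder.mysql_thd == nullptr)
    return false;  // lock is no longer associated with any THD
  holder.mysql_thd->kill_requested = true;
  return true;
}

// Holder side: detach the THD from the prepared transaction under the
// same mutex, mirroring the shape of the patch.
void disconnect_prepared(ToyLockSys &lock_sys, ToyTrx &trx) {
  std::lock_guard<std::mutex> guard(lock_sys.wait_mutex);
  trx.mysql_thd = nullptr;
}

// Runs the two critical sections in the given order and reports whether
// a kill was requested.
bool kill_requested_when(bool waiter_first) {
  ToyLockSys lock_sys;
  ToyThd u;
  ToyTrx u_trx;
  u_trx.mysql_thd = &u;

  bool requested;
  if (waiter_first) {
    requested = check_and_request_kill(lock_sys, u_trx);
    disconnect_prepared(lock_sys, u_trx);
  } else {
    disconnect_prepared(lock_sys, u_trx);
    requested = check_and_request_kill(lock_sys, u_trx);
  }
  // Either the waiter saw an owner and flagged it, or it saw none.
  assert(requested == u.kill_requested);
  return requested;
}
```

      In this model kill_requested_when(true) returns true and
      kill_requested_when(false) returns false; there is no order in which the
      waiter acts on an owner that has already been cleared. The sketch makes
      no claim about how the server handles a kill that was requested before
      the detach.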
      

      I personally do not like this patch; it is just a random hack inside
      InnoDB to fix what is fundamentally a design problem in XA and
      replication. At least it only affects XA transactions, if I understand
      the InnoDB code correctly.

      But since I had to do this analysis to be able to push my unrelated
      changes, due to test failures in rpl.rpl_parallel_multi_domain_xa, I'm
      putting the writeup and the testcase/RFC patch here, since something
      will eventually need to be done to fix the sporadic test failures in
      rpl.rpl_parallel_multi_domain_xa.

      People

        Assignee: Andrei Elkin (Elkin)
        Reporter: Kristian Nielsen (knielsen)