MariaDB Server / MDEV-7847

"Slave worker thread retried transaction 10 time(s) in vain, giving up", followed by replication hanging

Details

    Description

      A user sees an error in a test environment. It has occurred three times:
      once in 10.1.3, once in a 10.1 tree with MDEV-7825 fixed, and once in 10.0.16.

      The error log shows a transaction failing due to too many retries:

      150317 19:34:45 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
      150317 19:34:45 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid X-Y-Z, Internal MariaDB error code: 1213
      150317 19:34:45 [Warning] Slave: Connection was killed Error_code: 1927
      150317 19:34:45 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213

      In each of the three failure instances, three transactions hit this
      10-retry failure.

      (The error is caused by MDEV-7882).

      After the errors, the replication threads hang. The hang looks different in
      the three instances. In one case, a worker thread was stuck in
      mark_start_commit_inner(), which suggests that the GCO list has become
      corrupted and contains a loop that the thread is iterating through
      infinitely.
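
      To illustrate why a loop in the list makes a traversal spin forever, here
      is a minimal sketch using a toy singly linked list. ToyGco, has_loop() and
      length() are hypothetical names invented for this example; this is not the
      server's GCO structure or the code of mark_start_commit_inner(). The check
      uses Floyd's tortoise-and-hare technique, which could serve as a debugging
      aid under these assumptions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for one node of a GCO list (not the server's struct).
struct ToyGco {
    ToyGco* next = nullptr;
};

// Floyd's cycle detection: returns true if following `next` from `head`
// eventually revisits a node, i.e. a naive walk would never terminate.
bool has_loop(const ToyGco* head) {
    const ToyGco* slow = head;
    const ToyGco* fast = head;
    while (fast && fast->next) {
        slow = slow->next;        // advances one node per step
        fast = fast->next->next;  // advances two nodes per step
        if (slow == fast) {
            return true;          // the pointers can only meet inside a loop
        }
    }
    return false;                 // reached the end of the list
}

// Naive traversal of the kind that hangs on a corrupted list.
// Only safe to call when the list is known to be loop-free.
std::size_t length(const ToyGco* head) {
    assert(!has_loop(head));
    std::size_t n = 0;
    for (const ToyGco* p = head; p; p = p->next) {
        ++n;
    }
    return n;
}
```

      With three nodes linked a, b, c, length() returns 3. If the last node's
      next pointer is redirected to b, has_loop() reports the loop, and the
      plain for loop in length() would otherwise never reach a null pointer.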

      The problem turns out to be incorrect GCO lifetime management in the error
      case. After an error that requires the slave to stop, the worker threads do
      not respect commit order, and this can lead to the GCO being freed too
      early. After the GCO has been freed, another worker thread then tries to
      call mark_start_commit() on it. This way, the wakeup of the transactions in
      the following event groups can be lost, causing the hang. Alternatively,
      the access-after-free can leave the GCO list containing a loop, causing the
      infinite loop that was seen in one case.
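
      The lifetime rule described above can be sketched as a small
      single-threaded model. It is an illustration under stated assumptions, not
      the actual fix: ToyGco, ToyOrderTracker, finish() and safe_to_free() are
      hypothetical names, and event groups are assumed to be numbered 1, 2, 3,
      and so on. The idea is that a GCO may only be released once every event
      group up to its last one has finished in sequence, even if workers report
      completion out of order, as happens in the error case.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical model: a GCO covers event groups first_seq..last_seq.
struct ToyGco {
    std::uint64_t first_seq;
    std::uint64_t last_seq;
};

class ToyOrderTracker {
public:
    // Record that event group `seq` has finished (committed or failed).
    // Calls may arrive out of order.
    void finish(std::uint64_t seq) {
        assert(seq > watermark_);  // a group must not finish twice
        done_.insert(seq);
        // Advance the watermark over any contiguous run of finished groups.
        while (done_.count(watermark_ + 1) != 0) {
            done_.erase(watermark_ + 1);
            ++watermark_;
        }
    }

    // A GCO is safe to release only when all groups up to its last one
    // have finished, so no earlier worker can still reference it.
    bool safe_to_free(const ToyGco& gco) const {
        return watermark_ >= gco.last_seq;
    }

    std::uint64_t watermark() const { return watermark_; }

private:
    std::uint64_t watermark_ = 0;   // all groups 1..watermark_ have finished
    std::set<std::uint64_t> done_;  // finished groups above the watermark
};
```

      In this model, if group 3 finishes first, a GCO covering groups 1 to 3 is
      not yet safe to free; it becomes safe only after groups 1 and 2 have also
      finished. Freeing it as soon as its last group finished would correspond
      to the premature free described above.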

          People

            knielsen Kristian Nielsen