MariaDB Server / MDEV-7326

Server deadlock in connection with parallel replication

Details

    Description

      A user testing parallel replication sees the server hang.
      Debugging suggests that the following sequence of events occurs.

      All transactions in one batch of group-committed transactions from the
      master, say T1 and T2, reach their commit phase.

      This causes the transactions in the following batch, say T3, T4, and T5, to
      wake up and start running.

      At this point, T2 (say) gets a deadlock or other temporary error, and needs to
      be rolled back and retried. This involves calling unmark_start_commit(), which
      decrements the count of transactions that have already reached their commit
      step.

      Now when T3, T4, and T5 reach their commit step, they call
      mark_start_commit(). But the following transaction T6 is not woken up,
      because T2 has not yet called mark_start_commit() again.

      The bug is that when T2 has been successfully retried and calls
      mark_start_commit() again, T6 is still not woken up. The wakeup is lost,
      because T2 only considers the batch containing T3-T5 for wakeup, not any
      later batches.
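
      The sequence above can be traced with a small toy model. This is only a
      sketch of the counting scheme as described in this report, not MariaDB
      source code: apart from mark_start_commit() and unmark_start_commit(), every
      name (BATCH_OF, WAIT_COUNT, simulate) is hypothetical, and the assumption
      that each batch waits for a fixed count of started commits is an
      illustration.

```python
# Toy model of the lost wakeup; NOT MariaDB source code.
# Assumption: a batch may start once `count` (transactions that have
# started commit) reaches the total size of all earlier batches.
BATCH_OF = {"T1": 0, "T2": 0, "T3": 1, "T4": 1, "T5": 1, "T6": 2}
WAIT_COUNT = [0, 2, 5]  # threshold for batch 0, batch 1 (T3-T5), batch 2 (T6)


def simulate():
    count = 0                       # transactions that have started commit
    running = [True, False, False]  # batch 0 runs from the start
    log = []

    def mark_start_commit(txn):
        nonlocal count
        count += 1
        # Buggy behaviour described in the report: only the batch
        # immediately following the caller's own batch is considered.
        nxt = BATCH_OF[txn] + 1
        if nxt < len(running) and count >= WAIT_COUNT[nxt]:
            running[nxt] = True
        log.append((txn + " mark", count, list(running)))

    def unmark_start_commit(txn):
        nonlocal count
        count -= 1
        log.append((txn + " unmark", count, list(running)))

    mark_start_commit("T1")
    mark_start_commit("T2")      # count reaches 2, batch T3-T5 wakes up
    unmark_start_commit("T2")    # T2 hits a temporary error and rolls back
    for txn in ("T3", "T4", "T5"):
        mark_start_commit(txn)   # count reaches 4, T6 needs 5
    mark_start_commit("T2")      # retry done: count is 5, but T6 is not checked
    return count, running, log


if __name__ == "__main__":
    for step in simulate()[2]:
        print(step)
```

      In this model the final count satisfies the threshold for T6, yet the batch
      containing T6 is never marked as running, which matches the hang described
      above.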

      It seems fairly certain that this is the scenario that the user
      experienced. It is, however, not yet known how T2 can get a deadlock error
      after it has run all of its queries and has started the commit step. (The
      design assumes that if T2 deadlocks with T1, then T1 is blocked from
      proceeding to mark_start_commit() until T2 has rolled back, and T2 calls
      unmark_start_commit() before its rollback.)

      The fix should ensure that the case of T2 retrying after T3-T5 have started
      running is handled correctly: when T2 completes its retry, all following
      transactions that may be waiting should be considered, so that the wakeup
      is not lost.
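
      A minimal sketch of the proposed direction, under the same toy assumptions
      as the description (a batch starts once enough earlier transactions have
      started commit). This is not the actual patch: ToyScheduler, Batch,
      wake_all_following and run_scenario are hypothetical names introduced only
      to contrast "consider the next batch" with "consider all following
      batches".

```python
from dataclasses import dataclass


@dataclass
class Batch:
    txns: list             # transaction names in this batch
    wait_count: int = 0    # started-commit count required before it may run
    running: bool = False


class ToyScheduler:
    """Toy model only; not MariaDB's data structures."""

    def __init__(self, batches, wake_all_following):
        self.count = 0
        self.wake_all_following = wake_all_following
        self.batches = []
        self.batch_of = {}
        required = 0
        for i, txns in enumerate(batches):
            self.batches.append(Batch(list(txns), required, running=(i == 0)))
            for txn in txns:
                self.batch_of[txn] = i
            required += len(txns)

    def mark_start_commit(self, txn):
        self.count += 1
        i = self.batch_of[txn]
        if self.wake_all_following:
            candidates = self.batches[i + 1:]       # sketched fix
        else:
            candidates = self.batches[i + 1:i + 2]  # behaviour in the report
        for batch in candidates:
            if self.count >= batch.wait_count:
                batch.running = True

    def unmark_start_commit(self, txn):
        self.count -= 1


def run_scenario(wake_all_following):
    """Replay the T1..T6 sequence; return whether T6's batch was woken."""
    s = ToyScheduler([["T1", "T2"], ["T3", "T4", "T5"], ["T6"]],
                     wake_all_following)
    s.mark_start_commit("T1")
    s.mark_start_commit("T2")
    s.unmark_start_commit("T2")      # T2 rolls back for retry
    for txn in ("T3", "T4", "T5"):
        s.mark_start_commit(txn)
    s.mark_start_commit("T2")        # T2 completes its retry
    return s.batches[2].running


if __name__ == "__main__":
    print("next batch only:", run_scenario(False))
    print("all following batches:", run_scenario(True))
```

      Scanning every later batch is the simplest way to express the idea in a
      model; a real implementation would presumably need to do this under the
      appropriate locking and without waking batches prematurely.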

      The user-visible effect of this hang is that at least one replication worker
      thread is stuck in the state "Waiting for prior transaction to start commit
      before starting next transaction", and all other worker threads are stuck in
      either this state or the state "Waiting for prior transaction to commit" (as
      seen in SHOW PROCESSLIST). Killing the worker threads stops replication,
      which can then be restarted successfully.
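
      The symptom above can be expressed as a simple check over the worker thread
      states. The sketch below assumes the states have already been collected
      (for example from SHOW PROCESSLIST output) into a list of strings; the
      function name looks_like_lost_wakeup and the input format are hypothetical.
      The two state strings are the ones quoted above.

```python
WAIT_START = ("Waiting for prior transaction to start commit "
              "before starting next transaction")
WAIT_COMMIT = "Waiting for prior transaction to commit"


def looks_like_lost_wakeup(worker_states):
    """Return True if a snapshot of worker states matches the described hang.

    worker_states: list of state strings, one per replication worker thread
    (hypothetical input format).
    """
    if not worker_states:
        return False
    stuck_states = {WAIT_START, WAIT_COMMIT}
    # At least one worker waits for a prior transaction to start commit,
    # and every worker is in one of the two waiting states.
    return (WAIT_START in worker_states
            and all(state in stuck_states for state in worker_states))


if __name__ == "__main__":
    snapshot = [WAIT_START, WAIT_COMMIT, WAIT_COMMIT]
    print(looks_like_lost_wakeup(snapshot))
```

      A single matching snapshot is not proof of the hang, since workers pass
      through these states during normal operation; the pattern is only
      suggestive if it persists across repeated snapshots.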

            People

              knielsen Kristian Nielsen