  1. MariaDB Server
  2. MDEV-7458

Deadlock in parallel replication can allow following transaction to start replicating too early

Details

    Description

      In parallel replication, if a later transaction T2 blocks an earlier
      transaction T1 on an InnoDB row lock, we deadlock kill T2 and retry it.

      If T2 has already started to commit, it may have called mark_start_commit()
      by the time it is deadlock killed. In retry_event_group(), we call
      unmark_start_commit() before doing the rollback. The idea is that T1 cannot
      reach its own mark_start_commit() until T2 has rolled back, so we are sure
      to get unmark_start_commit() in T2 before mark_start_commit() in T1. This
      way, a following T3 will not start running until the retry of T2 has
      completed.
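
      The intended ordering can be modelled with a toy sketch (Python, not
      MariaDB code). CommitGate, group_size and next_group_may_start() are
      hypothetical names used only for illustration; only mark_start_commit()
      and unmark_start_commit() come from the description above.

```python
class CommitGate:
    """Toy model of the 'started to commit' bookkeeping (hypothetical names)."""

    def __init__(self, group_size):
        self.group_size = group_size   # transactions in the earlier group commit
        self.started = set()           # transactions that called mark_start_commit()

    def mark_start_commit(self, txn):
        self.started.add(txn)

    def unmark_start_commit(self, txn):
        self.started.discard(txn)

    def next_group_may_start(self):
        # T3 may run only once every transaction of the earlier group
        # has started to commit.
        return len(self.started) == self.group_size


def intended_retry_order():
    """Record whether T3 would be released after each step of the intended order."""
    gate = CommitGate(group_size=2)    # assumption: T1 and T2 form one group
    seen = []
    gate.mark_start_commit("T2")       # T2 starts to commit, then is deadlock killed
    seen.append(gate.next_group_may_start())
    gate.unmark_start_commit("T2")     # retry path: unmark first ...
    seen.append(gate.next_group_may_start())
    # ... then roll back, which is what unblocks T1
    gate.mark_start_commit("T1")
    seen.append(gate.next_group_may_start())
    gate.mark_start_commit("T2")       # the retry of T2 reaches commit again
    seen.append(gate.next_group_may_start())
    return seen
```

      In this model T3 is released only at the last step, after the retried
      T2 has started to commit again.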

      But this turns out not to work as expected. The reason is that
      ha_commit_trans() itself does a rollback if the commit fails, so the
      rollback can happen before retry_event_group() gets to call
      unmark_start_commit().

      Thus we can have the following situation:

      1. T2 starts committing and waits in queue_for_group_commit() for T1 to
      also commit.

      2. We detect the deadlock and kill T2. T2 returns an error from
      log_and_order(), and ha_commit_trans() does ha_rollback_trans().

      3. T1 can proceed due to the rollback, and itself does mark_start_commit().

      4. T3 sees that T1 and T2 have both started to commit, and starts executing.

      5. T2 does unmark_start_commit(), but too late. At this point, T2 and T3
      are running in parallel, even though they should not, as they are from
      different group commits on the master.
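
      The steps above can be replayed as an event trace to show where T3 is
      released. This is a simplified sketch, not MariaDB code: replay(), the
      event tuples and group_size are hypothetical, and the bookkeeping is
      reduced to a set of transactions that have started to commit.

```python
def replay(events, group_size=2):
    """Replay (txn, action) events in order.

    Returns the index of the event after which T3 would be released,
    or None if T3 is never released during the trace.
    """
    started = set()
    for i, (txn, action) in enumerate(events):
        if action == "mark_start_commit":
            started.add(txn)
        elif action == "unmark_start_commit":
            started.discard(txn)
        # "rollback" does not touch the bookkeeping; it only unblocks T1.
        if len(started) == group_size:
            return i
    return None


# Order described in the numbered steps: the rollback happens inside the
# failed commit, before the retry code unmarks T2.
buggy = [
    ("T2", "mark_start_commit"),    # step 1
    ("T2", "rollback"),             # step 2
    ("T1", "mark_start_commit"),    # step 3, T3 is released here (step 4)
    ("T2", "unmark_start_commit"),  # step 5, too late
]

# Order the retry code was designed for: unmark before rollback.
intended = [
    ("T2", "mark_start_commit"),
    ("T2", "unmark_start_commit"),
    ("T2", "rollback"),
    ("T1", "mark_start_commit"),
]
```

      With the first trace, replay() reports that T3 is released at the
      event for step 3; with the second it is not released at all during
      the trace.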

      It was first thought that this condition does not cause any user-visible
      problems (after the fix for MDEV-7326). However, MDEV-8302 shows one
      example where this can cause replication to fail. If T2 deletes a row with
      the same unique key value that T3 inserts, then running T3 in parallel
      with T2 can cause T3 to fail with a duplicate key error. Other similar
      scenarios could cause various failures from running T3 too early.
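
      The duplicate key case can be illustrated with a minimal model of a
      unique index. Everything here (UniqueIndex, DuplicateKeyError,
      apply_in_order() and the key value 42) is made up for the example and
      does not reflect InnoDB internals.

```python
class DuplicateKeyError(Exception):
    pass


class UniqueIndex:
    """Set-backed stand-in for a unique key (illustrative only)."""

    def __init__(self, keys):
        self.keys = set(keys)

    def delete(self, key):
        self.keys.discard(key)

    def insert(self, key):
        if key in self.keys:
            raise DuplicateKeyError(key)
        self.keys.add(key)


def apply_in_order(order):
    """Apply T2 (delete key 42) and T3 (insert key 42) in the given order."""
    index = UniqueIndex({42})
    ops = {
        "T2": lambda: index.delete(42),
        "T3": lambda: index.insert(42),
    }
    try:
        for txn in order:
            ops[txn]()
    except DuplicateKeyError:
        return "duplicate key"
    return "ok"
```

      Applying T2 before T3 succeeds, while letting T3 run before the
      (retried) T2 has deleted the row fails with the duplicate key error.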

      Maybe we need a check in ha_commit_trans() so that it does not roll back in
      case of a parallel replication deadlock kill...
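
      One way to picture the suggested check is the sketch below. It is only
      a model of the idea, not a proposed patch: commit_attempt(),
      retry_trace() and the skip_rollback_on_deadlock_kill flag are
      hypothetical, and the real retry path is reduced to a list of action
      names.

```python
def commit_attempt(killed, is_parallel_deadlock_kill,
                   skip_rollback_on_deadlock_kill):
    """Return the actions a commit attempt performs, in order."""
    if not killed:
        return ["commit"]
    if is_parallel_deadlock_kill and skip_rollback_on_deadlock_kill:
        # Assumed behaviour of the suggested check: leave the rollback
        # to the retry code instead of doing it inside the failed commit.
        return []
    return ["rollback"]


def retry_trace(skip_rollback_on_deadlock_kill):
    """Actions seen for a deadlock-killed T2, from commit start to rollback."""
    trace = ["mark_start_commit"]
    trace += commit_attempt(True, True, skip_rollback_on_deadlock_kill)
    trace.append("unmark_start_commit")   # done by the retry code
    if "rollback" not in trace:
        trace.append("rollback")          # retry code rolls back afterwards
    return trace
```

      With the flag off, the rollback precedes unmark_start_commit(), which
      is the problematic order; with the flag on, the unmark comes first, so
      T1 cannot be unblocked while T2 still counts as having started to
      commit.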

            People

              knielsen Kristian Nielsen