[MDEV-7458] Deadlock in parallel replication can allow following transaction to start replicating too early

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.15
Fix Version/s: 10.0.17
Component/s: Replication
Labels:
- parallelslave
- replication

Description

In parallel replication, if T2 blocks T1 on an InnoDB row lock, we kill T2 as
the deadlock victim and retry it.

If T2 already started to commit, it might have done mark_start_commit() at the
point where it is deadlock killed. In retry_event_group(), we do
unmark_start_commit() before doing rollback. The idea is that T1 cannot
reach its own mark_start_commit() until T2 does rollback. So we are sure to
get unmark_start_commit() in T2 before mark_start_commit() in T1. This way, a
following T3 will not start running until the retry of T2 has completed.

But this turns out not to work as expected. The reason is that
ha_commit_trans() does a rollback if the commit fails.

Thus we can have the following situation:

1. T2 starts committing; it is waiting in queue_for_group_commit() for T1 to
also commit.

2. We detect the deadlock and kill T2. T2 returns an error from
log_and_order(), and ha_commit_trans() does ha_rollback_trans().

3. T1 can proceed due to the rollback, and itself does mark_start_commit().

4. T3 sees that T1 and T2 both started to commit, and starts executing.

5. T2 does unmark_start_commit(). At this point, T2 and T3 are running in
parallel, even though they should not, as they are from different group
commits on the master.
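
The interleaving in steps 1 to 5 can be replayed with a toy counter model. This is only a sketch, not server code. It assumes that T3 is released as soon as a counter of "started to commit" transactions from the earlier group reaches 2 (T1 and T2), and that mark_start_commit() and unmark_start_commit() simply increment and decrement that counter. The name t3_start_step and the event lists are hypothetical.

```python
def t3_start_step(events, needed=2):
    """Return the 1-based step at which T3 would be released, or None.

    Toy model (assumption): T3 may start once `needed` transactions of the
    earlier group have marked that they started to commit.
    """
    count = 0
    for step, (txn, action) in enumerate(events, start=1):
        if action == "mark":        # stands in for mark_start_commit()
            count += 1
        elif action == "unmark":    # stands in for unmark_start_commit()
            count -= 1
        # "rollback" only releases row locks; it does not touch the counter.
        if count >= needed:
            return step
    return None


# Intended ordering: T2 unmarks before its rollback lets T1 proceed.
INTENDED = [("T2", "mark"), ("T2", "unmark"), ("T2", "rollback"), ("T1", "mark")]

# Observed ordering: ha_commit_trans() rolls back first, so T1 marks
# while T2's mark is still counted.
OBSERVED = [("T2", "mark"), ("T2", "rollback"), ("T1", "mark"), ("T2", "unmark")]

print(t3_start_step(INTENDED))
print(t3_start_step(OBSERVED))
```

In this model the intended ordering never releases T3 before the retry of T2, while the observed ordering releases it at the step where T1 marks, before T2 has had a chance to unmark.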

It was first thought that this condition does not cause any user-visible
problems (after the fix for MDEV-7326). However, MDEV-8302 shows one example
where this can cause replication to fail. If T2 deletes a row with the same
unique key value that T3 inserts, then running T3 in parallel with T2 can
cause T3 to fail with a duplicate key error. Other similar scenarios could
cause various failures from running T3 too early.
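
The duplicate key case can be illustrated with a toy unique index. This is a sketch under simplifying assumptions: the index is modelled as a plain set, locking and MVCC are ignored, and T2's delete is taken to be not yet applied (T2 was rolled back and is being retried) when T3 runs. The names replay and DuplicateKeyError are hypothetical.

```python
class DuplicateKeyError(Exception):
    """Raised when an insert hits an existing unique key value."""


def replay(ops, initial_keys):
    """Apply (txn, op, key) operations in order to a toy unique index."""
    keys = set(initial_keys)
    for txn, op, key in ops:
        if op == "delete":
            keys.discard(key)
        elif op == "insert":
            if key in keys:
                raise DuplicateKeyError(f"{txn}: duplicate key {key}")
            keys.add(key)
    return keys


# Order as on the master: T2 deletes the row, then T3 inserts the same key.
IN_ORDER = [("T2", "delete", 1), ("T3", "insert", 1)]

# T3 started too early: its insert runs before the retried T2 has deleted.
TOO_EARLY = [("T3", "insert", 1), ("T2", "delete", 1)]

print(replay(IN_ORDER, {1}))
try:
    replay(TOO_EARLY, {1})
except DuplicateKeyError as exc:
    print("failed:", exc)
```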

Maybe we need a check in ha_commit_trans() to not roll back in case of a
parallel replication deadlock kill...

Issue Links

relates to

MDEV-7326 Server deadlock in connection with parallel replication

Closed

People

Assignee: Kristian Nielsen

Reporter: Kristian Nielsen

Votes: 0

Watchers: 2

Dates

Created: 2015-01-14 15:14

Updated: 2015-08-04 09:40

Resolved: 2015-02-24 16:11

MariaDB Server