[MDEV-5262] Missing retry after temp error in parallel replication

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.0.5
Fix Version/s: 10.0.13
Component/s: None
Labels:
- parallelslave
- replication

Description

If a transaction fails with a temporary error (such as a deadlock), replication
must retry the transaction.

But when using @@slave-parallel-threads > 0, this retry does not happen.

It isn't trivial to do such a retry correctly in the parallel case. The MySQL 5.6
multi-threaded slave, for example, does not do any retries.

The main problem is where to get the events to retry from. Keeping all events
of each transaction in memory is not necessarily possible (e.g. DELETE FROM
table in row-based replication, which logs every deleted row).
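
To make the retry requirement concrete, here is a minimal C++ sketch of a bounded retry loop around applying one transaction. It is an illustration only, not MariaDB code: `ApplyResult`, `apply_with_retry`, and the `apply_once` and `rollback` callbacks are hypothetical names. Note that `apply_once` is called again on every attempt, so it must be able to produce the transaction's events from the beginning each time, which is the problem described above.

```cpp
#include <cassert>
#include <functional>

// Hypothetical outcome of applying all events of one transaction.
enum class ApplyResult { Ok, TemporaryError, FatalError };

// Apply a transaction, retrying on temporary errors.
// Makes at most 1 + max_retries attempts. `apply_once` and `rollback`
// are placeholders for the real event execution and rollback code.
// The number of attempts made is stored in *attempts.
inline ApplyResult apply_with_retry(
    const std::function<ApplyResult()>& apply_once,
    const std::function<void()>& rollback,
    unsigned max_retries,
    unsigned* attempts) {
  assert(attempts != nullptr);
  *attempts = 0;
  for (;;) {
    ++*attempts;
    const ApplyResult result = apply_once();
    if (result == ApplyResult::Ok) return result;

    // Undo any partial work before retrying or giving up.
    rollback();

    // Only temporary errors are retried, and only a bounded number of times.
    if (result == ApplyResult::FatalError || *attempts > max_retries)
      return result;
  }
}
```

In a real server the retry limit would presumably come from a setting such as `slave_transaction_retries`; that mapping is an assumption of this sketch.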

The best design I could come up with so far is:

- When an event is queued for replication in a worker thread, remember the
   relay log position of the end of the event, and the start position of the
   initial GTID event of the event group.

- In case of a temporary error that needs a retry, temporarily set
   thd->wait_for_commit_ptr=NULL (to avoid notifying following transactions
   too early), roll back, then restore thd->wait_for_commit_ptr.

- Open a new file handle for the relay log file of the starting GTID of the
   event group, and seek to that GTID's position. Start reading and executing
   events from that relay log until we reach and successfully execute the
   event that caused the retry, then switch back to executing events queued in
   memory.

- When retrying, we need to handle local rotate events (to be able to switch
   to a new relay log). The master does not switch logs within an event group,
   but we still need to detect, and roll back in, the case where the master
   crashed in the middle of writing an event group (we get a rotate and a
   format description event before the end of the event group).

- Implement reference counting for relay log files. Whenever an event is
   queued for a worker, increment the reference count for the containing relay
   log file. When an event group has completed, decrement the reference count
   for the relay log files of each event contained in the event group (so
   remember each relay log file used in the event group, and the count of
   events in each of them; an event group can span multiple relay log files).

- Change automatic relay log purge to use the reference counting, so that we
   do not purge relay log files while a transaction may still need them for
   retry.

- Implement the proper logic in the worker threads so that they are able to
   execute events either from the queued list or by re-reading them from the
   relay logs. We also need to handle the case where the retry fails again
   and a second retry is necessary.
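
The reference-counting and purge steps above can be sketched in C++ roughly as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `RelayLogRefs`, `EventGroup`, and all member names are hypothetical, relay log files are identified by plain strings, and locking between the SQL driver thread and the workers is left out.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical bookkeeping of how many queued events still refer to
// each relay log file.
class RelayLogRefs {
 public:
  // Called whenever an event read from `file` is queued for a worker.
  void on_event_queued(const std::string& file) { ++refs_[file]; }

  // Called when an event group has completed. `used` maps each relay log
  // file touched by the group to the number of the group's events in it.
  void on_group_done(const std::map<std::string, unsigned>& used) {
    for (const auto& entry : used) {
      auto it = refs_.find(entry.first);
      assert(it != refs_.end() && it->second >= entry.second);
      it->second -= entry.second;
      if (it->second == 0) refs_.erase(it);
    }
  }

  // Automatic purge should skip any file that a retry might still re-read.
  bool can_purge(const std::string& file) const {
    return refs_.count(file) == 0;
  }

 private:
  std::map<std::string, unsigned> refs_;
};

// Hypothetical per-group record of which relay log files were used,
// since an event group can span multiple relay log files.
struct EventGroup {
  std::map<std::string, unsigned> used;

  void add_event(RelayLogRefs& refs, const std::string& file) {
    refs.on_event_queued(file);
    ++used[file];
  }
};
```

For example, if one group has events in `relay.000001` and `relay.000002` and a second group has an event only in `relay.000002`, then completing the first group makes `relay.000001` purgeable while `relay.000002` stays protected until the second group is done.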

Issue Links

relates to

MDEV-4506 MWL#184: Parallel replication of group-committed transactions (Closed)

MDEV-5941 Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on DML with slave parallel threads, SBR (Closed)

Activity


Kristian Nielsen created issue - 2013-11-08 12:52

Kristian Nielsen made changes - 2013-11-08 12:52

Field	Original Value	New Value
Link		This issue relates to MDEV-4506 [ MDEV-4506 ]

Kristian Nielsen made changes - 2013-11-15 15:49

Description	(the two-paragraph summary only)	(the summary plus the design proposal, as shown under Description above)

Sergei Golubchik made changes - 2013-11-18 21:52

Fix Version/s		10.0.7 [ 14100 ]
Fix Version/s	10.0.6 [ 13202 ]

Kristian Nielsen made changes - 2013-11-22 12:14

Summary	Missing retry after temp error inn parallel replication	Missing retry after temp error in parallel replication

Kristian Nielsen added a comment - 2013-11-22 12:15

Note that MySQL 5.6 multi-threaded slave does not support retrying transactions
in case of transient errors.


Kristian Nielsen made changes - 2013-12-17 16:41

Fix Version/s		10.1.0 [ 12200 ]
Fix Version/s	10.0.7 [ 14100 ]

Kristian Nielsen made changes - 2013-12-17 16:41

Priority	Major [ 3 ]	Trivial [ 5 ]

Kristian Nielsen made changes - 2014-04-11 12:12

Priority	Trivial [ 5 ]	Major [ 3 ]

Kristian Nielsen made changes - 2014-04-11 12:12

Link		This issue relates to MDEV-5941 [ MDEV-5941 ]

Kristian Nielsen made changes - 2014-04-30 10:33

Status	Open [ 1 ]	In Progress [ 3 ]

Sergei Golubchik made changes - 2014-06-10 10:35

Priority	Major [ 3 ]	Critical [ 2 ]

Sergei Golubchik made changes - 2014-06-10 10:35

Priority	Critical [ 2 ]	Minor [ 4 ]

Kristian Nielsen made changes - 2014-06-10 13:53

Fix Version/s		10.0.13 [ 16000 ]
Fix Version/s	10.1.0 [ 12200 ]

Kristian Nielsen added a comment - 2014-06-10 14:13

Temporarily reassigned to Serg for review.


Kristian Nielsen made changes - 2014-06-10 14:13

Fix Version/s		10.0.12 [ 15201 ]
Fix Version/s	10.0.13 [ 16000 ]
Assignee	Kristian Nielsen [ knielsen ]	Sergei Golubchik [ serg ]
Priority	Minor [ 4 ]	Critical [ 2 ]

Sergei Golubchik made changes - 2014-06-10 21:47

Labels	parallelslave	parallelslave replication

Sergei Golubchik made changes - 2014-06-13 15:06

Workflow	defaullt [ 29635 ]	MariaDB v2 [ 43748 ]

Sergei Golubchik made changes - 2014-06-13 15:08

Status	In Progress [ 3 ]	Stalled [ 10000 ]

Sergei Golubchik made changes - 2014-06-13 15:08

Status	Stalled [ 10000 ]	In Review [ 10002 ]

Sergei Golubchik made changes - 2014-06-16 13:27

Fix Version/s		10.0.13 [ 16000 ]
Fix Version/s	10.0.12 [ 15201 ]

Sergei Golubchik made changes - 2014-06-24 17:49

Fix Version/s		10.0.13 [ 16300 ]
Fix Version/s	10.0 [ 16000 ]

Sergei Golubchik made changes - 2014-07-03 11:33

Assignee	Sergei Golubchik [ serg ]	Kristian Nielsen [ knielsen ]
Status	In Review [ 10002 ]	Stalled [ 10000 ]

Kristian Nielsen made changes - 2014-07-10 16:59

Status	Stalled [ 10000 ]	In Progress [ 3 ]

Kristian Nielsen added a comment - 2014-07-11 13:08

Pushed to 10.0.13.


Kristian Nielsen made changes - 2014-07-11 13:08

Resolution		Fixed [ 1 ]
Status	In Progress [ 3 ]	Closed [ 6 ]

Rasmus Johansson (Inactive) made changes - 2015-05-18 17:51

Workflow	MariaDB v2 [ 43748 ]	MariaDB v3 [ 64484 ]

Sergei Golubchik made changes - 2021-12-06 21:39

Workflow	MariaDB v3 [ 64484 ]	MariaDB v4 [ 147211 ]

People

Assignee: Kristian Nielsen

Reporter: Kristian Nielsen

Votes: 0

Watchers: 5

Dates

Created: 2013-11-08 12:52

Updated: 2014-07-11 13:08

Resolved: 2014-07-11 13:08
