[MDEV-5262] Missing retry after temp error in parallel replication Created: 2013-11-08  Updated: 2014-07-11  Resolved: 2014-07-11

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.5
Fix Version/s: 10.0.13

Type: Bug
Priority: Critical
Reporter: Kristian Nielsen
Assignee: Kristian Nielsen
Resolution: Fixed
Votes: 0
Labels: parallelslave, replication

Issue Links:
Relates
relates to MDEV-4506 MWL#184: Parallel replication of grou... Closed
relates to MDEV-5941 Slave SQL: Error 'Lock wait timeout e... Closed

 Description   

If a transaction fails with a temporary error (such as a deadlock),
replication must retry the transaction.

But when using @@slave_parallel_threads > 0, this retry does not happen.

Doing such a retry correctly is not trivial in the parallel case. The MySQL
5.6 multi-threaded slave, for example, does not do any retries.

The main problem is where to get the events to retry from. Keeping all events
of each transaction in memory is not always possible (e.g. a DELETE FROM
table in row-based replication, which can produce a very large event group).

The best design I could come up with so far is:

  • When an event is queued for replication in a worker thread, remember the
    relay log position of the end of the event, and the start position of the
    initial GTID event of the event group.
  • In case of a temporary error that needs a retry, temporarily set
    thd->wait_for_commit_ptr=NULL (to avoid notifying following transactions
    too early), roll back, then restore thd->wait_for_commit_ptr.
  • Open a new file handle for the relay log file of the starting GTID of the
    event group, and seek to that GTID's position. Start reading and executing
    events from that relay log until we reach and successfully execute the
    event that caused the retry, then switch back to executing events queued in
    memory.
  • When retrying, we need to handle local rotate events (to be able to switch
    to a new relay log). The master does not switch logs within an event group,
    but we still need to detect, and roll back in, the case where the master
    crashed in the middle of writing an event group (we get a rotate+format
    description before the end of the event group).
  • Implement reference counting for relay log files. Whenever an event is
    queued for a worker, increment the reference count for the containing relay
    log file. When an event group has completed, decrement the reference count
    for the relay log files of each event contained in the event group (so
    remember each relay log file used in the event group, and the count of
    events in each of them; an event group can span multiple relay log files).
  • Change automatic relay log purge to use the reference counting, so that we
    do not purge relay log files while a transaction may still need them for
    retry.
  • Implement the proper logic in the worker threads so that they are able to
    execute events either from the queued list or by re-reading them from the
    relay logs. They also need to handle the case where the retry itself fails
    and a second retry is necessary.
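
The worker-side flow described in the list above can be sketched roughly as
follows. This is a toy model in Python, not the server code: Event,
TemporaryError, replay_from_relay_log, run_event_group and the max_retries
limit are all hypothetical names, the relay log is modelled as an in-memory
list rather than a file, and the sketch assumes the first queued event is the
group's initial GTID event. Rotate handling, multiple relay log files and the
wait_for_commit_ptr juggling are left out.

```python
from dataclasses import dataclass


class TemporaryError(Exception):
    """Stand-in for a transient failure such as a deadlock."""


@dataclass(frozen=True)
class Event:
    start_pos: int   # relay log offset where the event starts
    end_pos: int     # relay log offset just past the event
    payload: str


def replay_from_relay_log(relay_log, start_pos, stop_end_pos, apply):
    """Re-execute events from start_pos up to the event ending at stop_end_pos."""
    for ev in relay_log:
        if ev.start_pos < start_pos:
            continue                 # belongs to an earlier event group
        apply(ev)
        if ev.end_pos == stop_end_pos:
            return                   # reached the event that caused the retry
    raise RuntimeError("failed event not found in relay log")


def run_event_group(queued, relay_log, apply, rollback, max_retries=3):
    """Execute one event group; returns the number of retries that were needed."""
    gtid_start_pos = queued[0].start_pos   # assumed: first event is the GTID event
    retries = 0
    i = 0
    while i < len(queued):
        try:
            apply(queued[i])
        except TemporaryError as first_error:
            failed_end_pos = queued[i].end_pos
            while True:
                if retries >= max_retries:
                    raise RuntimeError("retry limit reached") from first_error
                retries += 1
                rollback()
                try:
                    replay_from_relay_log(relay_log, gtid_start_pos,
                                          failed_end_pos, apply)
                    break            # caught up; resume with the in-memory queue
                except TemporaryError:
                    continue         # the retry failed too, so try once more
        i += 1
    return retries
```

Note that a failure during the replay restarts from the GTID position again,
while the target (the event that originally failed) stays the same, so the
worker only returns to the in-memory queue once that event has been executed
successfully.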

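The reference counting and purge rule from the list above could look roughly
like the following. Again this is only an illustrative Python sketch under
stated assumptions: RelayLogRefs, event_queued, group_completed and purgeable
are made-up names, each event group carries its own Counter of events per
relay log file, and purge is assumed to remove only a prefix of the oldest
files and never the file currently being written.

```python
from collections import Counter


class RelayLogRefs:
    """Tracks how many queued events still reference each relay log file."""

    def __init__(self):
        self._refs = Counter()   # relay log file name -> outstanding events

    def event_queued(self, group_usage, log_name):
        """Call when an event from log_name is queued for a worker."""
        self._refs[log_name] += 1
        # Remember per group which files it touched and how many events in
        # each, since an event group can span multiple relay log files.
        group_usage[log_name] += 1

    def group_completed(self, group_usage):
        """Call when an event group has completed (committed or given up)."""
        for log_name, count in group_usage.items():
            self._refs[log_name] -= count
            if self._refs[log_name] == 0:
                del self._refs[log_name]
        group_usage.clear()

    def purgeable(self, logs_oldest_first, active_log):
        """Return the oldest files that no pending transaction may still need."""
        result = []
        for name in logs_oldest_first:
            if name == active_log or self._refs.get(name, 0) > 0:
                break            # keep this file and everything newer
            result.append(name)
        return result
```

With this shape, automatic purge would ask purgeable() instead of deleting
every fully read file, so a file stays on disk for as long as some in-flight
event group might have to re-read it for a retry.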

 Comments   
Comment by Kristian Nielsen [ 2013-11-22 ]

Note that MySQL 5.6 multi-threaded slave does not support retrying transactions
in case of transient errors.

Comment by Kristian Nielsen [ 2014-06-10 ]

Temporarily reassigned to Serg for review.

Comment by Kristian Nielsen [ 2014-07-11 ]

Pushed to 10.0.13.
