[MDEV-7847] "Slave worker thread retried transaction 10 time(s) in vain, giving up", followed by replication hanging - Jira

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.0.16, 10.1.3
Fix Version/s: 10.0.18, 10.1.4
Component/s: Replication
Labels:
- parallelslave
- replication

Description

A user sees an error in a test environment, repeated several times: once in
10.1.3, once in a 10.1 tree with ~~MDEV-7825~~ fixed, and once in 10.0.16.

The error log shows a transaction failing due to too many retries:

150317 19:34:45 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

150317 19:34:45 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid X-Y-Z, Internal MariaDB error code: 1213

150317 19:34:45 [Warning] Slave: Connection was killed Error_code: 1927

150317 19:34:45 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
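
To count how many transactions hit the retry limit in a log like the one above, the lines can be split into fields. The following is a minimal sketch that assumes every entry has the `YYMMDD HH:MM:SS [Level] message` shape seen in these four lines; `parse_line` and the two regular expressions are hypothetical helpers written for this illustration, not part of MariaDB.

```python
import re

# Assumed layout, taken only from the four log lines quoted in this report.
LINE_RE = re.compile(
    r"^(?P<date>\d{6}) (?P<time>\d{2}:\d{2}:\d{2}) \[(?P<level>\w+)\] (?P<message>.*)$"
)
# The quoted lines spell the code two ways: "Error_code: N" and "error code: N".
CODE_RE = re.compile(r"(?:Error_code|error code): (\d+)")


def parse_line(line):
    """Return a dict with date, time, level, message and code, or None if no match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    code = CODE_RE.search(entry["message"])
    entry["code"] = int(code.group(1)) if code else None
    return entry


def count_retry_failures(lines):
    """Count entries reporting that a worker gave up after retrying in vain."""
    entries = (parse_line(line) for line in lines)
    return sum(1 for e in entries if e and "in vain, giving up" in e["message"])
```

Feeding the four lines quoted above to `count_retry_failures` counts one exhausted transaction; the deadlock entries carry code 1213 and the killed-connection entry carries code 1927.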

In each of the three failure instances, 3 transactions hit this
10-times-retry failure.

(The error is caused by ~~MDEV-7882~~.)

After the errors, the replication threads hang. The hang looks different in
the three instances. In one case, a worker thread was stuck in
mark_start_commit_inner(), which suggests that the GCO list has become
corrupted and contains a loop that the thread is iterating through
infinitely.

The problem turns out to be incorrect GCO lifetime management in the error
case. After an error that requires the slave to stop, the worker threads do
not respect commit order, and this can lead to the GCO being freed too
early. After the GCO has been freed, another worker thread then tries to call
mark_start_commit() on it. This way, the wakeup of the transactions in
following event groups can be lost, causing the hang. Alternatively, the
access-after-free can leave the GCO list containing a loop, causing the
infinite loop that was seen in one case.
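
The ordering hazard can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server code and not the actual fix: the `Gco` class, `run`, and both freeing rules are hypothetical. It assumes a GCO is freed when the last transaction of its group finishes, which is safe only if transactions finish in commit order.

```python
class Gco:
    """Toy stand-in for a GCO entry (hypothetical; not MariaDB's structure)."""

    def __init__(self, group_size):
        self.group_size = group_size
        self.started = 0               # transactions that called mark_start_commit()
        self.finished = 0              # transactions that have completed
        self.freed = False
        self.next_group_woken = False  # wakeup for the following event group

    def mark_start_commit(self):
        if self.freed:
            # Real C++ code would silently touch freed memory here instead.
            raise RuntimeError("mark_start_commit() on a freed GCO")
        self.started += 1
        if self.started == self.group_size:
            self.next_group_woken = True


def free_when_last_in_group_finishes(gco, txn):
    # Assumed rule: safe only when transactions finish in commit order.
    return txn == gco.group_size


def free_when_all_finished(gco, txn):
    # Order-independent alternative: count completions instead.
    return gco.finished == gco.group_size


def run(gco, finish_order, free_rule):
    """Transactions are numbered 1..group_size in commit order.

    Each one calls mark_start_commit() and then finishes, in finish_order.
    """
    for txn in finish_order:
        gco.mark_start_commit()
        gco.finished += 1
        if free_rule(gco, txn):
            gco.freed = True
    return gco
```

With `finish_order=[1, 2, 3]` the first rule works. With `[3, 1, 2]`, as can happen once workers stop respecting commit order, transaction 3 frees the GCO and transaction 1 then calls `mark_start_commit()` on it. The model raises at that point, whereas real code would lose the wakeup or corrupt the list. The counting rule tolerates either order.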

People

Assignee: Kristian Nielsen

Reporter: Kristian Nielsen

Votes: 0

Watchers: 1

Dates

Created: 2015-03-26 15:30

Updated: 2015-03-30 17:50

Resolved: 2015-03-30 17:50
