Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.0.16, 10.1.3
Description
A user sees an error in a test environment, repeated several times: once in
10.1.3, once in a 10.1 tree with MDEV-7825 fixed, and once in 10.0.16.
The error log shows a transaction failing due to too many retries:
150317 19:34:45 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.
150317 19:34:45 [ERROR] Slave SQL: Deadlock found when trying to get lock; try restarting transaction, Gtid X-Y-Z, Internal MariaDB error code: 1213
150317 19:34:45 [Warning] Slave: Connection was killed Error_code: 1927
150317 19:34:45 [Warning] Slave: Deadlock found when trying to get lock; try restarting transaction Error_code: 1213
In each of the three failure instances, 3 transactions hit these
10-times-retry failures.
(The error is caused by MDEV-7882).
After the errors, the replication threads hang. The hang looks different in
the three instances. In one case, a worker thread was stuck in
mark_start_commit_inner(), which suggests that the GCO list has become
corrupted and contains a loop that the thread is iterating through
infinitely.
The problem turns out to be incorrect GCO lifetime management in the error
case. After an error that requires the slave to stop, the worker threads do
not respect commit order, and this can lead to the GCO being freed too
early. Then, after the GCO has been freed, another worker thread tries to call
mark_start_commit() on it. This way, the wakeup of the transactions in the
following event groups can be lost, causing the hang. Alternatively, the
access-after-free could corrupt the GCO list into a loop, causing the infinite
loop that was seen in one case.
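
The ordering problem described above can be illustrated with a small toy model. This is only a sketch and not the server code: the Gco class, its fields, and simulate() are hypothetical names invented for illustration, and only mark_start_commit() takes its name from this report. The model assumes a group of two transactions and that the following group should be woken once both have called mark_start_commit(). Real freed memory would be read silently; the model raises an exception instead so the bad access is visible.

```python
class Gco:
    """Toy group-commit object shared by the transactions of one group."""

    def __init__(self, size):
        self.size = size        # number of transactions in the group
        self.started = 0        # how many have called mark_start_commit()
        self.freed = False      # set when the object is released

    def mark_start_commit(self):
        # In the real server this would touch freed memory without warning;
        # the model makes the access-after-free explicit.
        if self.freed:
            raise RuntimeError("mark_start_commit() on freed GCO")
        self.started += 1
        # True once the whole group has started committing, which is the
        # point where the following group should be woken up.
        return self.started == self.size


def simulate(respect_commit_order):
    """Return (outcome, next_group_woken) for a two-transaction group."""
    gco = Gco(size=2)
    if respect_commit_order:
        # Both transactions mark their commit before the GCO is freed.
        steps = ["t1_mark", "t2_mark", "free"]
    else:
        # Error path: the later transaction finishes first and frees the
        # GCO while the earlier one still holds a reference to it.
        steps = ["t2_mark", "free", "t1_mark"]
    woken = False
    try:
        for step in steps:
            if step == "free":
                gco.freed = True
            elif gco.mark_start_commit():
                woken = True
    except RuntimeError:
        return ("access-after-free", woken)
    return ("ok", woken)


print(simulate(True))
print(simulate(False))
```

In the ordered run the following group is woken before the GCO goes away. In the out-of-order run the last mark_start_commit() lands on a freed object and the wakeup never happens, which corresponds to the lost-wakeup hang described above.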