Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Versions: 10.0.17, 10.1.3
Description
This problem was discovered as part of MDEV-7847. However, the two problems are
logically distinct (slave threads hanging vs. excessive transaction retries), so
filing a separate bug to keep them apart.
If conflicting transactions T1 and T2 are run in parallel, then we may need to
deadlock kill T2 if it is holding a row lock that T1 needs. However, there is
no guarantee that T1 will get the lock when T2 is rolled back. If we are
unlucky, T2 may have time to re-take the lock, requiring another deadlock
kill.
In fact, in the scenario that led to the discovery of MDEV-7847, as well as in
testing while working on that bug, we easily saw T2 ending up retrying 10 times
in cases where many conflicting transactions were executed in parallel. This
typically results in replication stopping with an error (10 is the default
maximum number of retries allowed).
In 10.1 "optimistic" mode, this problem is actually taken care of. After the
first deadlock kill of T2, it will execute wait_for_prior_commit() before
making a retry. This ensures that any earlier transactions that might conflict
will be allowed to get the locks and complete before the retry of T2, thus
avoiding the need for multiple retries.
So in "conservative" mode (and in 10.0), we should just do the same wait
before the retry of T2. In conservative mode, conflicts are very rare, so there
is no performance reason not to do it, and it avoids this potential
problem with excessive retries.
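
To illustrate the difference, here is a minimal toy model in Python. It is not the server code, only a sketch under stated assumptions: one row lock, two transactions, and a hypothetical callback t2_wins_race that decides whether T2 re-takes the lock before T1 gets it after a rollback. The names replay_t2, t2_wins_race and MAX_RETRIES are made up for this example; the wait_before_retry flag stands in for calling wait_for_prior_commit() before the retry.

```python
MAX_RETRIES = 10  # the default retry limit quoted in the description


def replay_t2(wait_before_retry, t2_wins_race):
    """Toy model: T1 (earlier in commit order) needs a row lock held by T2.

    wait_before_retry -- if True, T2 waits for T1 to commit before retrying
                         (standing in for wait_for_prior_commit()).
    t2_wins_race      -- hypothetical callback; given the retry number, it
                         returns True if T2 re-takes the lock before T1 does.
    Returns (outcome, number of retries of T2).
    """
    retries = 0
    t1_committed = False
    while not t1_committed:
        # T2 holds the lock T1 needs, so T2 is deadlock killed and rolled back.
        if retries == MAX_RETRIES:
            return "error: retry limit reached", retries
        retries += 1
        if wait_before_retry or not t2_wins_race(retries):
            # T1 gets the lock and commits before T2 runs again.
            t1_committed = True
        # Otherwise T2 re-took the lock first and must be killed again.
    return "ok", retries


if __name__ == "__main__":
    # Always unlucky, no wait: T2 uses up all its retries.
    print(replay_t2(False, lambda n: True))
    # Same unlucky schedule, but T2 waits for T1 first: a single retry.
    print(replay_t2(True, lambda n: True))
```

In this model the wait makes the outcome independent of how the race for the lock turns out, which is the property the proposed fix relies on.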