Details
Bug
Status: Confirmed
Major
Resolution: Unresolved
10.5, 10.6, 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL)
None
Description
I don't know whether anything can be done about this, but I want it at least recorded somewhere, as I spent a lot of time trying to understand why non-concurrent SQL in general, and binlog replay in particular, fail the way they shouldn't.
Single-threaded DDL/DML can occasionally fail with a lock wait timeout error. It is fairly easy to reproduce on 10.5+. I didn't try to make an MTR test case for it, because I can only produce a crude, non-deterministic one, while it is probably fairly simple for InnoDB people to create a synchronized one if necessary.
Apparently, InnoDB purge takes an MDL lock (courtesy of MDEV-16678), and if it holds the lock long enough, DDL waiting for it will fail.
With normal DDL without a WAIT/NOWAIT clause this is an unlikely scenario, as the lock wait timeout variables have high enough values (and if it is observed, they can be increased, so it's not that critical, just very confusing). However, DDL with WAIT, and especially NOWAIT, poses a real problem.
A user application can at least handle the error (hopefully not many real-life applications assume that DDL always runs non-concurrently; if they do, why put WAIT/NOWAIT there to begin with?), and replication deals with it via transaction retries. However, I can't see any way to work around it when replaying a binary log, and with big enough data and a big enough log, it will almost inevitably happen.
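For illustration only, here is a minimal Python sketch of what "handling the error" on the application side might look like: retry the DDL when it fails with error 1205 (ER_LOCK_WAIT_TIMEOUT). The helper `run_ddl_with_retry`, the `LockWaitTimeout` class, the table name `t1`, and the assumption that the driver exposes the error code as an `errno` attribute are all hypothetical; `execute` stands for whatever callable actually sends the statement to the server. Nothing like this is available when a binary log is replayed as a plain stream of statements, which is the case described above.

```python
import time

# MariaDB error code for "Lock wait timeout exceeded; try restarting transaction".
ER_LOCK_WAIT_TIMEOUT = 1205


class LockWaitTimeout(Exception):
    """Hypothetical stand-in for a driver exception carrying errno 1205."""
    errno = ER_LOCK_WAIT_TIMEOUT


def run_ddl_with_retry(execute, statement, attempts=5, delay=0.5):
    """Run `statement` through `execute`, retrying only on lock wait timeouts.

    `execute` is any callable that sends one statement to the server.
    It is assumed (not guaranteed for every driver) that a failure raises
    an exception whose `errno` attribute holds the server error code.
    """
    for attempt in range(1, attempts + 1):
        try:
            return execute(statement)
        except Exception as exc:
            is_lock_timeout = getattr(exc, "errno", None) == ER_LOCK_WAIT_TIMEOUT
            # Re-raise unrelated errors at once, and give up after the last attempt.
            if not is_lock_timeout or attempt == attempts:
                raise
            # Give the competing lock holder (e.g. purge) time to release the MDL.
            time.sleep(delay)


if __name__ == "__main__":
    # Fake executor: fails twice with a lock wait timeout, then succeeds.
    seen = []

    def fake_execute(stmt):
        seen.append(stmt)
        if len(seen) < 3:
            raise LockWaitTimeout()
        return "ok"

    result = run_ddl_with_retry(
        fake_execute, "ALTER TABLE t1 NOWAIT ADD COLUMN c INT", delay=0
    )
    print(result, len(seen))
```

The retry is deliberately limited to error 1205, so that genuine DDL failures still surface immediately.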