[MDEV-20946] Hard FTWRL deadlock under user level locks Created: 2019-11-01 Updated: 2021-03-10 Resolved: 2021-03-10 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Locking |
| Affects Version/s: | 10.4 |
| Fix Version/s: | 10.4.19 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sergey Vojtovich | Assignee: | Michael Widenius |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||
| Description |
|
After BACKUP STAGE will likely be affected by this when it receives fixes for some outstanding deadlocks. |
| Comments |
| Comment by Rinat Ibragimov (Inactive) [ 2020-06-01 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The test case in the issue will block even if all statements involving GET_LOCK/RELEASE_LOCK are removed. In that case it will be equivalent to:
Connection con1 issues LOCK TABLES t1 WRITES and stays alive, preventing other locks from being taken. Then connection default issues FLUSH TABLES WITH READ LOCK which waits until UNLOCK TABLES. But mysqltest never get there, it waits on reap until FTWRL statement retires, which never happens. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2020-06-01 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
rinat.ibragimov, it certainly would, but what you suggest is not a deadlock: you have idle LOCK TABLES connection, which is allowed to progress. While in my test they're waiting for each other. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rinat Ibragimov (Inactive) [ 2020-06-01 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
svoj, I should've mentioned explicitly that this is not a server deadlock, but instead a test that will time out even if the bug is fixed. It may be fine as an illustration, but won't work as a test. By the way, was it written by hand? Or it's an output from Random Query Generator? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rinat Ibragimov (Inactive) [ 2020-06-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
A follow-up. Even if user locks are kept in the test, but the order in which "sent" queries are "reaped" changed, the test passes. Here is the changed test body:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kaj Arnö [ 2020-06-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I asked Monty to have a look. He said "I looked at the original test case and I agree that it looks wrong: The original 'reap' should block as there is no reason to assume that the getlock on con1 would allow flush tables to go trough. However the "fixed" test case doesn't make any sense either as the disconnect of con1, before "reap", frees all it's locks. Ok to close unless Svoj has a good argument against it." | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2020-06-04 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Well, an argument comes from gdb I guess:
These two threads represent FTWRL and SELECT GET_LOCK("l1", 10) waiting for each other. Good enough? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rinat Ibragimov (Inactive) [ 2020-06-08 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
svoj, I think I'm starting to understand what you said before. But it's not about an internal error. It's about user convenience. You want to protect users from shooting their foot by combining FTWRL with GET_LOCK's. Currently, if a user tries to make a deadlock by only calling GET_LOCK's in a conflicting way from different connections, they get a warning instead. But this safety net doesn't work when FTWRL or LOCK TABLES are involved. And you want that safety net to be present at all times. Is my understanding correct? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rinat Ibragimov (Inactive) [ 2020-06-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
There is a tight loop in FTWRL code path which tries to acquire read lock again and again. With introduction of another lock attempt from GET_LOCK(), FTWRL becomes a "victim" of the deadlock detector, but since it tries to reacquire lock again, nothing changes. It keeps spinning there. Although it's possible to tune deadlock weights for that particular case, deadlocks in other scenarios may appear. For example, making all weights the same will introduce a deadlock in DROP TABLE. So, to fix the issue with FTWRL spinning in a tight loop, dynamic weights are used. Each time a lock is selected as a victim, its dynamic weight is increased. That doesn't have any effect is that lock is then reset immediately. But if it's going to be reacquired, it may be selected again, as in FTWRL vs GET_LOCK() case. Eventually, its weight becomes high enough that the deadlock detector selects another lock as a target. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2020-11-05 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The pull request looks fine. However, as a reference I think it's good to document that possible ways to fix this bug (if it ever re-appears): 1) Increase weights of mdl locks if killed by deadlock handler. This helps with mdl locks that are re-acquired at once after a deadlock (like FTWRL and table locks) as it ensures that all conflicting mdl locks are aborted eventually.
The solution we are going to use is 1) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2021-03-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Review & commit |