[MDEV-27701] Race on trx->lock.wait_lock between lock_rec_move() and lock_sys_t::cancel() Created: 2022-02-01 Updated: 2023-05-18 Resolved: 2023-02-21 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6.6 |
| Fix Version/s: | 10.11.3, 11.0.1, 10.6.13, 10.8.8, 10.9.6, 10.10.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Matthias Leich | Assignee: | Vladislav Lesin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | race, regression-10.6, rr-profile | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Description |
|
The scenario (see the attached MDEV-27701.test.diff) is the following, at least for 10.6:
1. trx2 acquires a lock on some record.
2. trx3 requests a lock on the record that conflicts with trx2's lock, creates a waiting lock and suspends. The wait then ends by timeout, and trx3 tries to cancel the wait in a lock_sys_t::cancel() call.
3. trx2 starts updating the record. It updates the clustered index record without changing the ordering fields but changing the size of some non-ordering fields. In that case an in-place update is impossible, so a delete+insert combination is executed. Before the record is deleted, its locks are moved to the page infimum; then the record is deleted, the new record is inserted, and the locks are moved from the infimum to the inserted record. See the code between the lock_rec_store_on_page_infimum() and lock_rec_restore_from_page_infimum() calls for details. So trx2 moves trx3's locks from the deleted record to the newly inserted record, changing trx->lock.wait_lock for trx3 in lock_rec_create_low(). lock_rec_create_low() changes trx->lock.wait_lock without locking lock_sys.wait_mutex. I suppose this was done on the assumption that lock_rec_create_low() is invoked only from the thread of the transaction owning the lock. As we can see, lock_rec_move() can be invoked from a different thread.
4. trx3 checks ut_ad(trx->lock.wait_lock == lock) in lock_sys_t::cancel(). Note that the check happens under lock_sys.wait_mutex, which was acquired in lock_wait(). As trx->lock.wait_lock was changed by trx2 when the locks were moved, the assertion fails. |
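The failing assertion can be reproduced deterministically in a simplified, hypothetical model: plain structs stand in for InnoDB's trx_t and lock_t, and move_lock() and cancel_precondition() are invented names, not real InnoDB functions.

```cpp
#include <cassert>

// Single-threaded toy model of the race; all names here are
// illustrative stand-ins, not InnoDB code.
struct Lock;

struct Trx {
  Lock *wait_lock = nullptr;  // models trx->lock.wait_lock
};

struct Lock {
  Trx *trx;
  bool is_waiting;  // models LOCK_WAIT
};

// Models what lock_rec_create_low() does during lock_rec_move(): a new
// lock object is created for the destination record, and for a waiting
// lock trx->lock.wait_lock is repointed to it -- in the buggy code,
// without holding lock_sys.wait_mutex.
Lock *move_lock(Lock *old_lock) {
  Lock *new_lock = new Lock{old_lock->trx, old_lock->is_waiting};
  if (new_lock->is_waiting)
    new_lock->trx->wait_lock = new_lock;  // the racy update
  return new_lock;
}

// Models the failing check in lock_sys_t::cancel():
// ut_ad(trx->lock.wait_lock == lock). The cancelling thread still holds
// a pointer to the old lock object.
bool cancel_precondition(const Trx *trx, const Lock *lock) {
  return trx->wait_lock == lock;
}
```

Once the lock has been moved, the cancelling thread's stored pointer no longer matches trx->lock.wait_lock, which is exactly what the assertion catches.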
| Comments |
| Comment by Marko Mäkelä [ 2022-08-02 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
vlad.lesin, can you debug this? Or would this already be fixed by one of your recent changes?
The assertion failed in Thread 3 for me. The lock had been replaced in Thread 39 executing ROLLBACK:
At this time, the to-be-crashing Thread 3 was already executing on that assertion that would fail moments later:
Because this is an unoptimized build, the read of trx->lock.wait_lock would occur in a separate procedure:
The modification by Thread 39 occurred between the first mov and the add. So, the value had already been modified by the time it was being read. The corresponding source code is:
It is worth noting that Thread 39 was not holding lock_sys.wait_mutex at the time it updated the trx->lock.wait_lock. I may have wrongly assumed in lock_rec_create_low() that waiting locks are always being created for the transaction that is being executed in the current thread. It is definitely not the case here. This entire code in lock_rec_create_low() may need to be revised:
When we are moving a waiting lock request of a ‘foreign’ thread, we must hold lock_sys.wait_mutex. A possible fix would be to make all callers of lock_rec_move() check the return value, which would indicate whether any LOCK_WAIT records were encountered (and initially skipped). If they were, then we would have to retry the operation. Any code like the following would have to be adjusted:
In lock_rec_move() we could mysql_mutex_trylock(&lock_sys.wait_mutex) upon encountering the first LOCK_WAIT object. If it succeeds, fine. If not, the LockMultiGuard latches on the hash table cells must be released, lock_sys.wait_mutex acquired, and the hash table latches reacquired again. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2023-01-18 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The issue affects not only lock_rec_move(), but also lock_move_reorganize_page(), lock_move_rec_list_end(), lock_move_rec_list_start(). | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Lesin [ 2023-02-06 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The solution proposed by Marko above does not work, because in the middle of moving the locks the lock queue becomes invalid: some locks have already been moved, and some waiting locks no longer have their conflicting locks in the queue. If we release the cell latch of such a cell with an invalid lock queue in order to acquire wait_mutex, another thread can add a new lock to the queue, and because the queue is invalid, a wrong decision can be made about whether a transaction must block. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
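The invalid intermediate state can be illustrated with a toy queue model. The names and the "any two transactions' locks conflict" rule are simplifying assumptions, not InnoDB's actual conflict matrix:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy record lock queue; every lock of a different transaction is
// assumed to conflict (hypothetical simplification).
struct RecLock {
  int trx_id;
  bool granted;  // false = waiting
};
using Queue = std::vector<RecLock>;

// Can the waiting lock at index i be granted? In this toy model it can
// iff no other transaction has a lock in the same queue.
bool can_grant(const Queue &q, std::size_t i) {
  for (std::size_t j = 0; j < q.size(); ++j)
    if (j != i && q[j].trx_id != q[i].trx_id)
      return false;
  return true;
}
```

In a consistent queue, trx3's waiting lock is blocked by trx2's granted lock; mid-move, with trx2's lock already transferred away, the same waiting lock looks spuriously grantable, which is the wrong blocking decision the comment describes.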
| Comment by Marko Mäkelä [ 2023-02-06 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
vlad.lesin, right, moving the locks must be atomic. Maybe I thought that the exclusive index tree leaf page latch would prevent any concurrent access to the index record locks on that page; but I realize that when locks are released on transaction commit, no page latch is acquired. How about the following? First, acquire the LockMultiGuard latches and determine whether lock_sys.wait_mutex needs to be acquired. If it does and mysql_mutex_trylock(&lock_sys.wait_mutex) fails, release the latches (without having modified any locks yet) and reacquire everything in the correct order. Finally, while holding the necessary latches, move the locks, then release the latches. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
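A minimal single-threaded sketch of this try-lock/back-off protocol, with a toy Latch type standing in for the real latches; the function name and the bool return are assumptions, and the real code would use LockMultiGuard and mysql_mutex_trylock() on lock_sys.wait_mutex:

```cpp
#include <cassert>

// Toy latch that only records whether it is held, so the protocol can
// be demonstrated single-threaded.
struct Latch {
  bool held = false;
  bool try_lock() { if (held) return false; held = true; return true; }
  void lock() { held = true; }  // a real mutex would block here
  void unlock() { held = false; }
};

// Sketch of the proposed protocol: the caller holds cell_latch and has
// not modified any locks yet. Returns true if the slow path (release
// and reacquire in the correct order) was taken.
bool acquire_wait_mutex(Latch &cell_latch, Latch &wait_mutex) {
  if (wait_mutex.try_lock())
    return false;          // fast path: no latch-order violation
  cell_latch.unlock();     // back off without having modified anything
  wait_mutex.lock();       // correct order: wait_mutex first ...
  cell_latch.lock();       // ... then the cell latch again
  return true;
}
```

The key design point is that the back-off happens before any lock objects are modified, so no other thread can observe a half-moved queue.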
| Comment by Vladislav Lesin [ 2023-02-06 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The issue is that trx->lock.wait_trx can be changed by another thread even if lock_sys.wait_mutex is held by the current thread. lock_sys.wait_mutex is held when something is added to or removed from the waiting queues. On the one hand, each queue belongs to a certain record lock hash cell. On the other hand, each waiting transaction contains a pointer to the waited-for transaction, so we have a wait-for graph, which is used to find deadlocks. The vertices of the graph are all active transactions, and the edges are wait dependencies. lock_sys.wait_mutex is mostly used to keep the graph in a consistent state so that the deadlock detector can traverse it. The deadlock resolver uses Brent's cycle detection algorithm for each transaction in a global set, protected by lock_sys.wait_mutex. The other version of the deadlock resolving function works on one certain transaction instead of all transactions in the set. So lock_sys.wait_mutex protects the waiting queues and the wait-for graph. lock_sys.wait_mutex is also used as the mutex for the condition variable trx->lock.cond in lock_wait(), i.e. when some transaction is woken up after suspending on some record (or table) lock, wait_mutex is held by that transaction.

Summarizing all the above, I see the following way to fix the bug. We could change the locking order and acquire the cell latches before lock_sys.wait_mutex. In this case there is no need to reacquire the cell latches; we can just acquire wait_mutex from lock_rec_create_low() when we want to change trx->lock.wait_lock. But this approach can potentially have a negative performance impact. When we acquire separate cell latches instead of locking one global hash mutex, we distribute lock requests among the hash table array elements and thus decrease the overall contention between threads requesting the latches. If, after acquiring a separate cell latch, we have to wait for the global lock_sys.wait_mutex, we increase the time during which the cell latch is held, increasing the probability that another thread requests the same cell latch. So the overall contention on the hash cell latches is also increased. I would say we need performance testing after implementing this variant. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
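For reference, Brent's cycle detection over such a wait-for graph can be sketched generically as follows. The array representation (each transaction waits for at most one other, -1 meaning "not waiting") is an assumption for illustration, not InnoDB's actual data structure:

```cpp
#include <cassert>
#include <vector>

// Brent's cycle detection on a functional wait-for graph: next[i] is
// the transaction that i waits for, or -1 if i is not waiting.
// Returns true if following the chain from `start` reaches a cycle
// (i.e. a deadlock involving transactions reachable from `start`).
bool leads_to_cycle(const std::vector<int> &next, int start) {
  if (next[start] < 0)
    return false;            // `start` is not waiting for anyone
  int power = 1, lam = 1;    // lam counts steps since the last teleport
  int tortoise = start;
  int hare = next[start];
  while (hare >= 0 && tortoise != hare) {
    if (power == lam) {      // teleport tortoise, double the search window
      tortoise = hare;
      power *= 2;
      lam = 0;
    }
    hare = next[hare];       // hare advances one edge per iteration
    ++lam;
  }
  return hare >= 0;          // hare fell off the graph => no cycle
}
```

Brent's algorithm needs only two pointers and no extra memory per node, which is why it suits a detector that must run under a global mutex.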
| Comment by Matthias Leich [ 2023-02-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
origin/bb-10.6- |