[MDEV-27921] Spurious row locks acquired by an insert Created: 2022-02-23 Updated: 2023-04-27 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Data Manipulation - Insert, Locking, Storage Engine - InnoDB |
| Affects Version/s: | 10.2.27 |
| Fix Version/s: | 10.4, 10.5, 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Sasha Pachev | Assignee: | Vladislav Lesin |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Linux |
||
| Description |
|
The issue originally manifested as INSERTs into a small table with a primary key and a unique key taking a long time due to frequent deadlocks in production. We have not yet been able to reproduce it in a test environment. However, we have been able to grab mysqld from the problematic state and explore it. To facilitate the exploration of the core I wrote a Python module to be used with GDB where I ported some common InnoDB routines to Python so we can examine complex InnoDB data structures (https://github.com/spachev/python-innodb-gdb). Unfortunately, I am not able to publicly share the core. However, I can report that by calling innodb_gdb.print_trx_locks() I see that one of the transactions is holding 200+ shared record locks on various unique index entries with different values of the primary key while trying to obtain an exclusive lock on the same index with the same value for the unique. I have verified that the connection at the time was in AUTOCOMMIT mode and no BEGIN or other transaction starting statements were issued by checking the appropriate flags in thd. Is this a bug, or is there potentially a legitimate use case when you could get this to happen while inserting just one record? |
| Comments |
| Comment by Marko Mäkelä [ 2022-02-23 ] | ||||||||||||||
|
spachev, an INSERT into a table that lacks any secondary indexes, InnoDB can use implicit locking. That is, if the hidden system column DB_TRX_ID in a clustered index leaf page record is referring to an active transaction, other transactions will treat that record as explicitly locked. If a locking conflict occurs, then the waiting transaction will convert an implicit lock that is held by the conflicting transaction into an explicit one, and then create a waiting lock request for itself. I do not remember off the top of my head whether secondary indexes would always involve explicit locking (such as gap locks). In MDEV-16406 you can find some notes on this. Are you able to repeat this problem with something newer than 10.2.27? In 10.6, the lock_sys was heavily refactored, which could affect the reproducibility of timing-sensitive bugs. Already in 10.3 there are some fixes that are not present in the 10.3 series, such as I think that vlad.lesin knows the locking code best, after fixing | ||||||||||||||
| Comment by Marko Mäkelä [ 2022-02-23 ] | ||||||||||||||
|
It would be useful to know what the indexes in the table look like. In InnoDB, each index is kind of a table on its own. The more indexes, the better chance for locking conflicts. | ||||||||||||||
| Comment by Sasha Pachev [ 2022-02-24 ] | ||||||||||||||
|
Columns and the table renamed to avoid possibly revealing sensitive information. The problem index is "name". Another detail - when the deadlock happens, one of the shared row locks currently being held by the transaction is for the same index record as the exclusive lock that it has obtained. The workload that leads to this involves a number of INSERT IGNORE with the result being the record not being inserted due to the duplicate key conflict on name. | ||||||||||||||
| Comment by Sasha Pachev [ 2022-02-24 ] | ||||||||||||||
|
Another detail - all of the locks, all the spurious shared locks as well as the one exclusive lock have info bits 0x20, which is deleted mark. | ||||||||||||||
| Comment by Elena Stepanova [ 2022-04-08 ] | ||||||||||||||
The question is still relevant. 10.2 is coming to EOL soon, it looks unlikely that the problem would be fixed till then, so unless it's happening on higher versions, it will be discarded as irrelevant after we stop releasing 10.2. | ||||||||||||||
| Comment by Sasha Pachev [ 2022-04-08 ] | ||||||||||||||
|
No, because this case happens under a very particular production load pattern, and we are not running anything newer than 10.2 with those patterns in production. | ||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-10 ] | ||||||||||||||
|
spachev, did you check the lock mode of those 200+ record lock bitmaps? Did you try using the READ COMMITTED isolation level? I am wondering if the problematic locks are gap locks, and whether this bug could have been fixed in |