[MDEV-25594] Crash in deadlock checker under high load Created: 2021-05-04 Updated: 2022-03-25 Resolved: 2021-06-09 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5.9, 10.5.10 |
| Fix Version/s: | 10.6.1, 10.2.40, 10.3.31, 10.4.21, 10.5.11 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Kyle Joiner (Inactive) | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None | ||
| Environment: |
Ubuntu Linux |
||
| Issue Links: |
|
| Description |
|
Under high load with multiple rollbacks, a crash is observed: 2021-05-03 16:51:03 0x7fe56f780700 InnoDB: Assertion failure in file /home/jenkins/workspace/MariaDBE-Custom-DEB/label/ubuntu-1804/MariaDBEnterprise/storage/innobase/lock/lock0lock.cc line 6780 |
| Comments |
| Comment by Marko Mäkelä [ 2021-05-07 ] | |||||||||||||||||||
|
In the reported build, line number 6780 is the last one in the following code snippet:
This is similar to the race condition that was introduced in 10.3.4 and fixed in Without having more information about the crash, all we can do is conduct a very careful source code review. The deadlock checker and the entire lock_sys were heavily refactored in 10.6 ( | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-05-07 ] | |||||||||||||||||||
|
Not only the macro check_trx_state() but also the macro assert_trx_nonlocking_or_in_list() and the inline function trx_state_eq() are somewhat inaccurate (sloppy). I would replace them with stricter debug assertions, depending on the context. The only suspect for this assertion failure remains the assignment to TRX_STATE_NOT_STARTED in trx_t::commit_in_memory(), for an autocommit transaction that was supposed to be non-locking. A source code comment says that it is not protected by any mutex. It remains a mystery why such transactions would participate in the deadlock check at all. There is the counter trx_t::will_lock (which is only being read as a Boolean flag) that could play a role here. Usually, it would be set to prevent a lazily started transaction from being flagged as read-only:
The only pieces of code where this field may be set to nonzero on an already started transaction are ha_innobase::check_if_supported_inplace_alter() (ALTER TABLE, CREATE INDEX, DROP INDEX, OPTIMIZE) and ha_innobase::index_read() when using SPATIAL INDEX. kjoiner, is SPATIAL INDEX involved here? Or was any DDL operation in progress or just completed during the crash? | |||||||||||||||||||
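To make the role of will_lock in the comment above easier to follow, here is a minimal sketch, assuming a heavily simplified transaction object. SketchTrx and the sketch_* functions are hypothetical names invented for this illustration and are not the InnoDB code; the sketch only models the idea that the flag is consulted when a transaction is lazily started, so setting it afterwards changes how the transaction is classified.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for a transaction object.
// This is NOT the real trx_t; all names here are assumptions.
struct SketchTrx {
  bool started = false;      // has the transaction been started yet?
  bool read_only = false;    // classification decided at start time
  bool auto_commit = true;   // autocommit mode
  bool will_lock = false;    // plain flag rather than a counter
};

// Lazily start the transaction. If will_lock was set beforehand,
// the transaction is not flagged as read-only.
static void sketch_start_if_not_started(SketchTrx &trx) {
  if (trx.started) {
    return;  // already started: the classification is not revisited
  }
  trx.read_only = !trx.will_lock;
  trx.started = true;
}

// An autocommit transaction counts as non-locking only while
// will_lock is unset.
static bool sketch_is_autocommit_non_locking(const SketchTrx &trx) {
  return trx.auto_commit && !trx.will_lock;
}
```

In this sketch, setting will_lock before the start yields a read-write, locking transaction. Setting it on an already started transaction leaves read_only unchanged but makes sketch_is_autocommit_non_locking() return false, which is the kind of after-the-fact change in classification that the comment is asking about.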
| Comment by Marko Mäkelä [ 2021-05-07 ] | |||||||||||||||||||
|
If my suspicion about SPATIAL INDEX is correct, I think that the following should ensure that already started autocommit non-locking transactions will remain non-locking:
I would also apply a larger refactoring to improve the debug checks. That refactoring would also remove the assertion that failed; the check would only exist in debug builds. Once a fix has been validated by the customer, I think that it should be applied to 10.3 and 10.4 as well. | |||||||||||||||||||
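The difference between a check that exists in every build and one that exists only in debug builds can be sketched as follows. The macro names SKETCH_ASSERT and SKETCH_DEBUG_ASSERT and the SKETCH_DEBUG switch are hypothetical and chosen for this illustration; they are not the macros used in the server source.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical always-on assertion: active in every build,
// aborts the process when the condition does not hold.
#define SKETCH_ASSERT(expr)                                      \
  do {                                                           \
    if (!(expr)) {                                               \
      std::fprintf(stderr, "Assertion failure: %s\n", #expr);    \
      std::abort();                                              \
    }                                                            \
  } while (0)

// Hypothetical debug-only assertion: unless SKETCH_DEBUG is defined,
// it expands to nothing, so the condition is not even evaluated.
#ifdef SKETCH_DEBUG
# define SKETCH_DEBUG_ASSERT(expr) SKETCH_ASSERT(expr)
#else
# define SKETCH_DEBUG_ASSERT(expr) ((void) 0)
#endif

// Counts how many times a checked condition was actually evaluated.
static int sketch_checks_evaluated = 0;

static bool sketch_counted_check(bool ok) {
  ++sketch_checks_evaluated;
  return ok;
}

// Runs one always-on check and one debug-only check and returns
// how many of the two conditions were evaluated.
static int sketch_run_checks() {
  sketch_checks_evaluated = 0;
  SKETCH_ASSERT(sketch_counted_check(true));
  SKETCH_DEBUG_ASSERT(sketch_counted_check(true));
  return sketch_checks_evaluated;
}
```

Built without SKETCH_DEBUG, only the always-on check is evaluated, so a condition moved from the first kind of macro to the second can no longer abort a release build. The trade-off is that such a failure is then only visible when a debug executable reproduces the problem.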
| Comment by Marko Mäkelä [ 2021-05-07 ] | |||||||||||||||||||
|
elenst pointed out that MDEV-21987 is a bug with similar symptoms. It involves versioned tables. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-05-12 ] | |||||||||||||||||||
|
kjoiner, do we have any results from the debug executable yet? Any debug assertion failures? | |||||||||||||||||||
| Comment by Kyle Joiner (Inactive) [ 2021-05-17 ] | |||||||||||||||||||
|
The debug executable has never crashed. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-05-18 ] | |||||||||||||||||||
|
Even though we have not found the root cause of the problem yet, I pushed the SPATIAL INDEX fix to 10.2 and merged up to 10.5. I also pushed the change of trx_t::will_lock to bool to 10.5. These will probably not explain the reported crash. I’m leaving this ticket open until more information about the crash is available. It could be very helpful to have a core dump (along with a copy of the executable and shared libraries that produced it). | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-09 ] | |||||||||||||||||||
|
A development snapshot that did not differ much from the 10.5.10 release and included some fixes was provided to the customer, and the problems no longer occurred. We cannot tell whether that is thanks to the cleanup (such as changing trx_t::will_lock from a counter to bool) or due to other changes between 10.5.9 and 10.5.10, possibly in the SQL layer, so that a transaction abort would no longer be ignored. (One example of that is in MDEV-21987.) I think that we can close this for now anyway. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-07-27 ] | |||||||||||||||||||
|
I ported the additional code cleanup from 10.5 to earlier major versions. | |||||||||||||||||||
| Comment by Daniel Black [ 2022-03-25 ] | |||||||||||||||||||
|
mosmani, that's something different. Please create a new bug report with a more complete backtrace. This assertion in btr_check_blob_fil_page_type isn't something I've found in existing issues. Please include details of the sorts of SQL operations performed on blob columns especially if the error log doesn't include a SQL query. |