[MDEV-25594] Crash in deadlock checker under high load Created: 2021-05-04  Updated: 2022-03-25  Resolved: 2021-06-09

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.9, 10.5.10
Fix Version/s: 10.6.1, 10.2.40, 10.3.31, 10.4.21, 10.5.11

Type: Bug Priority: Critical
Reporter: Kyle Joiner (Inactive) Assignee: Marko Mäkelä
Resolution: Fixed Votes: 2
Labels: None
Environment:

Ubuntu Linux


Issue Links:
Duplicate
is duplicated by MDEV-26291 ER_FILSORT_ABORT triggers - crash que... Closed
Problem/Incident
causes MDEV-26123 Using Spatial Indexes results in Upda... Stalled
Relates
relates to MDEV-21987 Assertion failure in file lock0lock.c... Confirmed
relates to MDEV-24738 Improve the InnoDB deadlock checker Closed
relates to MDEV-25125 Draft: InnoDB: Assertion failure in ... Closed
relates to MDEV-25163 Rowid filter does not process storage... Closed
relates to MDEV-26050 InnoDB: Assertion failure in lock0loc... Closed

 Description   

Under a High load with multiple rollbacks a crash is observed at:

2021-05-03 16:51:03 0x7fe56f780700 InnoDB: Assertion failure in file /home/jenkins/workspace/MariaDBE-Custom-DEB/label/ubuntu-1804/MariaDBEnterprise/storage/innobase/lock/lock0lock.cc line 6780



 Comments   
Comment by Marko Mäkelä [ 2021-05-07 ]

In the reported build, line number 6780 is the last one in the following code snippet:

const trx_t*
DeadlockChecker::check_and_resolve(const lock_t* lock, trx_t* trx)
{
        ut_ad(lock_mutex_own());
        ut_ad(trx_mutex_own(trx));
        check_trx_state(trx);

This is similar to the race condition that was introduced in 10.3.4 and fixed in MDEV-25125. I believe that the reason of the crash is that trx->state was changed to TRX_STATE_NOT_STARTED shortly before a deadlock between transactions would have been reported.

Without having more information of the crash, all we can do is conduct very careful source code review. The deadlock checker and the entire lock_sys was heavily refactored in 10.6 (MDEV-24738, MDEV-20612). It might thus be that the 10.6.0 alpha release is not affected by this. I will check all the potentially affected major version branches 10.3, 10.4, 10.6 once I have conducted an analysis on 10.5.

Comment by Marko Mäkelä [ 2021-05-07 ]

Not only the macro check_trx_state() but also the macro assert_trx_nonlocking_or_in_list() and the inline function trx_state_eq() are somewhat inaccurate (sloppy). I would replace them with stricter debug assertions, depending on the context.

The only suspect for this assertion failure remains the assignment to TRX_STATE_NOT_STARTED in trx_t::commit_in_memory(), for an autocommit transactions that was supposed to be non-locking. A source code comment says that it is not protected by any mutex. It remains a mystery why such transactions would participate in the deadlock check at all.

There is the counter trx_t::will_lock (which is only being read as a Boolean flag) that could play a role here. Usually, it would be set to prevent a lazily started transaction from being flagged as read-only:

	if (!trx_is_started(trx)) {
		++trx->will_lock;
	}

The only pieces of code where this field may be set to nonzero on an already started transaction are ha_innobase::check_if_supported_inplace_alter() (ALTER TABLE, CREATE INDEX, DROP INDEX, OPTIMIZE) and ha_innobase::index_read() when using SPATIAL INDEX.

kjoiner, is SPATIAL INDEX involved here? Or was any DDL operation in progress or just completed during the crash?

Comment by Marko Mäkelä [ 2021-05-07 ]

If my suspicion about SPATIAL INDEX is correct, I think that the following should ensure that already started autocommit non-locking transactions will remain non-locking:

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
index b2908fc9819..48cb81ce2c0 100644
--- a/storage/innobase/handler/ha_innodb.cc
+++ b/storage/innobase/handler/ha_innodb.cc
@@ -8804,8 +8804,12 @@ ha_innobase::index_read(
 
 	/* For R-Tree index, we will always place the page lock to
 	pages being searched */
-	if (dict_index_is_spatial(index)) {
-		++m_prebuilt->trx->will_lock;
+	if (index->is_spatial() && !m_prebuilt->trx->will_lock) {
+		if (trx_is_started(m_prebuilt->trx)) {
+			DBUG_RETURN(HA_ERR_READ_ONLY_TRANSACTION);
+		} else {
+			m_prebuilt->trx->will_lock = true;
+		}
 	}
 
 	/* Note that if the index for which the search template is built is not

I would also apply a larger refactoring to improve the debug checks. That refactoring would also remove the assertion that failed; the check would only exist in debug builds. Once a fix has been validated by the customer, I think that it should be applied to 10.3 and 10.4 as well.

Comment by Marko Mäkelä [ 2021-05-07 ]

elenst pointed out that MDEV-21987 is a bug with similar symptoms. It involves versioned tables.

Comment by Marko Mäkelä [ 2021-05-12 ]

kjoiner, do we have any results from the debug executable yet? Any debug assertion failures?

Comment by Kyle Joiner (Inactive) [ 2021-05-17 ]

The debug executable has never crashed.

Comment by Marko Mäkelä [ 2021-05-18 ]

Even though we did not find the root cause of the problem yet, I pushed the SPATIAL INDEX fix to 10.2 and merged up to 10.5. I also pushed the change of trx_t::will_lock to bool to 10.5. These will probably not explain the reported crash. I’m leaving this ticket open until more information is available of the crash. It could be very helpful to have a core dump (along with a copy of the executable and shared libraries that produced it).

Comment by Marko Mäkelä [ 2021-06-09 ]

A development snapshot that did not differ much from the 10.5.10 release and included some fixes was provided to the customer, and the problems did not occur anymore.

We cannot conclude if that is thanks to the cleanup (such as changing trx_t::will_lock from a counter to bool) or due to other changes between 10.5.9 and 10.5.10, possibly in the SQL layer, so that a transaction abort would no longer be ignored. (One example of that is in MDEV-21987.) I think that we can close this for now anyway.

Comment by Marko Mäkelä [ 2021-07-27 ]

I ported the additional code cleanup from 10.5 to earlier major versions.

Comment by Daniel Black [ 2022-03-25 ]

mosmani, that's something different.

Please create a new bug report with a more complete backtrace.

This assertion in btr_check_blob_fil_page_type isn't something I've found in existing issues.

Please include details of the sorts of SQL operations performed on blob columns especially if the error log doesn't include a SQL query.

Generated at Thu Feb 08 09:38:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.