[MDEV-25594] Crash in deadlock checker under high load - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.5.9, 10.5.10
Fix Version/s: 10.6.1, 10.2.40, 10.3.31, 10.4.21, 10.5.11
Component/s: Storage Engine - InnoDB
Labels:
None
Environment:
Ubuntu Linux

Description

Under a High load with multiple rollbacks a crash is observed at:

2021-05-03 16:51:03 0x7fe56f780700 InnoDB: Assertion failure in file /home/jenkins/workspace/MariaDBE-Custom-DEB/label/ubuntu-1804/MariaDBEnterprise/storage/innobase/lock/lock0lock.cc line 6780

Attachments

Issue Links

causes

MDEV-26123 Using Spatial Indexes results in Update locks cannot be acquired during a READ UNCOMMITTED transaction

Stalled

is duplicated by

MDEV-26291 ER_FILSORT_ABORT triggers - crash query when "for update" in a select statement

Closed

relates to

MDEV-21987 Assertion failure in file lock0lock.cc line 6901 / DeadlockChecker::check_and_resolve

Confirmed

MDEV-24738 Improve the InnoDB deadlock checker

Closed

MDEV-25125 Draft: InnoDB: Assertion failure in trx0i_s.cc line 1167 upon SELECT * FROM information_schema.innodb_trx

Closed

MDEV-25163 Rowid filter does not process storage engine error correctly.

Closed

MDEV-26050 InnoDB: Assertion failure in lock0lock.cc line 6942

Closed

(2 relates to)

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2021-05-07 06:36

In the reported build, line number 6780 is the last one in the following code snippet:

const trx_t*

DeadlockChecker::check_and_resolve(const lock_t* lock, trx_t* trx)

        ut_ad(lock_mutex_own());

        ut_ad(trx_mutex_own(trx));

        check_trx_state(trx);

This is similar to the race condition that was introduced in 10.3.4 and fixed in ~~MDEV-25125~~. I believe that the reason of the crash is that trx->state was changed to TRX_STATE_NOT_STARTED shortly before a deadlock between transactions would have been reported.

Without having more information of the crash, all we can do is conduct very careful source code review. The deadlock checker and the entire lock_sys was heavily refactored in 10.6 (~~MDEV-24738~~, ~~MDEV-20612~~). It might thus be that the 10.6.0 alpha release is not affected by this. I will check all the potentially affected major version branches 10.3, 10.4, 10.6 once I have conducted an analysis on 10.5.

Marko Mäkelä added a comment - 2021-05-07 06:36 In the reported build, line number 6780 is the last one in the following code snippet: const trx_t* DeadlockChecker::check_and_resolve( const lock_t* lock, trx_t* trx) { ut_ad(lock_mutex_own()); ut_ad(trx_mutex_own(trx)); check_trx_state(trx); This is similar to the race condition that was introduced in 10.3.4 and fixed in MDEV-25125 . I believe that the reason of the crash is that trx->state was changed to TRX_STATE_NOT_STARTED shortly before a deadlock between transactions would have been reported. Without having more information of the crash, all we can do is conduct very careful source code review. The deadlock checker and the entire lock_sys was heavily refactored in 10.6 ( MDEV-24738 , MDEV-20612 ). It might thus be that the 10.6.0 alpha release is not affected by this. I will check all the potentially affected major version branches 10.3, 10.4, 10.6 once I have conducted an analysis on 10.5.

Marko Mäkelä added a comment - 2021-05-07 08:20

Not only the macro check_trx_state() but also the macro assert_trx_nonlocking_or_in_list() and the inline function trx_state_eq() are somewhat inaccurate (sloppy). I would replace them with stricter debug assertions, depending on the context.

The only suspect for this assertion failure remains the assignment to TRX_STATE_NOT_STARTED in trx_t::commit_in_memory(), for an autocommit transactions that was supposed to be non-locking. A source code comment says that it is not protected by any mutex. It remains a mystery why such transactions would participate in the deadlock check at all.

There is the counter trx_t::will_lock (which is only being read as a Boolean flag) that could play a role here. Usually, it would be set to prevent a lazily started transaction from being flagged as read-only:

	if (!trx_is_started(trx)) {

		++trx->will_lock;

The only pieces of code where this field may be set to nonzero on an already started transaction are ha_innobase::check_if_supported_inplace_alter() (ALTER TABLE, CREATE INDEX, DROP INDEX, OPTIMIZE) and ha_innobase::index_read() when using SPATIAL INDEX.

kjoiner, is SPATIAL INDEX involved here? Or was any DDL operation in progress or just completed during the crash?

Marko Mäkelä added a comment - 2021-05-07 08:20 Not only the macro check_trx_state() but also the macro assert_trx_nonlocking_or_in_list() and the inline function trx_state_eq() are somewhat inaccurate (sloppy). I would replace them with stricter debug assertions, depending on the context. The only suspect for this assertion failure remains the assignment to TRX_STATE_NOT_STARTED in trx_t::commit_in_memory() , for an autocommit transactions that was supposed to be non-locking. A source code comment says that it is not protected by any mutex. It remains a mystery why such transactions would participate in the deadlock check at all. There is the counter trx_t::will_lock (which is only being read as a Boolean flag) that could play a role here. Usually, it would be set to prevent a lazily started transaction from being flagged as read-only: if (!trx_is_started(trx)) { ++trx->will_lock; } The only pieces of code where this field may be set to nonzero on an already started transaction are ha_innobase::check_if_supported_inplace_alter() ( ALTER TABLE , CREATE INDEX , DROP INDEX , OPTIMIZE ) and ha_innobase::index_read() when using SPATIAL INDEX . kjoiner , is SPATIAL INDEX involved here? Or was any DDL operation in progress or just completed during the crash?

Marko Mäkelä added a comment - 2021-05-07 09:31

If my suspicion about SPATIAL INDEX is correct, I think that the following should ensure that already started autocommit non-locking transactions will remain non-locking:

diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc

index b2908fc9819..48cb81ce2c0 100644

--- a/storage/innobase/handler/ha_innodb.cc

+++ b/storage/innobase/handler/ha_innodb.cc

@@ -8804,8 +8804,12 @@ ha_innobase::index_read(

 	/* For R-Tree index, we will always place the page lock to

 	pages being searched */

-	if (dict_index_is_spatial(index)) {

-		++m_prebuilt->trx->will_lock;

+	if (index->is_spatial() && !m_prebuilt->trx->will_lock) {

+		if (trx_is_started(m_prebuilt->trx)) {

+			DBUG_RETURN(HA_ERR_READ_ONLY_TRANSACTION);

+		} else {

+			m_prebuilt->trx->will_lock = true;

+		}

 	/* Note that if the index for which the search template is built is not

I would also apply a larger refactoring to improve the debug checks. That refactoring would also remove the assertion that failed; the check would only exist in debug builds. Once a fix has been validated by the customer, I think that it should be applied to 10.3 and 10.4 as well.

Marko Mäkelä added a comment - 2021-05-07 09:31 If my suspicion about SPATIAL INDEX is correct, I think that the following should ensure that already started autocommit non-locking transactions will remain non-locking: diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc index b2908fc9819..48cb81ce2c0 100644 --- a/storage/innobase/handler/ha_innodb.cc +++ b/storage/innobase/handler/ha_innodb.cc @@ -8804,8 +8804,12 @@ ha_innobase::index_read( /* For R-Tree index, we will always place the page lock to pages being searched */ - if (dict_index_is_spatial(index)) { - ++m_prebuilt->trx->will_lock; + if (index->is_spatial() && !m_prebuilt->trx->will_lock) { + if (trx_is_started(m_prebuilt->trx)) { + DBUG_RETURN(HA_ERR_READ_ONLY_TRANSACTION); + } else { + m_prebuilt->trx->will_lock = true; + } } /* Note that if the index for which the search template is built is not I would also apply a larger refactoring to improve the debug checks. That refactoring would also remove the assertion that failed; the check would only exist in debug builds. Once a fix has been validated by the customer, I think that it should be applied to 10.3 and 10.4 as well.

Marko Mäkelä added a comment - 2021-05-07 09:35

elenst pointed out that MDEV-21987 is a bug with similar symptoms. It involves versioned tables.

Marko Mäkelä added a comment - 2021-05-07 09:35 elenst pointed out that MDEV-21987 is a bug with similar symptoms. It involves versioned tables.

Marko Mäkelä added a comment - 2021-05-12 15:06

kjoiner, do we have any results from the debug executable yet? Any debug assertion failures?

Marko Mäkelä added a comment - 2021-05-12 15:06 kjoiner , do we have any results from the debug executable yet? Any debug assertion failures?

Kyle Joiner (Inactive) added a comment - 2021-05-17 13:59

The debug executable has never crashed.

Kyle Joiner (Inactive) added a comment - 2021-05-17 13:59 The debug executable has never crashed.

Marko Mäkelä added a comment - 2021-05-18 07:32

Even though we did not find the root cause of the problem yet, I pushed the SPATIAL INDEX fix to 10.2 and merged up to 10.5. I also pushed the change of trx_t::will_lock to bool to 10.5. These will probably not explain the reported crash. I’m leaving this ticket open until more information is available of the crash. It could be very helpful to have a core dump (along with a copy of the executable and shared libraries that produced it).

Marko Mäkelä added a comment - 2021-05-18 07:32 Even though we did not find the root cause of the problem yet, I pushed the SPATIAL INDEX fix to 10.2 and merged up to 10.5. I also pushed the change of trx_t::will_lock to bool to 10.5. These will probably not explain the reported crash. I’m leaving this ticket open until more information is available of the crash. It could be very helpful to have a core dump (along with a copy of the executable and shared libraries that produced it).

Marko Mäkelä added a comment - 2021-06-09 19:15

A development snapshot that did not differ much from the 10.5.10 release and included some fixes was provided to the customer, and the problems did not occur anymore.

We cannot conclude if that is thanks to the cleanup (such as changing trx_t::will_lock from a counter to bool) or due to other changes between 10.5.9 and 10.5.10, possibly in the SQL layer, so that a transaction abort would no longer be ignored. (One example of that is in MDEV-21987.) I think that we can close this for now anyway.

Marko Mäkelä added a comment - 2021-06-09 19:15 A development snapshot that did not differ much from the 10.5.10 release and included some fixes was provided to the customer, and the problems did not occur anymore. We cannot conclude if that is thanks to the cleanup (such as changing trx_t::will_lock from a counter to bool ) or due to other changes between 10.5.9 and 10.5.10, possibly in the SQL layer, so that a transaction abort would no longer be ignored. (One example of that is in MDEV-21987 .) I think that we can close this for now anyway.

Marko Mäkelä added a comment - 2021-07-27 07:06

I ported the additional code cleanup from 10.5 to earlier major versions.

Marko Mäkelä added a comment - 2021-07-27 07:06 I ported the additional code cleanup from 10.5 to earlier major versions .

Daniel Black added a comment - 2022-03-25 05:19

mosmani, that's something different.

Please create a new bug report with a more complete backtrace.

This assertion in btr_check_blob_fil_page_type isn't something I've found in existing issues.

Please include details of the sorts of SQL operations performed on blob columns especially if the error log doesn't include a SQL query.

Daniel Black added a comment - 2022-03-25 05:19 mosmani , that's something different. Please create a new bug report with a more complete backtrace. This assertion in btr_check_blob_fil_page_type isn't something I've found in existing issues. Please include details of the sorts of SQL operations performed on blob columns especially if the error log doesn't include a SQL query.

People

Assignee:: Marko Mäkelä

Reporter:: Kyle Joiner (Inactive)

Votes:: 2 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 2021-05-04 18:04

Updated:: 2024-07-07 21:16

Resolved:: 2021-06-09 19:15

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server