[MDEV-26903] Assertion `ctx->trx->state == TRX_STATE_ACTIVE' fails upon concurrent DROP INDEX - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.6, 10.7(EOL)
Fix Version/s: 10.6.5
Component/s: Storage Engine - InnoDB
Labels:
- regression

Description

The test case for reproducing purposes only, not suitable for the test suite!

The test is concurrent and non-deterministic. It fails for me, but it can vary on different machines and builds. There is no need to run it with --repeat, as it already has a loop inside, which can be adjusted, although it can be converted into a one-loop-multiple-repeat variation. Also, a random grammar or mysqlslap can be created based on the same idea

--source include/have_innodb.inc

SET innodb_lock_wait_timeout= 1, lock_wait_timeout= 1;

CREATE TABLE t1 (

  pk INT AUTO_INCREMENT,

  a INT,

  b VARCHAR(10),

  PRIMARY KEY (pk),

  KEY (a),

  KEY (b)

) ENGINE=InnoDB;

CREATE TABLE t2 (

  pk INT AUTO_INCREMENT,

  a INT,

  b VARCHAR(10),

  PRIMARY KEY (pk),

  KEY (a),

  FOREIGN KEY (b) REFERENCES t1(b)

) ENGINE=InnoDB;

ALTER TABLE t1 ADD FOREIGN KEY (a) REFERENCES t2(a);

--connect (con1,localhost,root,,test)

SET innodb_lock_wait_timeout= 1, lock_wait_timeout= 1;

--let $run = 1000

while ($run)

  --connection default

  ALTER TABLE t2 ADD INDEX IF NOT EXISTS idx (a);

  ALTER TABLE t1 ADD INDEX IF NOT EXISTS idx (a);

  --connection con1

  --send

    DROP INDEX idx ON t2;

  --connection default

  --error 0,ER_LOCK_WAIT_TIMEOUT

    DROP INDEX idx ON t1;

  --connection con1

  --error 0,ER_LOCK_WAIT_TIMEOUT

  --reap

  --dec $run

# Cleanup

SET FOREIGN_KEY_CHECKS= OFF;

DROP TABLE t1, t2;

10.6 1193a793
mariadbd: /data/src/10.6/storage/innobase/handler/handler0alter.cc:8683: bool rollback_inplace_alter_table(Alter_inplace_info, const TABLE, row_prebuilt_t*): Assertion `ctx->trx->state == TRX_STATE_ACTIVE' failed.
211026 2:02:34 [ERROR] mysqld got signal 6 ;

#6 0x00007f8a52536729 in __assert_fail_base (fmt=0x7f8a526cc588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5567109861d0 "ctx->trx->state == TRX_STATE_ACTIVE", file=0x5567109808d0 "/data/src/10.6/storage/innobase/handler/handler0alter.cc", line=8683, function=<optimized out>) at assert.c:92
#7 0x00007f8a52547f36 in __GI___assert_fail (assertion=0x5567109861d0 "ctx->trx->state == TRX_STATE_ACTIVE", file=0x5567109808d0 "/data/src/10.6/storage/innobase/handler/handler0alter.cc", line=8683, function=0x556710987028 "bool rollback_inplace_alter_table(Alter_inplace_info, const TABLE, row_prebuilt_t*)") at assert.c:101
#8 0x0000556710105b9c in rollback_inplace_alter_table (ha_alter_info=0x7f8a47765c70, table=0x7f8a1cc9ae78, prebuilt=0x7f8a1cc9b968) at /data/src/10.6/storage/innobase/handler/handler0alter.cc:8683
#9 0x00005567100f4f7a in ha_innobase::commit_inplace_alter_table (this=0x7f8a1cc99980, altered_table=0x7f8a47765d30, ha_alter_info=0x7f8a47765c70, commit=false) at /data/src/10.6/storage/innobase/handler/handler0alter.cc:10791
#10 0x000055670fc71fb0 in handler::ha_commit_inplace_alter_table (this=0x7f8a1cc99980, altered_table=0x7f8a47765d30, ha_alter_info=0x7f8a47765c70, commit=false) at /data/src/10.6/sql/handler.cc:5194
#11 0x000055670f9ce31f in mysql_inplace_alter_table (thd=0x7f8a1c000db8, table_list=0x7f8a1c0156b8, table=0x7f8a1cc9ae78, altered_table=0x7f8a47765d30, ha_alter_info=0x7f8a47765c70, target_mdl_request=0x7f8a47766530, ddl_log_state=0x7f8a47765c10, trigger_param=0x7f8a47766100, alter_ctx=0x7f8a47767090) at /data/src/10.6/sql/sql_table.cc:7535
#12 0x000055670f9d6674 in mysql_alter_table (thd=0x7f8a1c000db8, new_db=0x7f8a1c0156d0, new_name=0x7f8a1c0156e0, create_info=0x7f8a47768000, table_list=0x7f8a1c0156b8, alter_info=0x7f8a47767f10, order_num=0, order=0x0, ignore=false, if_exists=false) at /data/src/10.6/sql/sql_table.cc:10204
#13 0x000055670f8cadc7 in mysql_execute_command (thd=0x7f8a1c000db8, is_called_from_prepared_stmt=false) at /data/src/10.6/sql/sql_parse.cc:4211
#14 0x000055670f8d7cb3 in mysql_parse (thd=0x7f8a1c000db8, rawbuf=0x7f8a1c0155c0 "DROP INDEX idx ON t1", length=20, parser_state=0x7f8a47768480) at /data/src/10.6/sql/sql_parse.cc:8030
#15 0x000055670f8c40e3 in dispatch_command (command=COM_QUERY, thd=0x7f8a1c000db8, packet=0x7f8a1c00b879 "DROP INDEX idx ON t1", packet_length=20, blocking=true) at /data/src/10.6/sql/sql_parse.cc:1896
#16 0x000055670f8c2a7f in do_command (thd=0x7f8a1c000db8, blocking=true) at /data/src/10.6/sql/sql_parse.cc:1404
#17 0x000055670fa81123 in do_handle_one_connection (connect=0x5567126809d8, put_in_cache=true) at /data/src/10.6/sql/sql_connect.cc:1418
#18 0x000055670fa80db3 in handle_one_connection (arg=0x556712632e08) at /data/src/10.6/sql/sql_connect.cc:1312
#19 0x000055670ffb204d in pfs_spawn_thread (arg=0x5567126b2a28) at /data/src/10.6/storage/perfschema/pfs.cc:2201
#20 0x00007f8a52a60609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#21 0x00007f8a52633293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Reproducible on 10.6/10.7 debug builds.
Not reproducible on 10.5.

The failure apparently started happening after this commit:

commit c3c53926c467c95386ae98d61ada87294bd61478

Author: Marko Mäkelä

Date:   Mon Oct 18 18:03:12 2021 +0300

    MDEV-26554: Races between INSERT on child and DDL on parent table

    The SQL layer never acquires metadata locks (MDL) on the tables

Attachments

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2021-10-26 05:44

Unfortunately, I cannot get an rr replay trace of this. I do get core dumps easily.

This looks like a ‘dormant’ bug that was not directly caused by the ~~MDEV-26554~~ fix. The ~~MDEV-26554~~ fix added InnoDB table lock acquisition to ha_innobase::commit_inplace_alter_table() with commit=true. Those locks can only be acquired inside an active InnoDB transaction, specifically, inside ctx->trx. But, here we have a not-yet-started ctx->trx. This would suggest that ha_innobase::commit_inplace_alter_table() is being invoked with commit=false because of a MDL upgrade timeout.

With a simple fix, I got the test to fail differently.

10.6 1193a793c40b806c6f1f007bbd87f4d9a73e686d with patch
mysqltest: At line 46: query 'reap' failed with wrong errno ER_LOCK_DEADLOCK (1213): 'Deadlock found when trying to get lock; try restarting transaction', instead of (0)...

After I changed the test to accept also that outcome:

--error 0,ER_LOCK_WAIT_TIMEOUT,ER_LOCK_DEADLOCK

it passed with flying colours in an AddressSanitizer build. My fix is simple:

diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc

index 4fc4751ff1c..adeaf87f7fe 100644

--- a/storage/innobase/handler/handler0alter.cc

+++ b/storage/innobase/handler/handler0alter.cc

@@ -8678,6 +8678,8 @@ inline bool rollback_inplace_alter_table(Alter_inplace_info *ha_alter_info,

     /* If we have not started a transaction yet,

     (almost) nothing has been or needs to be done. */

     dict_sys.lock(SRW_LOCK_CALL);

+  else if (ctx->trx->state == TRX_STATE_NOT_STARTED)

+    goto free_and_exit;

   else if (ctx->new_table)

     ut_ad(ctx->trx->state == TRX_STATE_ACTIVE);

I think that this error handling path was previously unreachable because we would have crashed earlier due to ~~MDEV-26554~~. The problematic code was introduced already in part 3 of ~~MDEV-25506~~.

I suspect that mysql_inplace_alter_table() initiated rollback because of MDL upgrade timeout, but I have yet to confirm that and to create a test case:

  // Upgrade to EXCLUSIVE before commit.

  if (wait_while_table_is_used(thd, table, HA_EXTRA_PREPARE_FOR_RENAME))

    goto rollback;

Marko Mäkelä added a comment - 2021-10-26 05:44 Unfortunately, I cannot get an rr replay trace of this. I do get core dumps easily. This looks like a ‘dormant’ bug that was not directly caused by the MDEV-26554 fix. The MDEV-26554 fix added InnoDB table lock acquisition to ha_innobase::commit_inplace_alter_table() with commit=true . Those locks can only be acquired inside an active InnoDB transaction, specifically, inside ctx->trx . But, here we have a not-yet-started ctx->trx . This would suggest that ha_innobase::commit_inplace_alter_table() is being invoked with commit=false because of a MDL upgrade timeout. With a simple fix, I got the test to fail differently. 10.6 1193a793c40b806c6f1f007bbd87f4d9a73e686d with patch mysqltest: At line 46: query 'reap' failed with wrong errno ER_LOCK_DEADLOCK (1213): 'Deadlock found when trying to get lock; try restarting transaction', instead of (0)... After I changed the test to accept also that outcome: --error 0,ER_LOCK_WAIT_TIMEOUT,ER_LOCK_DEADLOCK it passed with flying colours in an AddressSanitizer build. My fix is simple: diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc index 4fc4751ff1c..adeaf87f7fe 100644 --- a/storage/innobase/handler/handler0alter.cc +++ b/storage/innobase/handler/handler0alter.cc @@ -8678,6 +8678,8 @@ inline bool rollback_inplace_alter_table(Alter_inplace_info *ha_alter_info, /* If we have not started a transaction yet, (almost) nothing has been or needs to be done. */ dict_sys.lock(SRW_LOCK_CALL); + else if (ctx->trx->state == TRX_STATE_NOT_STARTED) + goto free_and_exit; else if (ctx->new_table) { ut_ad(ctx->trx->state == TRX_STATE_ACTIVE); I think that this error handling path was previously unreachable because we would have crashed earlier due to MDEV-26554 . The problematic code was introduced already in part 3 of MDEV-25506 . I suspect that mysql_inplace_alter_table() initiated rollback because of MDL upgrade timeout, but I have yet to confirm that and to create a test case: // Upgrade to EXCLUSIVE before commit. if (wait_while_table_is_used(thd, table, HA_EXTRA_PREPARE_FOR_RENAME)) goto rollback;

Marko Mäkelä added a comment - 2021-10-26 06:44

It looks like we really need a deadlock and not a ‘garden variety’ MDL timeout. I failed to trigger the error with the following:

--source include/have_innodb.inc

CREATE TABLE t1(a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB;

BEGIN;

SELECT * FROM t1;

--connect (con1,localhost,root,,test)

SET lock_wait_timeout=1;

--error ER_LOCK_WAIT_TIMEOUT

DROP INDEX b ON t1;

--disconnect con1

--connection default

DROP TABLE t1;

With an additional patch, I determined that the deadlock was reported by InnoDB:

Version: '10.6.5-MariaDB-debug-log'  socket: '/dev/shm/10.6/mysql-test/var/tmp/17/mysqld.1.sock'  port: 16320  Source distribution

deadlock t2

mariadbd: /mariadb/10.6/storage/innobase/handler/handler0alter.cc:8684: bool rollback_inplace_alter_table(Alter_inplace_info*, const TABLE*, row_prebuilt_t*): Assertion `ctx->trx->state == TRX_STATE_ACTIVE' failed.

This crashed for table t2. The instrumentation patch was as follows:

diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc

index 4fc4751ff1c..05e9202a561 100644

--- a/storage/innobase/handler/handler0alter.cc

+++ b/storage/innobase/handler/handler0alter.cc

@@ -10884,6 +10884,10 @@ ha_innobase::commit_inplace_alter_table(

 		if (error != DB_SUCCESS) {

 lock_fail:

+			if (error == DB_DEADLOCK) {

+				fprintf(stderr, "deadlock %s\n",

+					table_share->table_name.str);

+			}

 			my_error_innodb(

 				error, table_share->table_name.str, 0);

 			if (fts_exist) {

Yes, on deadlock the transaction would have been rolled back. It looks like the added table locking in ~~MDEV-26554~~ increased the likelihood of deadlocks, and thus the omission in the error handling became more prominent.

Armed with this knowledge, I tried to come up with a better test case. I was lazy and tried to use debug injection:

--source include/have_innodb.inc

--source include/have_debug.inc

CREATE TABLE t1(a INT PRIMARY KEY, b INT, INDEX(b)) ENGINE=InnoDB;

SET @save_dbug=@@debug_dbug;

SET debug_dbug='+d,innodb_report_deadlock';

--error ER_LOCK_DEADLOCK

ALTER TABLE t1 DROP INDEX b, ALGORITHM=INPLACE;

SET debug_dbug=@save_dbug;

DROP TABLE t1;

Alas, this would hit an error later on a record lock (which is actually impossible, because we would be holding exclusive table locks at that point), and reverting the first hunk of the following patch would make no difference to the outcome.

diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc

index 4fc4751ff1c..52790bc17b0 100644

--- a/storage/innobase/handler/handler0alter.cc

+++ b/storage/innobase/handler/handler0alter.cc

@@ -8678,6 +8678,8 @@ inline bool rollback_inplace_alter_table(Alter_inplace_info *ha_alter_info,

     /* If we have not started a transaction yet,

     (almost) nothing has been or needs to be done. */

     dict_sys.lock(SRW_LOCK_CALL);

+  else if (ctx->trx->state == TRX_STATE_NOT_STARTED)

+    goto free_and_exit;

   else if (ctx->new_table)

     ut_ad(ctx->trx->state == TRX_STATE_ACTIVE);

@@ -10323,6 +10325,9 @@ commit_try_norebuild(

 			case DB_LOCK_WAIT_TIMEOUT:

 				my_error(ER_LOCK_WAIT_TIMEOUT, MYF(0));

 				break;

+			case DB_DEADLOCK:

+				my_error(ER_LOCK_DEADLOCK, MYF(0));

+				break;

 			default:

 				sql_print_error("InnoDB: %s: %s\n", op,

 						ut_strerr(error));

Finally, injecting a fault already to the table lock acquisition (and adjusting the test to use it) made the trick:

diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc

index 234d215d1df..f088d0186bd 100644

--- a/storage/innobase/lock/lock0lock.cc

+++ b/storage/innobase/lock/lock0lock.cc

@@ -3433,6 +3433,7 @@ lock_table_other_has_incompatible(

 static dberr_t lock_table_low(dict_table_t *table, lock_mode mode,

                               que_thr_t *thr, trx_t *trx)

+  DBUG_EXECUTE_IF("innodb_table_deadlock", return DB_DEADLOCK;);

   lock_t *wait_for=

     lock_table_other_has_incompatible(trx, LOCK_WAIT, table, mode);

   dberr_t err= DB_SUCCESS;

This would repeat the reported assertion failure when the fix of rollback_inplace_alter_table() is reverted.

I am not going to modify commit_try_norebuild() in my fix. In fact, I think that also DB_LOCK_WAIT_TIMEOUT should be impossible in that function, ever since ~~MDEV-25919~~ simplified the locking. That could be removed later, after some extensive stress testing.

Marko Mäkelä added a comment - 2021-10-26 06:44 It looks like we really need a deadlock and not a ‘garden variety’ MDL timeout. I failed to trigger the error with the following: --source include/have_innodb.inc CREATE TABLE t1(a INT PRIMARY KEY , b INT UNIQUE ) ENGINE=InnoDB; BEGIN ; SELECT * FROM t1; --connect (con1,localhost,root,,test) SET lock_wait_timeout=1; --error ER_LOCK_WAIT_TIMEOUT DROP INDEX b ON t1; --disconnect con1 --connection default DROP TABLE t1; With an additional patch, I determined that the deadlock was reported by InnoDB: Version: '10.6.5-MariaDB-debug-log' socket: '/dev/shm/10.6/mysql-test/var/tmp/17/mysqld.1.sock' port: 16320 Source distribution deadlock t2 mariadbd: /mariadb/10.6/storage/innobase/handler/handler0alter.cc:8684: bool rollback_inplace_alter_table(Alter_inplace_info*, const TABLE*, row_prebuilt_t*): Assertion `ctx->trx->state == TRX_STATE_ACTIVE' failed. This crashed for table t2. The instrumentation patch was as follows: diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc index 4fc4751ff1c..05e9202a561 100644 --- a/storage/innobase/handler/handler0alter.cc +++ b/storage/innobase/handler/handler0alter.cc @@ -10884,6 +10884,10 @@ ha_innobase::commit_inplace_alter_table( if (error != DB_SUCCESS) { lock_fail: + if (error == DB_DEADLOCK) { + fprintf(stderr, "deadlock %s\n", + table_share->table_name.str); + } my_error_innodb( error, table_share->table_name.str, 0); if (fts_exist) { Yes, on deadlock the transaction would have been rolled back. It looks like the added table locking in MDEV-26554 increased the likelihood of deadlocks, and thus the omission in the error handling became more prominent. Armed with this knowledge, I tried to come up with a better test case. I was lazy and tried to use debug injection: --source include/have_innodb.inc --source include/have_debug.inc CREATE TABLE t1(a INT PRIMARY KEY , b INT , INDEX (b)) ENGINE=InnoDB; SET @save_dbug=@@debug_dbug; SET debug_dbug= '+d,innodb_report_deadlock' ; --error ER_LOCK_DEADLOCK ALTER TABLE t1 DROP INDEX b, ALGORITHM=INPLACE; SET debug_dbug=@save_dbug; DROP TABLE t1; Alas, this would hit an error later on a record lock (which is actually impossible, because we would be holding exclusive table locks at that point), and reverting the first hunk of the following patch would make no difference to the outcome. diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc index 4fc4751ff1c..52790bc17b0 100644 --- a/storage/innobase/handler/handler0alter.cc +++ b/storage/innobase/handler/handler0alter.cc @@ -8678,6 +8678,8 @@ inline bool rollback_inplace_alter_table(Alter_inplace_info *ha_alter_info, /* If we have not started a transaction yet, (almost) nothing has been or needs to be done. */ dict_sys.lock(SRW_LOCK_CALL); + else if (ctx->trx->state == TRX_STATE_NOT_STARTED) + goto free_and_exit; else if (ctx->new_table) { ut_ad(ctx->trx->state == TRX_STATE_ACTIVE); @@ -10323,6 +10325,9 @@ commit_try_norebuild( case DB_LOCK_WAIT_TIMEOUT: my_error(ER_LOCK_WAIT_TIMEOUT, MYF(0)); break; + case DB_DEADLOCK: + my_error(ER_LOCK_DEADLOCK, MYF(0)); + break; default: sql_print_error("InnoDB: %s: %s\n", op, ut_strerr(error)); Finally, injecting a fault already to the table lock acquisition (and adjusting the test to use it) made the trick: diff --git a/storage/innobase/lock/lock0lock.cc b/storage/innobase/lock/lock0lock.cc index 234d215d1df..f088d0186bd 100644 --- a/storage/innobase/lock/lock0lock.cc +++ b/storage/innobase/lock/lock0lock.cc @@ -3433,6 +3433,7 @@ lock_table_other_has_incompatible( static dberr_t lock_table_low(dict_table_t *table, lock_mode mode, que_thr_t *thr, trx_t *trx) { + DBUG_EXECUTE_IF("innodb_table_deadlock", return DB_DEADLOCK;); lock_t *wait_for= lock_table_other_has_incompatible(trx, LOCK_WAIT, table, mode); dberr_t err= DB_SUCCESS; This would repeat the reported assertion failure when the fix of rollback_inplace_alter_table() is reverted. I am not going to modify commit_try_norebuild() in my fix. In fact, I think that also DB_LOCK_WAIT_TIMEOUT should be impossible in that function, ever since MDEV-25919 simplified the locking. That could be removed later, after some extensive stress testing.

People

Assignee:: Marko Mäkelä

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2021-10-25 23:07

Updated:: 2021-10-26 07:28

Resolved:: 2021-10-26 07:28

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server