[MDEV-25902] Unexpected ER_LOCK_WAIT_TIMEOUT and result after timeout, assertion failure err != DB_DUPLICATE_KEY Created: 2021-06-12  Updated: 2023-11-06  Resolved: 2021-07-01

Status: Closed
Project: MariaDB Server
Component/s: Locking, Partitioning, Storage Engine - InnoDB
Affects Version/s: 10.6
Fix Version/s: 10.6.3

Type: Bug Priority: Blocker
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: regression

Issue Links:
Duplicate
is duplicated by MDEV-24581 MDL race condition in trans_rollback_... Closed
Problem/Incident
is caused by MDEV-25506 Atomic DDL: .frm file is removed and ... Closed

 Description   

--source include/have_partition.inc
--source include/have_innodb.inc
 
CREATE TABLE t1 (pk int primary key) ENGINE=InnoDB;
CREATE TABLE t2 (pk int primary key) ENGINE=InnoDB;
 
--connect (con1,localhost,root,,test)
 
START TRANSACTION;
INSERT INTO t2 (pk) VALUES (1);
SAVEPOINT sp;
INSERT INTO t1 (pk) VALUES (1);
ROLLBACK TO SAVEPOINT sp;
 
--connection default
SET innodb_lock_wait_timeout= 1;
--error ER_LOCK_WAIT_TIMEOUT
ALTER TABLE t1 PARTITION BY HASH(pk);
 
SHOW CREATE TABLE t1;
 
--connection con1
COMMIT;
 
--connection default
ALTER TABLE t2 PARTITION BY HASH(pk);
 
# Cleanup
--disconnect con1
--connection default
DROP TABLE t1, t2;

In the test case above, the problem starts at

10.6 6c39eaeb

--error ER_LOCK_WAIT_TIMEOUT
ALTER TABLE t1 PARTITION BY HASH(pk);
ERROR HY000: Lock wait timeout exceeded; try restarting transaction

It looks new, 10.5 doesn't time out here.

Further, even though the query fails, the partitioning is still performed:

ALTER TABLE t1 PARTITION BY HASH(pk);
ERROR HY000: Lock wait timeout exceeded; try restarting transaction
SHOW CREATE TABLE t1;
Table	Create Table
t1	CREATE TABLE `t1` (
  `pk` int(11) NOT NULL,
  PRIMARY KEY (`pk`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
 PARTITION BY HASH (`pk`)

And then, further partitioning on a different table returns an unexpected error (on non-debug) or a debug assertion failure:

mysqltest: At line 26: query 'ALTER TABLE t2 PARTITION BY HASH(pk)' failed: ER_TABLE_EXISTS_ERROR (1050): Table './test/#sql-backup-3ea389-4' already exists

Please note that the assertion failure is just an aftermath, the problems start long before it.

10.6 6c39eaeb

mariadbd: /data/src/10.6/storage/innobase/row/row0mysql.cc:2962: dberr_t row_rename_table_for_mysql(const char*, const char*, trx_t*, bool): Assertion `err != DB_DUPLICATE_KEY' failed.
210612 20:24:56 [ERROR] mysqld got signal 6 ;
 
#7  0x00007f81ceefef36 in __GI___assert_fail (assertion=0x559b433158a5 "err != DB_DUPLICATE_KEY", file=0x559b43313c08 "/data/src/10.6/storage/innobase/row/row0mysql.cc", line=2962, function=0x559b43315778 "dberr_t row_rename_table_for_mysql(const char*, const char*, trx_t*, bool)") at assert.c:101
#8  0x0000559b42b901f5 in row_rename_table_for_mysql (old_name=0x7f81c8141ca0 "test/t2", new_name=0x7f81c8141aa0 "test/#sql-backup-3ea506-4", trx=0x7f81c93f33c0, use_fk=true) at /data/src/10.6/storage/innobase/row/row0mysql.cc:2962
#9  0x0000559b429f6814 in innobase_rename_table (trx=0x7f81c93f33c0, from=0x7f81c81427e0 "./test/t2", to=0x7f81c81429e0 "./test/#sql-backup-3ea506-4", use_fk=true) at /data/src/10.6/storage/innobase/handler/ha_innodb.cc:13423
#10 0x0000559b429de12f in ha_innobase::rename_table (this=0x7f819c018288, from=0x7f81c81427e0 "./test/t2", to=0x7f81c81429e0 "./test/#sql-backup-3ea506-4") at /data/src/10.6/storage/innobase/handler/ha_innodb.cc:13671
#11 0x0000559b425542af in handler::ha_rename_table (this=0x7f819c018288, from=0x7f81c81427e0 "./test/t2", to=0x7f81c81429e0 "./test/#sql-backup-3ea506-4") at /data/src/10.6/sql/handler.cc:5270
#12 0x0000559b422bcfb1 in mysql_rename_table (base=0x559b46095668, old_db=0x7f81c81445e0, old_name=0x7f81c81445f0, new_db=0x7f81c81445e0, new_name=0x7f81c8143140, id=0x7f81c8144680, flags=2) at /data/src/10.6/sql/sql_table.cc:4972
#13 0x0000559b422cd6a1 in mysql_alter_table (thd=0x7f819c000db8, new_db=0x7f819c0059c8, new_name=0x7f819c005de0, create_info=0x7f81c81453e0, table_list=0x7f819c0156b0, alter_info=0x7f81c81452f0, order_num=0, order=0x0, ignore=false, if_exists=false) at /data/src/10.6/sql/sql_table.cc:10521
#14 0x0000559b42381fad in Sql_cmd_alter_table::execute (this=0x7f819c0160f8, thd=0x7f819c000db8) at /data/src/10.6/sql/sql_alter.cc:550
#15 0x0000559b421c9562 in mysql_execute_command (thd=0x7f819c000db8) at /data/src/10.6/sql/sql_parse.cc:5983
#16 0x0000559b421cf708 in mysql_parse (thd=0x7f819c000db8, rawbuf=0x7f819c0155c0 "ALTER TABLE t2 PARTITION BY HASH(pk)", length=36, parser_state=0x7f81c8146480) at /data/src/10.6/sql/sql_parse.cc:8016
#17 0x0000559b421bbb92 in dispatch_command (command=COM_QUERY, thd=0x7f819c000db8, packet=0x7f819c00b879 "", packet_length=36, blocking=true) at /data/src/10.6/sql/sql_parse.cc:1897
#18 0x0000559b421ba537 in do_command (thd=0x7f819c000db8, blocking=true) at /data/src/10.6/sql/sql_parse.cc:1406
#19 0x0000559b42376d71 in do_handle_one_connection (connect=0x559b465ec328, put_in_cache=true) at /data/src/10.6/sql/sql_connect.cc:1410
#20 0x0000559b42376acd in handle_one_connection (arg=0x559b466152a8) at /data/src/10.6/sql/sql_connect.cc:1312
#21 0x0000559b428e5ed9 in pfs_spawn_thread (arg=0x559b465ebe78) at /data/src/10.6/storage/perfschema/pfs.cc:2201
#22 0x00007f81cf416609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#23 0x00007f81cefea293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



 Comments   
Comment by Marko Mäkelä [ 2021-06-12 ]

This is probably due to MDEV-25506 part 3, which refactored DDL operations that could involve deleting .ibd files or internally dropping tables (such as an old copy of a table in ALTER TABLE). We will first lock all tables, then lock the data dictionary, and finally attempt to modify the dictionary tables and commit. Should a further lock conflict occur during the final phase of modifying the dictionary tables, we will immediately fail with a lock wait timeout error. It looks like the assertion needs to be relaxed due to this.

Comment by Marko Mäkelä [ 2021-06-14 ]

The timeout is reported while ALTER TABLE t1 PARTITION BY HASH(pk) is dropping the original table (which carries a #sql-backup- name). The subsequent ALTER TABLE t2 would attempt to use the same #sql-backup- name, even though the previous ALTER TABLE t1 failed to drop that table.

The timeout occurs due to a lock conflict on table t1. The conflicting lock had been created by the transaction in con1. The ROLLBACK TO SAVEPOINT will not release the lock on table t1 inside InnoDB. InnoDB was not even asked about that: there was no call to innobase_rollback_to_savepoint_can_release_mdl() before ALTER TABLE t1 PARTITION BY HASH(pk) was executed.

This is a metadata locking (MDL) bug that likely was worked around by the background DROP TABLE queue that MDEV-25506 part 3 removed. The MDL for t1 was wrongly released here, without even consulting InnoDB:

10.6 82c07b178abfa9395f1d67747148a431d6eb6ae7

(rr) bt
#0  MDL_context::release_lock (this=0x7f02c4001c28, duration=MDL_TRANSACTION, ticket=0x7f02c4922b40)
    at /mariadb/10.6/sql/mdl.cc:2815
#1  0x0000555f98026220 in MDL_context::release_locks_stored_before (this=0x7f02c4001c28, 
    duration=MDL_TRANSACTION, sentinel=0x7f02c4013ce0) at /mariadb/10.6/sql/mdl.cc:2873
#2  MDL_context::rollback_to_savepoint (this=0x7f02c4001c28, mdl_savepoint=
      @0x7f02c401a0d0: {m_stmt_ticket = 0x0, m_trans_ticket = 0x7f02c4013ce0})
    at /mariadb/10.6/sql/mdl.cc:3037
#3  0x0000555f9802e318 in trans_rollback_to_savepoint (thd=thd@entry=0x7f02c4001af8, 
    name=<optimized out>) at /mariadb/10.6/sql/transaction.cc:709
#4  0x0000555f97eeb65c in mysql_execute_command (thd=thd@entry=0x7f02c4001af8)
    at /mariadb/10.6/sql/sql_parse.cc:5680
#5  0x0000555f97ee5e5e in mysql_parse (thd=thd@entry=0x7f02c4001af8, rawbuf=<optimized out>, 
    length=<optimized out>, parser_state=<optimized out>, parser_state@entry=0x7f02e406c580)
    at /mariadb/10.6/sql/sql_parse.cc:8016
#6  0x0000555f97ee375a in dispatch_command (command=<optimized out>, command@entry=COM_QUERY, 
    thd=thd@entry=0x7f02c4001af8, packet=packet@entry=0x7f02c400bc99 "ROLLBACK TO SAVEPOINT sp", 
    packet_length=3288416368, packet_length@entry=24, blocking=true)
    at /mariadb/10.6/sql/sql_parse.cc:1897

Comment by Elena Stepanova [ 2021-06-28 ]

Raising to a blocker not as much due to the assertion failure but because of ALTER being performed even though it returns an error.

Comment by Marko Mäkelä [ 2021-06-30 ]

Would you agree with my proposed fix? With that, we would never release any MDL after ROLLBACK TO SAVEPOINT if the transaction had acquired any locks in InnoDB. Only in that way can the MDL perform its intended task of detecting conflicts between DML and DDL.

I think that a follow-up performance improvement could be filed, to enhance the handlerton::savepoint_rollback_can_release_mdl interface, so that the storage engines can indicate which MDL are OK to release.

Comment by Sergei Golubchik [ 2021-07-01 ]

commit 161e4bfafd2 is ok to push

Generated at Thu Feb 08 09:41:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.