[MDEV-28317] Assertion failures in row_undo_mod upon restoring backup Created: 2022-04-14  Updated: 2022-04-19  Resolved: 2022-04-19

Status: Closed
Project: MariaDB Server
Component/s: mariabackup, Storage Engine - InnoDB
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7
Fix Version/s: 10.2.44, 10.3.35, 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3, 10.9.1

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: foreign-keys, rr-profile-analyzed

Issue Links:
Relates
relates to MDEV-17187 table doesn't exist in engine after A... Closed

 Description   

10.7 e87c710dfc

#2  0x00007f9d6ecc83fa in __assert_fail_base (fmt=0x7f9d6ee4f6c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x557f49908bc0 "node->table->is_temporary() || lock_table_has_locks(node->table)", 
    file=file@entry=0x557f499082d8 "/home/mariadb/node_table_is/10.7/storage/innobase/row/row0umod.cc", line=line@entry=1341, 
    function=function@entry=0x557f4990aa60 <row_undo_mod(undo_node_t*, que_thr_t*)::__PRETTY_FUNCTION__> "dberr_t row_undo_mod(undo_node_t*, que_thr_t*)") at assert.c:92
#3  0x00007f9d6ecc8472 in __GI___assert_fail (assertion=0x557f49908bc0 "node->table->is_temporary() || lock_table_has_locks(node->table)", 
    file=0x557f499082d8 "/home/mariadb/node_table_is/10.7/storage/innobase/row/row0umod.cc", line=1341, 
    function=0x557f4990aa60 <row_undo_mod(undo_node_t*, que_thr_t*)::__PRETTY_FUNCTION__> "dberr_t row_undo_mod(undo_node_t*, que_thr_t*)") at assert.c:101
#4  0x0000557f491a0c0f in row_undo_mod (node=0x4af00c001918, thr=0x4af00c001728) at /home/mariadb/node_table_is/10.7/storage/innobase/row/row0umod.cc:1341
#5  0x0000557f48faeb29 in row_undo (node=0x4af00c001918, thr=0x4af00c001728) at /home/mariadb/node_table_is/10.7/storage/innobase/row/row0undo.cc:413
#6  0x0000557f48faed4d in row_undo_step (thr=0x4af00c001728) at /home/mariadb/node_table_is/10.7/storage/innobase/row/row0undo.cc:460
#7  0x0000557f48ef76d4 in que_thr_step (thr=0x4af00c001728) at /home/mariadb/node_table_is/10.7/storage/innobase/que/que0que.cc:651
#8  0x0000557f48ef7964 in que_run_threads_low (thr=0x4af00c001728) at /home/mariadb/node_table_is/10.7/storage/innobase/que/que0que.cc:709
#9  0x0000557f48ef7a78 in que_run_threads (thr=0x4af00c001728) at /home/mariadb/node_table_is/10.7/storage/innobase/que/que0que.cc:729
#10 0x0000557f48ffe841 in trx_rollback_active (trx=0xbf8145a4390) at /home/mariadb/node_table_is/10.7/storage/innobase/trx/trx0roll.cc:602
#11 0x0000557f48fff084 in trx_rollback_recovered (all=true) at /home/mariadb/node_table_is/10.7/storage/innobase/trx/trx0roll.cc:740
#12 0x0000557f48fff37c in trx_rollback_all_recovered () at /home/mariadb/node_table_is/10.7/storage/innobase/trx/trx0roll.cc:786
#13 0x0000557f491b34be in tpool::task_group::execute (this=0x557f4abb2d80 <rollback_all_recovered_group>, t=0x557f4abb2e20 <rollback_all_recovered_task>)
    at /home/mariadb/node_table_is/10.7/tpool/task_group.cc:55
#14 0x0000557f491b383a in tpool::task::execute (this=0x557f4abb2e20 <rollback_all_recovered_task>) at /home/mariadb/node_table_is/10.7/tpool/task.cc:32
#15 0x0000557f491acdfc in tpool::thread_pool_generic::worker_main (this=0x557f4b3d7fa0, thread_var=0x557f4b487030) at /home/mariadb/node_table_is/10.7/tpool/tpool_generic.cc:588
#16 0x0000557f491b059f in std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__f=
    @0x7ffdf4002228: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x557f491acd64 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>, __t=@0x7ffdf4002220: 0x557f4b3d7fa0) at /usr/include/c++/7/bits/invoke.h:73
#17 0x0000557f491af739 in std::__invoke<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> (__fn=
    @0x7ffdf4002228: (void (tpool::thread_pool_generic::*)(tpool::thread_pool_generic * const, tpool::worker_data *)) 0x557f491acd64 <tpool::thread_pool_generic::worker_main(tpool::worker_data*)>) at /usr/include/c++/7/bits/invoke.h:95
#18 0x0000557f491b3193 in std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::_M_invoke<0ul, 1ul, 2ul> (this=0x7ffdf4002218) at /usr/include/c++/7/thread:234
#19 0x0000557f491b312d in std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> >::operator() (
    this=0x7ffdf4002218) at /usr/include/c++/7/thread:243
#20 0x0000557f491b310c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*> > >::_M_run (this=0x7ffdf4002210) at /usr/include/c++/7/thread:186
#21 0x00007ffdf02cb6df in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007ffdefff46db in start_thread (arg=0x24a57c4e2700) at pthread_create.c:463
#23 0x00007f9d6edb961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

10.4 3327bb6098

#2  0x000032b9322683fa in __assert_fail_base (fmt=0x32b9323ef6c0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=assertion@entry=0x5650e10ed860 "lock_table_has_locks(index->table)", file=file@entry=0x5650e10ed4a8 "/home/mariadb/node_table_is/10.4/storage/innobase/row/row0umod.cc", 
    line=line@entry=284, 
    function=function@entry=0x5650e10eea20 <row_undo_mod_clust(undo_node_t*, que_thr_t*)::__PRETTY_FUNCTION__> "dberr_t row_undo_mod_clust(undo_node_t*, que_thr_t*)") at assert.c:92
#3  0x000032b932268472 in __GI___assert_fail (assertion=0x5650e10ed860 "lock_table_has_locks(index->table)", 
    file=0x5650e10ed4a8 "/home/mariadb/node_table_is/10.4/storage/innobase/row/row0umod.cc", line=284, 
    function=0x5650e10eea20 <row_undo_mod_clust(undo_node_t*, que_thr_t*)::__PRETTY_FUNCTION__> "dberr_t row_undo_mod_clust(undo_node_t*, que_thr_t*)") at assert.c:101
#4  0x00005650e0a73875 in row_undo_mod_clust (node=0x7f2454001ad0, thr=0x7f2454001900) at /home/mariadb/node_table_is/10.4/storage/innobase/row/row0umod.cc:284
#5  0x00005650e0a77634 in row_undo_mod (node=0x7f2454001ad0, thr=0x7f2454001900) at /home/mariadb/node_table_is/10.4/storage/innobase/row/row0umod.cc:1397
#6  0x00005650e082eb53 in row_undo (node=0x7f2454001ad0, thr=0x7f2454001900) at /home/mariadb/node_table_is/10.4/storage/innobase/row/row0undo.cc:431
#7  0x00005650e082ee11 in row_undo_step (thr=0x7f2454001900) at /home/mariadb/node_table_is/10.4/storage/innobase/row/row0undo.cc:489
#8  0x00005650e076b5ee in que_thr_step (thr=0x7f2454001900) at /home/mariadb/node_table_is/10.4/storage/innobase/que/que0que.cc:1036
#9  0x00005650e076b8a1 in que_run_threads_low (thr=0x7f2454001900) at /home/mariadb/node_table_is/10.4/storage/innobase/que/que0que.cc:1100
#10 0x00005650e076baef in que_run_threads (thr=0x7f2454001900) at /home/mariadb/node_table_is/10.4/storage/innobase/que/que0que.cc:1140
#11 0x00005650e0897682 in trx_rollback_active (trx=0x5650e27f7118) at /home/mariadb/node_table_is/10.4/storage/innobase/trx/trx0roll.cc:648
#12 0x00005650e0897f96 in trx_rollback_recovered (all=true) at /home/mariadb/node_table_is/10.4/storage/innobase/trx/trx0roll.cc:805
#13 0x00005650e08982df in trx_rollback_all_recovered () at /home/mariadb/node_table_is/10.4/storage/innobase/trx/trx0roll.cc:860
#14 0x00002aec6b2036db in start_thread (arg=0x56f84b06e700) at pthread_create.c:463
#15 0x000032b93235961f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

rr profiles are available.

The failures are currently fairly well reproducible with a concurrent test:

git clone https://github.com/MariaDB/randgen --branch mdev28317 rqg-mdev28317
cd rqg-mdev28317
. ./mdev28317.cmd <basedir>

Couldn't reproduce on 10.2.



 Comments   
Comment by Marko Mäkelä [ 2022-04-19 ]

For the first trace I analyzed:

rr replay var_10_7/rr_profile_1649941320

the reason for the failure is that trx_resurrect_table_locks() failed to load the table definition, due to dict_load_table_on_id() detecting the following error:

10.7 e87c710dfc802254a59fa1bb0f9f43afaf3ef301

2022-04-14 13:02:10 0 [Warning] InnoDB: Load table `test`.`B` failed, the table has missing foreign key indexes. Turn off 'foreign_key_checks' and try again.

Later, rollback did load the table definition without any problem at all:

10.7 e87c710dfc802254a59fa1bb0f9f43afaf3ef301

#0  dict_load_table_on_id (table_id=72, ignore_err=DICT_ERR_IGNORE_FK_NOKEY)
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0load.cc:2543
#1  0x0000557f490dfc86 in dict_table_open_on_id<false> (table_id=72, 
    dict_locked=false, table_op=DICT_TABLE_OP_NORMAL, thd=0x0, mdl=0x0)
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0dict.cc:860
#2  0x0000557f491a0514 in row_undo_mod_parse_undo_rec (node=0x4af00c001918, 
    dict_locked=false)
    at /home/mariadb/node_table_is/10.7/storage/innobase/row/row0umod.cc:1220

I think that the following patch should fix this:

diff --git a/storage/innobase/include/dict0types.h b/storage/innobase/include/dict0types.h
--- a/storage/innobase/include/dict0types.h
+++ b/storage/innobase/include/dict0types.h
@@ -72,7 +72,7 @@ enum dict_err_ignore_t {
 	DICT_ERR_IGNORE_FK_NOKEY = 1,	/*!< ignore error if any foreign
 					key is missing */
 	DICT_ERR_IGNORE_INDEX = 2,	/*!< ignore corrupted indexes */
-	DICT_ERR_IGNORE_RECOVER_LOCK = 4,
+	DICT_ERR_IGNORE_RECOVER_LOCK = 5,
 					/*!< Used when recovering table locks
 					for resurrected transactions.
 					Silently load a missing

As far as I can tell, this fix should tame a check that only makes sense during DDL operations but not when loading a table definition:

10.7 e87c710dfc802254a59fa1bb0f9f43afaf3ef301

#0  dict_foreign_error_report (file=0x557f4b78c3c0, fk=0x557f4b73d008, 
    msg=0x557f498be320 "there is no index in the table which would contain\nthe columns as the first columns, or the data types in the\ntable do not match the ones in the referenced table\nor one of the ON ... SET NULL columns "...)
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0dict.cc:2899
#1  0x0000557f490d8d4f in dict_foreign_add_to_cache (foreign=0x557f4b73d008, 
    col_names=0x0, check_charsets=true, 
    ignore_err=DICT_ERR_IGNORE_RECOVER_LOCK)
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0dict.cc:3020
#2  0x0000557f490efe4c in dict_load_foreign (
    table_name=0x557f4b738b78 "test/B", col_names=0x0, trx_id=0, 
    check_recursive=true, check_charsets=true, id=..., 
    ignore_err=DICT_ERR_IGNORE_RECOVER_LOCK, 
    fk_tables=std::deque with 1 element = {...})
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0load.cc:2987
#3  0x0000557f490f04c8 in dict_load_foreigns (
    table_name=0x557f4b738b78 "test/B", col_names=0x0, trx_id=0, 
    check_charsets=true, ignore_err=DICT_ERR_IGNORE_RECOVER_LOCK, 
    fk_tables=std::deque with 1 element = {...})
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0load.cc:3121
#4  0x0000557f490edda7 in dict_load_table_one (name=..., 
    ignore_err=DICT_ERR_IGNORE_RECOVER_LOCK, 
    fk_tables=std::deque with 1 element = {...})
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0load.cc:2470
#5  0x0000557f490ee27d in dict_sys_t::load_table (
    this=0x557f4a11d700 <dict_sys>, name=..., 
    ignore=DICT_ERR_IGNORE_RECOVER_LOCK)
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0load.cc:2527
#6  0x0000557f490ee643 in dict_load_table_on_id (table_id=24, 
    ignore_err=DICT_ERR_IGNORE_RECOVER_LOCK)
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0load.cc:2595
#7  0x0000557f490dfc86 in dict_table_open_on_id<false> (table_id=24, 
    dict_locked=false, table_op=DICT_TABLE_OP_LOAD_TABLESPACE, thd=0x0, 
    mdl=0x0)
    at /home/mariadb/node_table_is/10.7/storage/innobase/dict/dict0dict.cc:860
#8  0x0000557f4900b020 in trx_resurrect_table_locks (trx=0xbf8145a31e0, 
    undo=0x557f4b764908)
    at /home/mariadb/node_table_is/10.7/storage/innobase/trx/trx0trx.cc:602
#9  0x0000557f4900b626 in trx_resurrect (undo=0x557f4b764908, 
    rseg=0x557f4a115580 <trx_sys+40512>, start_time=1649941330, 
    start_time_micro=23151458861874, rows_to_undo=0x7ffdefc7de10)
    at /home/mariadb/node_table_is/10.7/storage/innobase/trx/trx0trx.cc:685
#10 0x0000557f4900b940 in trx_lists_init_at_db_start ()
    at /home/mariadb/node_table_is/10.7/storage/innobase/trx/trx0trx.cc:735

Now that the error would be ignored, both the table and the incompletely resolved FOREIGN KEY constraint will remain in the cache, and the table IX lock will be resurrected. Any subsequent DML operations that could potentially violate FOREIGN KEY constraints will be blocked thanks to the fix of MDEV-17187.

Comment by Marko Mäkelä [ 2022-04-19 ]

The following test case reproduces the crash:

--source include/have_innodb.inc
 
SET foreign_key_checks=0;
CREATE TABLE parent(a INT PRIMARY KEY, INDEX(a)) ENGINE=InnoDB;
CREATE TABLE child(a INT, FOREIGN KEY(a) REFERENCES parent(a)) ENGINE=InnoDB;
INSERT INTO child VALUES(1);
SHOW CREATE TABLE child;
ALTER TABLE child DROP INDEX a;
 
connect(incomplete, localhost, root,,);
BEGIN;
DELETE FROM child;
 
connection default;
INSERT INTO parent SET a=0;
 
--let $shutdown_timeout=0
--source include/restart_mysqld.inc
disconnect incomplete;
 
DROP TABLE child,parent;

For me on a 10.6 ASAN build, it would fail in a strange way:

mysqltest: At line 21: query 'DROP TABLE child,parent' failed: ER_LOCK_WAIT_TIMEOUT (1205): Lock wait timeout exceeded; try restarting transaction

According to the server error log I can see that the rollback of the recovered incomplete transaction had already hit the assertion failure, but it had not triggered the SIGABRT yet.

On 10.2, the test does not crash (possibly because the debug assertion is missing), but it does fail:

10.2 e4e25d2bacc067417c35750f5f6c44cad10c81de

2022-04-19 11:31:31 139793338746816 [Warning] InnoDB: Load table `test`.`child` failed, the table has missing foreign key indexes. Turn off 'foreign_key_checks' and try again.
^ Found warnings in /dev/shm/10.2o/mysql-test/var/log/mysqld.1.err

With a redefinition of DICT_ERR_IGNORE_RECOVER_LOCK to include DICT_ERR_IGNORE_FK_NOKEY, it does not produce that warning.

Generated at Thu Feb 08 09:59:46 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.