[MDEV-23721] Assertion ‘node->table->is_temporary() || lock_table_has_locks(node->table)’ failed in row_undo_ins Created: 2020-09-11  Updated: 2022-03-09  Resolved: 2022-03-09

Status: Closed
Project: MariaDB Server
Component/s: Encryption, Storage Engine - InnoDB
Affects Version/s: None
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Cannot Reproduce Votes: 0
Labels: need_rr

Issue Links:
Blocks
blocks MDEV-23484 Rollback unnecessarily acquires dict_... Closed
Relates
relates to MDEV-25506 Atomic DDL: .frm file is removed and ... Closed
relates to MDEV-27234 InnoDB dictionary recovery wrongly us... Closed

 Description   

...
2020-09-10  9:18:44 0 [Note] Recovering after a crash using mysql-bin
2020-09-10  9:18:44 0 [Note] InnoDB: Loading buffer pool(s) from /dev/shm/vardir/1599750400/110/1/data/ib_buffer_pool
2020-09-10  9:18:45 0 [Note] InnoDB: Buffer pool(s) load completed at 200910  9:18:45
2020-09-10  9:18:47 0 [Note] InnoDB: To roll back: 4 transactions, 193 rows
2020-09-10  9:18:48 0 [Note] Starting crash recovery...
2020-09-10  9:18:48 0 [Note] Crash recovery finished.
2020-09-10  9:18:52 0 [Note] InnoDB: Rolled back recovered transaction 102402
mysqld: storage/innobase/row/row0uins.cc:532: dberr_t row_undo_ins(undo_node_t*, que_thr_t*): Assertion `node->table->is_temporary() || lock_table_has_locks(node->table)' failed.
...
(rr) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x0000405d491ac859 in __GI_abort () at abort.c:79
#2  0x0000405d491ac729 in __assert_fail_base (fmt=0x405d49342588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x557283fde500 "node->table->is_temporary() || lock_table_has_locks(node->table)", 
    file=0x557283fde920 "storage/innobase/row/row0uins.cc", line=532, function=<optimized out>) at assert.c:92
#3  0x0000405d491bdf36 in __GI___assert_fail (assertion=assertion@entry=0x557283fde500 "node->table->is_temporary() || lock_table_has_locks(node->table)", 
    file=file@entry=0x557283fde920 "storage/innobase/row/row0uins.cc", line=line@entry=532, function=function@entry=0x557283fde580 "dberr_t row_undo_ins(undo_node_t*, que_thr_t*)") at assert.c:101
#4  0x00005572832d80ea in row_undo_ins (node=node@entry=0x61b000051020, thr=thr@entry=0x6170000311e8) at storage/innobase/include/dict0mem.h:1795
#5  0x0000557282cf2b21 in row_undo (node=node@entry=0x61b000051020, thr=thr@entry=0x6170000311e8) at storage/innobase/row/row0undo.cc:433
#6  0x0000557282cf7602 in row_undo_step (thr=thr@entry=0x6170000311e8) at /storage/innobase/row/row0undo.cc:494
#7  0x0000557282b2b96b in que_thr_step (thr=thr@entry=0x6170000311e8) at storage/innobase/que/que0que.cc:945
#8  0x0000557282b2c8dc in que_run_threads_low (thr=thr@entry=0x6170000311e8) at storage/innobase/que/que0que.cc:1009
#9  0x0000557282b2d618 in que_run_threads (thr=0x6170000311e8) at storage/innobase/que/que0que.cc:1049
#10 0x0000557282e1a38d in trx_rollback_active (trx=trx@entry=0x3002289a20b8) at storage/innobase/trx/trx0roll.cc:628
#11 0x0000557282e223d6 in trx_rollback_recovered (all=all@entry=true) at storage/innobase/trx/trx0roll.cc:782
#12 0x0000557282e234e1 in trx_rollback_all_recovered () at storage/innobase/trx/trx0roll.cc:836
#13 0x00007c755c863609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#14 0x0000405d492a9103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(rr)
 
Workflow
========
1. Start the server, load some initial data
2. Several concurrent sessions run a DDL/DML mix
3. At some point of time (2. is ongoing) the server process gets killed with ABRT
4. Attempt to restart that server
 
MariaDB
=======
origin/bb-10.5-MDEV-23399 29e8b54c1289f212eb8e820fab584b0769c6c69e 2020-09-10T14:28:45+03:00
The server claims to be a 10.5.6-MariaDB-debug-log.
 
RQG
====
git clone https://github.com/mleich1/rqg --branch experimental RQG
origin/experimental 26f374c2dbcebdbe72321f3728c564f1c327bba6 2020-09-02T13:10:22+02:00
 
perl rqg.pl \
--grammar=conf/mariadb/innodb_compression_encryption.yy \
--gendata=conf/mariadb/innodb_compression_encryption.zz \
--mysqld=--plugin-load-add=file_key_management.so \
--mysqld=--loose-file-key-management-filename=/home/mleich/RQG/conf/mariadb/encryption_keys.txt \
--mysqld=--innodb-encrypt-log \
--mysqld=--innodb-encrypt-tables \
--reporters=CrashRecovery1 \
--redefine=conf/mariadb/redefine_innodb_undo.yy \
--mysqld=--innodb-immediate-scrub-data-uncompressed=1 \
--mysqld=--innodb_use_native_aio=1 \
--mysqld=--innodb_lock_schedule_algorithm=fcfs \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--connect_timeout=60 \
--mysqld=--interactive_timeout=28800 \
--mysqld=--slave_net_timeout=60 \
--mysqld=--net_read_timeout=30 \
--mysqld=--net_write_timeout=60 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--wait_timeout=28800 \
--mysqld=--lock-wait-timeout=86400 \
--mysqld=--innodb-lock-wait-timeout=50 \
--no-mask \
--queries=10000000 \
--seed=random \
--reporters=Backtrace \
--reporters=ErrorLog \
--reporters=Deadlock1 \
--validators=None \
--mysqld=--log_output=none \
--mysqld=--log-bin \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--engine=InnoDB \
--restart_timeout=120 \
--duration=300 \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
--mysqld=--loose-innodb-sync-debug \
--mysqld=--innodb_stats_persistent=off \
--mysqld=--loose-max-statement-time=30 \
--threads=33 \
--mysqld=--innodb_page_size=4K \
--mysqld=--innodb-buffer-pool-size=8M \
--duration=300 \
--no_mask \
--workdir=<local settings> \
--vardir=<local settings> \
--mtr-build-thread=<local settings> \
--basedir1=<local settings> \
--script_debug=_nix_ \
--rr=Server \
--rr_options=--chaos
 
Innodb compression and encryption are involved but its unknown if
they are required for the current defect.
The likelihood to replay the current problem is rather low.



 Comments   
Comment by Marko Mäkelä [ 2020-09-11 ]

I performed an initial analysis of the trace. The problem appears to be that the name of the table does not match the tablespace until the very last call of fil_space_for_table_exists_in_mem(). That is why early calls of fil_ibd_open() returned NULL or !table->is_readable() held (due to !table->space) in trx_resurrect_table_locks(). However, the final call to fil_table_accessible() in row_undo_ins_parse_undo_rec() would hold, and upon rolling back the changes, we would notice that no table lock was resurrected.

It seems to me that MDEV-23484 will require a follow-up fix to make it compatible with the rollback of RENAME operations (which was implemented in MDEV-14717).

Because trx_resurrect_table_locks() is evicting the unreadable table, and because the table will be loaded with the correct name in dict_table_open_on_id() by row_undo_ins_parse_undo_rec(), it seems that the SYS_TABLES.NAME must have been corrected by a rollback of a dictionary transaction.

Possibly the fix should be that trx_resurrect_table_locks() should resurrect table locks even if !table->space.

It is unclear yet whether 10.2 is affected by this. MDEV-12266 changed some related code in 10.3.

Comment by Marko Mäkelä [ 2020-10-15 ]

For fixing the bug, I will need another trace, because we lost the original one.

Comment by Marko Mäkelä [ 2022-03-09 ]

I think that this might already have been fixed in 10.6. If not, MDEV-27234 could share a root cause with this.

Comment by Marko Mäkelä [ 2022-03-09 ]

mleich, is this still repeatable?

Comment by Matthias Leich [ 2022-03-09 ]

The problem was no more observed since ~ Spring 2021.

Generated at Thu Feb 08 09:24:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.