[MDEV-17843] Assertion `page_rec_is_leaf(rec)' failed in lock_rec_queue_validate upon SHOW ENGINE INNODB STATUS Created: 2018-11-27  Updated: 2022-05-11  Resolved: 2022-05-04

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.0, 10.3, 10.4, 10.6
Fix Version/s: 10.3.35, 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3, 10.9.1

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Vladislav Lesin
Resolution: Fixed Votes: 0
Labels: affects-tests, rr-profile-analyzed

Attachments: File MDEV-17843.cfg     File MDEV-17843.yy     File RR_REPLAY_SIMP.sh    
Issue Links:
Duplicate
is duplicated by MDEV-25711 Assertion `page_rec_is_leaf(rec)' fai... Closed
Relates
relates to MDEV-20605 Awaken transaction can miss inserted ... Closed
relates to MDEV-14409 Assertion `page_rec_is_leaf(rec)' fai... Closed

 Description   

Note: MDEV-14409 has been fixed, and indeed, 10.2 does not fail anymore, but 10.3 and 10.4 still do, a bit differently.

10.3 95f3c142a4f

mysqld: /data/src/10.3/storage/innobase/lock/lock0lock.cc:4883: bool lock_rec_queue_validate(bool, const buf_block_t*, const rec_t*, const dict_index_t*, const ulint*): Assertion `page_rec_is_leaf(rec)' failed.
181204 14:09:53 [ERROR] mysqld got signal 6 ;
 
#7  0x00007f8cc4682ee2 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#8  0x000055a03583663d in lock_rec_queue_validate (locked_lock_trx_sys=true, block=0x7f8c9fa4ee68, rec=0x7f8ca054c070 "supremum", index=0x7f8c5402eae8, offsets=0x7f8cc11af800) at /data/src/10.3/storage/innobase/lock/lock0lock.cc:4883
#9  0x000055a0358377a3 in lock_rec_validate_page (block=0x7f8c9fa4ee68) at /data/src/10.3/storage/innobase/lock/lock0lock.cc:5096
#10 0x000055a035837bed in lock_rec_block_validate (space_id=9, page_no=3) at /data/src/10.3/storage/innobase/lock/lock0lock.cc:5198
#11 0x000055a035837fb9 in lock_validate () at /data/src/10.3/storage/innobase/lock/lock0lock.cc:5272
#12 0x000055a035836072 in lock_print_info_all_transactions (file=0x55a037dafdd0) at /data/src/10.3/storage/innobase/lock/lock0lock.cc:4775
#13 0x000055a035971397 in srv_printf_innodb_monitor (file=0x55a037dafdd0, nowait=0, trx_start_pos=0x7f8cc11b03b0, trx_end=0x7f8cc11b03b8) at /data/src/10.3/storage/innobase/srv/srv0srv.cc:1259
#14 0x000055a0357bcc97 in innodb_show_status (hton=0x55a037c94280, thd=0x7f8c50000b00, stat_print=0x55a035496581 <stat_print(THD*, char const*, size_t, char const*, size_t, char const*, size_t)>) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:15665
#15 0x000055a0357bd704 in innobase_show_status (hton=0x55a037c94280, thd=0x7f8c50000b00, stat_print=0x55a035496581 <stat_print(THD*, char const*, size_t, char const*, size_t, char const*, size_t)>, stat_type=HA_ENGINE_STATUS) at /data/src/10.3/storage/innobase/handler/ha_innodb.cc:16083
#16 0x000055a03549691b in ha_show_status (thd=0x7f8c50000b00, db_type=0x55a037c94280, stat=HA_ENGINE_STATUS) at /data/src/10.3/sql/handler.cc:5896
#17 0x000055a035191899 in mysql_execute_command (thd=0x7f8c50000b00) at /data/src/10.3/sql/sql_parse.cc:4007
#18 0x000055a03519e7cb in mysql_parse (thd=0x7f8c50000b00, rawbuf=0x7f8c50015a68 "SHOW ENGINE INNODB STATUS /* QNO 107 CON_ID 12 */", length=49, parser_state=0x7f8cc11b1640, is_com_multi=false, is_next_command=false) at /data/src/10.3/sql/sql_parse.cc:8091
#19 0x000055a03518b97f in dispatch_command (command=COM_QUERY, thd=0x7f8c50000b00, packet=0x7f8c50421231 "", packet_length=50, is_com_multi=false, is_next_command=false) at /data/src/10.3/sql/sql_parse.cc:1850
#20 0x000055a03518a3a3 in do_command (thd=0x7f8c50000b00) at /data/src/10.3/sql/sql_parse.cc:1395
#21 0x000055a0352f20a0 in do_handle_one_connection (connect=0x55a0387bbe60) at /data/src/10.3/sql/sql_connect.cc:1402
#22 0x000055a0352f1e24 in handle_one_connection (arg=0x55a0387bbe60) at /data/src/10.3/sql/sql_connect.cc:1308
#23 0x00007f8cc613e494 in start_thread (arg=0x7f8cc11b2700) at pthread_create.c:333
#24 0x00007f8cc473f93f in clone () from /lib/x86_64-linux-gnu/libc.so.6
 
Query (0x7f8c50015a68): SHOW ENGINE INNODB STATUS /* QNO 107 CON_ID 12 */
Connection ID (thread ID): 12
Status: NOT_KILLED

The grammar and test are the same as in MDEV-14409:

RQG grammar mdev14409.yy

thread2_init:
  CREATE TABLE `t1` (
    `t1` TEXT,
    `v1` VARCHAR(1),
    `t2` TEXT,
    `v2` VARCHAR(1),
    UNIQUE (`v2`),
    UNIQUE (`t2`(255))
  ) ENGINE=InnoDB ROW_FORMAT=DYNAMIC
  ; CREATE TABLE `t2` LIKE `t1`
  ; CREATE TABLE `t3` LIKE `t1`
  ; CREATE TABLE `t4` LIKE `t1`
  ; INSERT IGNORE INTO `t1` VALUES vals,vals,vals,vals,vals,vals,vals,vals,vals,vals 
  ; INSERT IGNORE INTO `t2` VALUES vals,vals,vals,vals,vals,vals,vals,vals,vals,vals 
  ; INSERT IGNORE INTO `t3` VALUES vals,vals,vals,vals,vals,vals,vals,vals,vals,vals 
  ; INSERT IGNORE INTO `t4` VALUES vals,vals,vals,vals,vals,vals,vals,vals,vals,vals 
  ; SELECT * FROM `t1` INTO OUTFILE 'load.data.t1'
  ; SELECT * FROM `t2` INTO OUTFILE 'load.data.t2'
  ; SELECT * FROM `t3` INTO OUTFILE 'load.data.t3'
  ; SELECT * FROM `t4` INTO OUTFILE 'load.data.t4'
;
 
vals:
  (_char(1024),_char(1),_char(1024),_char(1));
 
thread1:
  SHOW ENGINE INNODB STATUS;
 
query:
  DELETE FROM my_table LIMIT _digit |
  LOAD DATA INFILE { "'load.data.t".$prng->int(1,4)."'" } REPLACE INTO TABLE my_table
;
 
my_table:
  t1 | t2 | t3 | t4 ;

Run as

perl ./runall-new.pl  --basedir=/data/bld/10.2 --vardir=/dev/shm/vardir --duration=300 --threads=2 --grammar=mdev14409.yy --skip-gendata

Don't forget to use your basedir and check whether the vardir path is okay for you.

Note: If something does not work with your version of RQG, try https://github.com/MariaDB/randgen branch elenst-mdev14409.



 Comments   
Comment by Matthias Leich [ 2019-11-05 ]

Exact that assert upon SHOW ENGINE INNODB STATUS observed
- on 10.5 51fb39bf175b275c91468ea6cd1cc82c1395c16d 2019-11-04
- the server claims to be a 10.5.0
too.

Comment by Matthias Leich [ 2020-10-14 ]

 
Replay of the problem on
origin/bb-10.5-MDEV-23855 82ffb426bcb4832631d86443ab3781bbd6de3b23 2020-10-14T17:28:26+03:00
compiled with debug, asan and Og.
The server claims to be a 10.5.7.
 
[rr 2815157 156984]mysqld: /home/mleich/Server/bb-10.5-MDEV-23855/storage/innobase/lock/lock0lock.cc:4663: bool lock_rec_queue_validate(bool, const buf_block_t*, const rec_t*, const dict_index_t*, const rec_offs*): Assertion `page_rec_is_leaf(rec)' failed.
201014 17:03:47 [ERROR] mysqld got signal 6 ;
 
Query (0x62b0002c3238): SHOW ENGINE INNODB STATUS
 
Connection ID (thread ID): 47
Status: NOT_KILLED
 
rr:/home/mleich/RQG/storage/1602686246/tmp/dev/shm/vardir/1602686246/91/1/rr
_RR_TRACE_DIR="." rr replay --mark-stdio
 
Other files which might be of interest:
/home/mleich/RQG/storage/1602686246/001535.tgz     <= Archive with the remainings of the test.
/home/mleich/RQG/storage/1602686246/tmp/dev/shm/vardir/1602686246/91/1/mysql.err
/home/mleich/RQG/storage/1602686246/tmp/rqg.log   <= RQG log
 
RQG
====
git clone https://github.com/mleich1/rqg --branch experimental RQG
 
perl rqg.pl \
--gendata=conf/mariadb/oltp.zz \
--grammar=conf/mariadb/oltp.yy \
--reporters=Mariabackup_linux \
--mysqld=--innodb_use_native_aio=1 \
--mysqld=--innodb_lock_schedule_algorithm=fcfs \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--connect_timeout=60 \
--mysqld=--interactive_timeout=28800 \
--mysqld=--slave_net_timeout=60 \
--mysqld=--net_read_timeout=30 \
--mysqld=--net_write_timeout=60 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--wait_timeout=28800 \
--mysqld=--lock-wait-timeout=86400 \
--mysqld=--innodb-lock-wait-timeout=50 \
--no-mask \
--queries=10000000 \
--seed=random \
--reporters=Backtrace \
--reporters=ErrorLog \
--reporters=Deadlock1 \
--validators=None \
--mysqld=--log_output=none \
--mysqld=--log-bin \
--mysqld=--log_bin_trust_function_creators=1 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--engine=InnoDB \
--restart_timeout=120 \
--duration=400 \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=400 \
--mysqld=--loose-innodb-sync-debug \
--mysqld=--innodb_stats_persistent=off \
--mysqld=--loose-max-statement-time=30 \
--threads=33 \
--mysqld=--innodb_page_size=8K \
--mysqld=--innodb-buffer-pool-size=8M \
--duration=400 \
--no_mask \
--workdir=<local settings> \
--vardir=<local settings> \
--mtr-build-thread=<local settings> \
--basedir1=<local settings> \
--script_debug=_nix_ \
--rr=Server \
--rr_options=--chaos

Comment by Marko Mäkelä [ 2021-02-18 ]

I analyzed a recent failure. The clustered index root page ceases to be a leaf page a little earlier:

10.6 43b239a08151f053a536187a65a874bf3586ff00

#1  0x00005578a963be33 in btr_page_empty (block=0x552c22fd2a00, 
    page_zip=<optimized out>, index=<optimized out>, level=<optimized out>, 
    mtr=0x23b047a85560)
    at /Server/bb-10.6-MDEV-24915/storage/innobase/btr/btr0btr.cc:1723
#2  0x00005578a96414b6 in btr_root_raise_and_insert (flags=flags@entry=0, 
    cursor=cursor@entry=0x23b047a84d20, offsets=offsets@entry=0x23b047a84b60, 
    heap=heap@entry=0x23b047a84b40, tuple=tuple@entry=0x6170002994a0, 
    n_ext=<optimized out>, mtr=<optimized out>)
    at /Server/bb-10.6-MDEV-24915/storage/innobase/btr/btr0btr.cc:2022
#3  0x00005578a966ec9a in btr_cur_pessimistic_insert (flags=flags@entry=0, 
    cursor=cursor@entry=0x23b047a84d20, offsets=offsets@entry=0x23b047a84b60, 
    heap=heap@entry=0x23b047a84b40, entry=entry@entry=0x6170002994a0, 
    rec=rec@entry=0x23b047a84b80, big_rec=<optimized out>, 
    n_ext=<optimized out>, thr=<optimized out>, mtr=<optimized out>)
    at /Server/bb-10.6-MDEV-24915/storage/innobase/btr/btr0cur.cc:3770

At this time, the lock_validate() was trying to acquire the page latch for this very page:

10.6 43b239a08151f053a536187a65a874bf3586ff00

#14 mtr_t::page_lock (this=this@entry=0x4500136878d0, block=block@entry=0x552c22fd2a00, rw_latch=rw_latch@entry=2) at /Server/bb-10.6-MDEV-24915/storage/innobase/mtr/mtr0mtr.cc:1071
#15 0x00005578a96e276f in buf_page_get_low (page_id=<optimized out>, zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at /Server/bb-10.6-MDEV-24915/storage/innobase/buf/buf0buf.cc:3278
#16 0x00005578a92f33a5 in lock_rec_block_validate (page_id=<optimized out>) at /Server/bb-10.6-MDEV-24915/storage/innobase/include/fil0fil.h:682
#17 0x00005578a93045a0 in lock_validate () at /Server/bb-10.6-MDEV-24915/storage/innobase/lock/lock0lock.cc:4640
#18 0x00005578a930584a in lock_print_info_all_transactions (file=file@entry=0x615000002d80) at /Server/bb-10.6-MDEV-24915/storage/innobase/lock/lock0lock.cc:4194
#19 0x00005578a954a41b in srv_printf_innodb_monitor (file=<optimized out>, nowait=nowait@entry=0, trx_start_pos=trx_start_pos@entry=0x4500136887a0, trx_end=trx_end@entry=0x4500136887c0) at /Server/bb-10.6-MDEV-24915/storage/innobase/srv/srv0srv.cc:812
#20 0x00005578a91aecc9 in innodb_show_status (stat_print=<optimized out>, thd=<optimized out>, hton=<optimized out>) at /Server/bb-10.6-MDEV-24915/storage/innobase/handler/ha_innodb.cc:15619
#21 innobase_show_status (hton=<optimized out>, thd=<optimized out>, stat_print=<optimized out>, stat_type=<optimized out>) at /Server/bb-10.6-MDEV-24915/storage/innobase/handler/ha_innodb.cc:15701

The code in lock_rec_block_validate() seems invalid. Here is the 10.2 version:

		block = buf_page_get_gen(
			page_id_t(space_id, page_no),
			page_size_t(space->flags),
			RW_X_LATCH, NULL,
			BUF_GET_POSSIBLY_FREED,
			__FILE__, __LINE__, &mtr, &err);
 
		if (err != DB_SUCCESS) {
			ib::error() << "Lock rec block validate failed for tablespace "
				   << space->name
				   << " space_id " << space_id
				   << " page_no " << page_no << " err " << err;
		}
 
		if (block) {
			buf_block_dbg_add_level(block, SYNC_NO_ORDER_CHECK);
 
			ut_ad(lock_rec_validate_page(block));
		}

At this point, because we had released the lock_sys->mutex (in 10.6, lock_sys.latch), there is no guarantee that locks actually exist for this page, or even that the tablespace still exists. The warning message is unnecessary. We fail to check block->page.file_page_was_freed after using BUF_GET_POSSIBLY_FREED.

In lock_rec_queue_validate(), rec must not be dereferenced before we have ensured that locks actually exist for the page.

Side note: lock_rec_queue_validate() only seems to cover B-tree record locks, not SPATIAL INDEX locks.

Comment by Roel Van de Paar [ 2021-05-18 ]

Additional testcase in MDEV-25711, which I marked as duplicate of this one.

Comment by Matthias Leich [ 2022-04-13 ]

How to replay with my current version of RQG
 
git clone https://github.com/mleich1/rqg --branch experimental RQG
# GIT_SHOW: HEAD -> experimental, origin/experimental 358366308288eaadbcb62822ef4faadf6e7aebc8 2022-03-28T18:04:44+02:00
# rqg.pl  : Version 4.0.4 (2021-12)
Place the uploaded file in teh RQG top level directory.
./RR_REPLAY_SIMP.sh MDEV-17843.cfg <path to your binaries>
 
The replay of the problem on some development tree
origin/bb-10.6-MDEV-21423 705d1ffdec97929c90a1d7d4adf2a8c044864020 2022-04-08T19:54:48+03:00
is quite fast.
 
last_result_dir/result.txt containing around end some aggregation like
STATISTICS: RQG runs -- Verdict
STATISTICS:        1 -- 'replay'           -- Replay of desired effect ...
                                       This means the text pattern typical for MDEV-17843 was observed.
...
STATISTICS:      120 -- 'ignore_stopped'   -- RQG run stopped by rqg_batch.pl ...
                                         Typical reason is that the RQG batch tool killed ongoing RQG runs
                                          because we already have a replay.
...
STATISTICS Frequency -- extra_info
STATISTICS        57 -- 'ignore_stopped -- work flow 13'
STATISTICS        63 -- 'ignore_stopped -- work flow 5'
STATISTICS         1 -- 'ignore_unwanted -- STATUS_SERVER_CRASHED--ASAN-fail2--MDEV-17843--rr--TBR-1432'
                                The text pattern for MDEV-17843 was observed but the rr tracing failed.
STATISTICS         1 -- 'replay -- STATUS_SERVER_CRASHED--MDEV-17843--rr--TBR-1432'
                                We have one replayer with good rr trace.
 
 grep STATUS_SERVER_CRASHED--MDEV-17843--rr--TBR-1432 last_result_dir/result.txt 
       2022-04-12T17:28:05 |  ... | replay           | 000057.log ...
"suggests" to look into
last_result_dir/000057.log    -- Protocol of the RQG run
last_result_dir/000057.tar.xz -- Archived remains (rr trace, datadir, ...) of that test run
 
                
                                

Comment by Matthias Leich [ 2022-04-22 ]

origin/bb-10.3-MDEV-17843-lock_rec_queue_validate debf02e344b0078fddc5186f9af5c9ff9cdecb24 2022-04-14T15:05:13+03:00
behaved well in RQG testing. No new bad effect met.

Generated at Thu Feb 08 08:39:32 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.