MariaDB Server
MDEV-25072

Hang + Heap Corruption: SIGABRT in __libc_message from _int_free on DBUGCloseFile

Details

    Description

      Repeat the following testcase (using the CLI) until the server crashes:

      DROP DATABASE test;
      CREATE DATABASE test;
      USE test;
      SET GLOBAL innodb_limit_optimistic_insert_debug=7;
      CREATE TABLE t1 (c1 INT) PARTITION BY HASH (c1) PARTITIONS 15;
      SET GLOBAL innodb_change_buffering_debug=1;
      DROP TABLE t1;
      SELECT SUBSTRING ('00', 1, 1);
      

      Note that this can easily take 20+ minutes, as the server becomes very slow while executing the testcase. Using a separate connection, you can see things like:

      10.6.0 27d66d644cf2ebe9201e0362f2050036cce2908a (Debug)

      10.6.0>show processlist;
      +----+------+-----------+------+---------+------+----------------+------------------+----------+
      | Id | User | Host      | db   | Command | Time | State          | Info             | Progress |
      +----+------+-----------+------+---------+------+----------------+------------------+----------+
      |  4 | root | localhost | test | Query   |  562 | closing tables | DROP TABLE t1    |    0.000 |
      |  5 | root | localhost | test | Query   |    0 | starting       | show processlist |    0.000 |
      +----+------+-----------+------+---------+------+----------------+------------------+----------+
      2 rows in set (0.000 sec)
      

      And the eventual outcome is:

      10.6.0 27d66d644cf2ebe9201e0362f2050036cce2908a (Debug)

      Core was generated by `/test/MD260221-mariadb-10.6.0-linux-x86_64-dbg/bin/mysqld --no-defaults --core-'.
      Program terminated with signal SIGABRT, Aborted.
      #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6)
          at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      [Current thread is 1 (Thread 0x1500804a6700 (LWP 213801))]
      (gdb) bt
      #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      #1  0x0000560223b5eab8 in my_write_core (sig=sig@entry=6) at /test/10.6_dbg/mysys/stacktrace.c:424
      #2  0x00005602232f3317 in handle_fatal_signal (sig=6) at /test/10.6_dbg/sql/signal_handler.cc:331
      #3  <signal handler called>
      #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #5  0x0000150083064859 in __GI_abort () at abort.c:79
      #6  0x00001500830cf3ee in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x1500831f9285 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
      #7  0x00001500830d747c in malloc_printerr (str=str@entry=0x1500831f74ae "free(): invalid pointer") at malloc.c:5347
      #8  0x00001500830d8cac in _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:4173
      #9  0x0000560223b7a814 in DBUGCloseFile (cs=cs@entry=0x150034000ce0, new_value=new_value@entry=0x0) at /test/10.6_dbg/dbug/dbug.c:2012
      #10 0x0000560223b7a90c in FreeState (cs=cs@entry=0x150034000ce0, free_state=free_state@entry=1) at /test/10.6_dbg/dbug/dbug.c:1634
      #11 0x0000560223b7b1c9 in _db_pop_ () at /test/10.6_dbg/dbug/dbug.c:932
      #12 0x0000560223b63856 in safe_mutex_lock (mp=mp@entry=0x56022457acc0 <buf_pool>, my_flags=my_flags@entry=0, file=file@entry=0x560223f819c8 "/test/10.6_dbg/storage/innobase/buf/buf0buf.cc", line=line@entry=3171) at /test/10.6_dbg/mysys/thr_mutex.c:398
      #13 0x00005602239c3ab9 in inline_mysql_mutex_lock (src_line=3171, src_file=0x560223f819c8 "/test/10.6_dbg/storage/innobase/buf/buf0buf.cc", that=0x56022457acc0 <buf_pool>) at /test/10.6_dbg/include/mysql/psi/mysql_thread.h:765
      #14 buf_page_get_low (page_id=page_id@entry={m_id = 315}, zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=2, guess=<optimized out>, guess@entry=0x0, mode=mode@entry=11, mtr=mtr@entry=0x1500804a1ca0, err=<optimized out>, allow_ibuf_merge=<optimized out>) at /test/10.6_dbg/storage/innobase/buf/buf0buf.cc:3171
      #15 0x00005602239c3d21 in buf_page_get_gen (page_id={m_id = 315}, zip_size=zip_size@entry=0, rw_latch=rw_latch@entry=2, guess=guess@entry=0x0, mode=mode@entry=11, mtr=mtr@entry=0x1500804a1ca0, err=0x1500804a0b8c, allow_ibuf_merge=true) at /test/10.6_dbg/storage/innobase/buf/buf0buf.cc:3341
      #16 0x0000560223984fdf in btr_cur_search_to_nth_level_func (index=index@entry=0x560226fbbbf8, level=<optimized out>, tuple=tuple@entry=0x1500340ba758, mode=mode@entry=PAGE_CUR_LE, latch_mode=<optimized out>, latch_mode@entry=4098, cursor=cursor@entry=0x1500804a1a20, ahi_latch=<optimized out>, mtr=<optimized out>, autoinc=<optimized out>) at /test/10.6_dbg/storage/innobase/include/dict0mem.h:1165
      #17 0x00005602238afb68 in btr_pcur_open_low (index=index@entry=0x560226fbbbf8, level=<optimized out>, level@entry=0, tuple=tuple@entry=0x1500340ba758, mode=mode@entry=PAGE_CUR_LE, latch_mode=latch_mode@entry=4098, cursor=cursor@entry=0x1500804a1a20, autoinc=0, mtr=0x1500804a1ca0) at /test/10.6_dbg/storage/innobase/include/btr0pcur.ic:439
      #18 0x00005602238b055e in row_search_index_entry (index=index@entry=0x560226fbbbf8, entry=entry@entry=0x1500340ba758, mode=4098, pcur=pcur@entry=0x1500804a1a20, mtr=mtr@entry=0x1500804a1ca0) at /test/10.6_dbg/storage/innobase/row/row0row.cc:1301
      #19 0x00005602238d8448 in row_upd_sec_index_entry (node=node@entry=0x1500340f9e98, thr=thr@entry=0x1500340fa6e0) at /test/10.6_dbg/storage/innobase/row/row0upd.cc:2068
      #20 0x00005602238d9638 in row_upd_sec_step (thr=0x1500340fa6e0, node=0x1500340f9e98) at /test/10.6_dbg/storage/innobase/row/row0upd.cc:2267
      #21 row_upd (thr=0x1500340fa6e0, node=0x1500340f9e98) at /test/10.6_dbg/storage/innobase/row/row0upd.cc:3050
      #22 row_upd_step (thr=thr@entry=0x1500340fa6e0) at /test/10.6_dbg/storage/innobase/row/row0upd.cc:3165
      #23 0x000056022380e994 in que_thr_step (thr=0x1500340fa6e0) at /test/10.6_dbg/storage/innobase/que/que0que.cc:635
      #24 que_run_threads_low (thr=0x1500340fa6e0) at /test/10.6_dbg/storage/innobase/que/que0que.cc:709
      #25 que_run_threads (thr=0x1500340fa6e0) at /test/10.6_dbg/storage/innobase/que/que0que.cc:729
      #26 0x000056022380f021 in que_eval_sql (info=info@entry=0x1500340413f8, sql=sql@entry=0x560223f5d100 "PROCEDURE DROP_TABLE_PROC () IS\ntid CHAR;\niid CHAR;\nDECLARE CURSOR cur_idx IS\nSELECT ID FROM SYS_INDEXES\nWHERE TABLE_ID = tid FOR UPDATE;\nBEGIN\nSELECT ID INTO tid FROM SYS_TABLES\nWHERE NAME = :name FO"..., reserve_dict_mutex=reserve_dict_mutex@entry=false, trx=trx@entry=0x1500807612a0) at /test/10.6_dbg/storage/innobase/que/que0que.cc:779
      #27 0x000056022387e3cb in row_drop_table_for_mysql (name=<optimized out>, name@entry=0x1500804a2c40 "test/t1#P#p5", trx=trx@entry=0x1500807612a0, sqlcom=sqlcom@entry=SQLCOM_DROP_TABLE, create_failed=create_failed@entry=false, nonatomic=<optimized out>, nonatomic@entry=true) at /test/10.6_dbg/storage/innobase/row/row0mysql.cc:3593
      #28 0x00005602237120bb in ha_innobase::delete_table (this=this@entry=0x150034017428, name=<optimized out>, name@entry=0x1500804a3550 "./test/t1#P#p5", sqlcom=sqlcom@entry=SQLCOM_DROP_TABLE) at /test/10.6_dbg/storage/innobase/handler/ha_innodb.cc:13032
      #29 0x00005602236fab58 in ha_innobase::delete_table (this=0x150034017428, name=0x1500804a3550 "./test/t1#P#p5") at /test/10.6_dbg/storage/innobase/handler/ha_innodb.cc:13159
      #30 0x0000560223626742 in ha_partition::del_ren_table (this=this@entry=0x150034014528, from=from@entry=0x1500804a48a0 "./test/t1", to=to@entry=0x0) at /test/10.6_dbg/sql/ha_partition.cc:2495
      #31 0x00005602236269c4 in ha_partition::delete_table (this=0x150034014528, name=0x1500804a48a0 "./test/t1") at /test/10.6_dbg/sql/ha_partition.cc:631
      #32 0x00005602232fa284 in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /test/10.6_dbg/sql/handler.cc:564
      #33 0x0000560223303f01 in ha_delete_table (thd=thd@entry=0x150034000db8, hton=hton@entry=0x560226d00128, path=path@entry=0x1500804a48a0 "./test/t1", db=db@entry=0x1500804a46a0, alias=alias@entry=0x1500804a46b0, generate_warning=<optimized out>) at /test/10.6_dbg/sql/handler.cc:2771
      #34 0x00005602230f22ef in mysql_rm_table_no_locks (thd=thd@entry=0x150034000db8, tables=tables@entry=0x150034013c10, if_exists=if_exists@entry=false, drop_temporary=drop_temporary@entry=false, drop_view=drop_view@entry=false, drop_sequence=drop_sequence@entry=false, dont_log_query=false, dont_free_locks=false) at /test/10.6_dbg/sql/sql_table.cc:2508
      #35 0x00005602230f405b in mysql_rm_table (thd=thd@entry=0x150034000db8, tables=tables@entry=0x150034013c10, if_exists=<optimized out>, drop_temporary=<optimized out>, drop_sequence=<optimized out>, dont_log_query=dont_log_query@entry=false) at /test/10.6_dbg/sql/sql_table.cc:2137
      #36 0x0000560223031310 in mysql_execute_command (thd=thd@entry=0x150034000db8) at /test/10.6_dbg/sql/structs.h:559
      #37 0x000056022301acdc in mysql_parse (thd=thd@entry=0x150034000db8, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x1500804a53d0) at /test/10.6_dbg/sql/sql_parse.cc:7972
      #38 0x0000560223028b3b in dispatch_command (command=command@entry=COM_QUERY, thd=thd@entry=0x150034000db8, packet=packet@entry=0x15003400b319 "DROP TABLE t1", packet_length=packet_length@entry=13, blocking=blocking@entry=true) at /test/10.6_dbg/sql/sql_class.h:1295
      #39 0x000056022302bfbe in do_command (thd=0x150034000db8, blocking=blocking@entry=true) at /test/10.6_dbg/sql/sql_parse.cc:1397
      #40 0x0000560223187c2e in do_handle_one_connection (connect=<optimized out>, connect@entry=0x5602271c9468, put_in_cache=put_in_cache@entry=true) at /test/10.6_dbg/sql/sql_connect.cc:1410
      #41 0x0000560223188331 in handle_one_connection (arg=arg@entry=0x5602271c9468) at /test/10.6_dbg/sql/sql_connect.cc:1312
      #42 0x000056022363c9b9 in pfs_spawn_thread (arg=0x5602270edc98) at /test/10.6_dbg/storage/perfschema/pfs.cc:2201
      #43 0x0000150083572609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #44 0x0000150083161293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

      This bug seems to point to an underlying issue. It is closely related to another sporadic and hard-to-define bug (logged as MDEV-25071) that I have been chasing for a long time, which occurs when creating many partitions and shows the same "closing tables" state:

      10.6.0>show processlist;
      +----+------+-----------+------+---------+------+----------------+--------------------+----------+
      | Id | User | Host      | db   | Command | Time | State          | Info               | Progress |
      +----+------+-----------+------+---------+------+----------------+--------------------+----------+
      |  5 | root | localhost | test | Query   |   23 | closing tables | DROP DATABASE test |    0.000 |
      |  7 | root | localhost | test | Query   |    0 | starting       | show processlist   |    0.000 |
      +----+------+-----------+------+---------+------+----------------+--------------------+----------+
      2 rows in set (0.000 sec)
      

      I marked this particular bug as critical because it has cost many hours of test-result debugging, and as far as I can tell it is only present in 10.6.

          Activity

            Marko Mäkelä added a comment:

            elenst pointed out that I had forgotten the ‘Simon says’ (ENGINE=InnoDB) and was actually testing MyISAM. Based on her advice, I tried a simpler test, but it does not hang for me:

            --source include/have_innodb.inc
            --source include/have_partition.inc
            --disable_query_log
            let $n=10000;
            while ($n) {
            SET GLOBAL innodb_limit_optimistic_insert_debug=7;
            CREATE TABLE t1 (c1 INT) ENGINE=InnoDB PARTITION BY HASH (c1) PARTITIONS 15;
            SET GLOBAL innodb_change_buffering_debug=1;
            DROP TABLE t1;
            dec $n;
            }
            


            Marko Mäkelä added a comment:

            With that test, I was able to repeat some hangs. The dict_sys.mutex is held by Thread 7, which is waiting for something in DBUG while trying to acquire buf_pool.mutex (which it is perfectly valid to acquire while holding dict_sys.mutex):

            Thread 7 (Thread 0x7fb549f80700 (LWP 2397063)):
            #0  0x000055cb71729d4e in DbugParse (cs=<optimized out>, cs@entry=0x7fb51c000ce0, control=<optimized out>, control@entry=0x55cb71971a65 "") at /mariadb/bb-10.6-monty/dbug/dbug.c:703
            #1  0x000055cb71729e28 in _db_push_ (control=0x55cb71971a65 "") at /mariadb/bb-10.6-monty/dbug/dbug.c:865
            #2  0x000055cb7171086a in safe_mutex_lock (mp=mp@entry=0x55cb7204fcc0 <buf_pool>, my_flags=my_flags@entry=0, file=file@entry=0x55cb71a3d4a8 "/mariadb/bb-10.6-monty/storage/innobase/buf/buf0flu.cc", line=line@entry=1551) at /mariadb/bb-10.6-monty/mysys/thr_mutex.c:236
            #3  0x000055cb7170b6ab in psi_mutex_lock (that=0x55cb7204fcc0 <buf_pool>, file=0x55cb71a3d4a8 "/mariadb/bb-10.6-monty/storage/innobase/buf/buf0flu.cc", line=1551) at /mariadb/bb-10.6-monty/mysys/my_thr_init.c:487
            #4  0x000055cb715af08f in inline_mysql_mutex_lock (that=<optimized out>, src_file=<optimized out>, src_line=1551) at /mariadb/bb-10.6-monty/include/mysql/psi/mysql_thread.h:746
            #5  buf_flush_lists (max_n=max_n@entry=18446744073709551615, lsn=lsn@entry=18446744073709551615) at /mariadb/bb-10.6-monty/storage/innobase/buf/buf0flu.cc:1551
            #6  0x000055cb715974e2 in buf_page_get_low (page_id={m_id = 500}, zip_size=<optimized out>, zip_size@entry=32, rw_latch=2, rw_latch@entry=0, guess=<optimized out>, guess@entry=0x7fb549f7b690, mode=<optimized out>, mode@entry=1, mtr=<optimized out>, mtr@entry=0x4, err=0x7fb549f7a9ac, allow_ibuf_merge=<optimized out>) at /mariadb/bb-10.6-monty/storage/innobase/buf/buf0buf.cc:2971
            #7  0x000055cb71598f45 in buf_page_get_gen (page_id=<optimized out>, page_id@entry={m_id = 140416614318493}, zip_size=<optimized out>, zip_size@entry=8589934593, rw_latch=<optimized out>, rw_latch@entry=1240991488, guess=<optimized out>, guess@entry=0x7fb549f7b980, mode=<optimized out>, mode@entry=11, mtr=<optimized out>, err=0x2, allow_ibuf_merge=<optimized out>) at /mariadb/bb-10.6-monty/storage/innobase/buf/buf0buf.cc:3103
            #8  0x000055cb71556788 in btr_cur_search_to_nth_level_func (index=<optimized out>, index@entry=0x55cb741a2308, level=<optimized out>, level@entry=0, tuple=<optimized out>, tuple@entry=0x7fb51c2f32b8, mode=<optimized out>, mode@entry=PAGE_CUR_LE, latch_mode=<optimized out>, latch_mode@entry=4098, cursor=<optimized out>, cursor@entry=0x7fb549f7b980, ahi_latch=0x0, mtr=0x7fb549f7bfc8, autoinc=0) at /mariadb/bb-10.6-monty/storage/innobase/btr/btr0cur.cc:1610
            #9  0x000055cb714a259a in btr_pcur_open_low (index=index@entry=0x55cb741a2308, level=0, tuple=tuple@entry=0x7fb51c2f32b8, mode=PAGE_CUR_LE, latch_mode=latch_mode@entry=4098, cursor=cursor@entry=0x7fb549f7b980, autoinc=0, mtr=mtr@entry=0x7fb549f7bfc8) at /mariadb/bb-10.6-monty/storage/innobase/include/btr0pcur.ic:439
            #10 0x000055cb714a2b4b in row_search_index_entry (index=index@entry=0x55cb741a2308, entry=entry@entry=0x7fb51c2f32b8, mode=4098, pcur=pcur@entry=0x7fb549f7b980, mtr=mtr@entry=0x7fb549f7bfc8) at /mariadb/bb-10.6-monty/storage/innobase/row/row0row.cc:1301
            #11 0x000055cb714c5461 in row_upd_sec_index_entry (node=0x7fb51c2deed8, thr=0x7fb51c2df720) at /mariadb/bb-10.6-monty/storage/innobase/row/row0upd.cc:2065
            #12 row_upd_sec_step (node=0x7fb51c2deed8, thr=0x7fb51c2df720) at /mariadb/bb-10.6-monty/storage/innobase/row/row0upd.cc:2264
            #13 row_upd (node=node@entry=0x7fb51c2deed8, thr=<optimized out>, thr@entry=0x7fb51c2df720) at /mariadb/bb-10.6-monty/storage/innobase/row/row0upd.cc:2994
            #14 0x000055cb714c3d5e in row_upd_step (thr=thr@entry=0x7fb51c2df720) at /mariadb/bb-10.6-monty/storage/innobase/row/row0upd.cc:3109
            #15 0x000055cb7140e1b4 in que_thr_step (thr=<optimized out>) at /mariadb/bb-10.6-monty/storage/innobase/que/que0que.cc:635
            #16 que_run_threads_low (thr=0x7fb51c2df720) at /mariadb/bb-10.6-monty/storage/innobase/que/que0que.cc:709
            #17 que_run_threads (thr=thr@entry=0x7fb51c2df720) at /mariadb/bb-10.6-monty/storage/innobase/que/que0que.cc:729
            #18 0x000055cb7140e9c4 in que_eval_sql (info=info@entry=0x55cb741ca518, sql=<optimized out>, reserve_dict_mutex=false, trx=trx@entry=0x7fb54b21a2a0) at /mariadb/bb-10.6-monty/storage/innobase/que/que0que.cc:779
            #19 0x000055cb71470999 in row_drop_table_for_mysql (name=<optimized out>, name@entry=0x7fb549f7cdc0 "test/t1#P#p1", trx=trx@entry=0x7fb54b21a2a0, sqlcom=sqlcom@entry=SQLCOM_DROP_TABLE, create_failed=<optimized out>, nonatomic=<optimized out>) at /mariadb/bb-10.6-monty/storage/innobase/row/row0mysql.cc:3602
            #20 0x000055cb713150c6 in ha_innobase::delete_table (this=<optimized out>, name=0x7fb549f7da30 "./test/t1#P#p1", sqlcom=SQLCOM_DROP_TABLE) at /mariadb/bb-10.6-monty/storage/innobase/handler/ha_innodb.cc:13088
            #21 0x000055cb71241d0f in ha_partition::del_ren_table (this=<optimized out>, from=0x7fb549f7ed30 "./test/t1", to=<optimized out>) at /mariadb/bb-10.6-monty/sql/ha_partition.cc:2509
            #22 0x000055cb70fc1bbe in hton_drop_table (hton=<optimized out>, path=0x1 <error: Cannot access memory at address 0x1>) at /mariadb/bb-10.6-monty/sql/handler.cc:577
            

            In safe_mutex_lock(), DBUG_PUSH_EMPTY was added supposedly to speed up the code, but I suspect that it actually slows down builds configured with cmake -DWITH_DBUG_TRACE=OFF. In any case, that should not be the cause of the hang.
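
            The hang backtrace above shows _db_push_("") calling DbugParse from safe_mutex_lock (presumably via DBUG_PUSH_EMPTY), and the crash backtrace in the description shows the matching _db_pop_ path (FreeState and DBUGCloseFile calling free) from safe_mutex_lock, which is where the "free(): invalid pointer" abort surfaces. The following is only a rough, hypothetical sketch of why this pattern does heap work on every lock acquisition; it is not the real dbug.c/thr_mutex.c code, and the struct and function names as well as the placement around the lock are illustrative assumptions:

            // Hypothetical model only; names and placement are assumptions, not MariaDB code.
            #include <cstdlib>
            #include <cstring>
            #include <mutex>

            struct debug_state {              // stands in for a per-thread DBUG settings stack entry
              char *keywords;                 // heap-allocated while parsing the control string
              debug_state *next;              // previously pushed state
            };

            static thread_local debug_state *dbg_stack = nullptr;

            static void push_empty_state() {  // roughly: DBUG_PUSH_EMPTY -> _db_push_("") -> DbugParse
              debug_state *s = static_cast<debug_state *>(std::malloc(sizeof *s));
              s->keywords = strdup("");       // even an empty control string is parsed and copied
              s->next = dbg_stack;
              dbg_stack = s;
            }

            static void pop_state() {         // roughly: _db_pop_ -> FreeState -> DBUGCloseFile -> free
              debug_state *s = dbg_stack;
              dbg_stack = s->next;
              std::free(s->keywords);         // a corrupted pointer here aborts with "free(): invalid pointer"
              std::free(s);
            }

            int safe_mutex_lock_model(std::mutex &m) {
              push_empty_state();             // heap allocation on every safe_mutex_lock() call
              m.lock();
              pop_state();                    // heap free on every call; the crash site in the first backtrace
              return 0;
            }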

            Note: according to the core dump, buf_pool.mutex is available. Thread 14 is also trying to acquire it.

            Thread 12 and Thread 13 are both trying to acquire dict_sys.mutex, which is held by Thread 7.

            It is unclear to me why Thread 7 would not be able to proceed. I will try to reproduce this with other build options to see whether SAFE_MUTEX or DBUG is contributing to the hang.


            Marko Mäkelä added a comment:

            I analyzed the rr replay trace that elenst provided to me. I just now realized that there was no reason for the two SET GLOBAL statements to be inside the loop in my mtr test case. The key here is that

            SET GLOBAL innodb_change_buffering_debug=1;
            

            is causing contention. In that trace, rr record appears to have played its trick of unfair scheduling. DROP TABLE t1 would be waiting forever for buf_pool.mutex while holding dict_sys.mutex and dict_sys.latch:

            10.6 49ddfb63780c55b4133755c3479e13c7ea9a9d7b

            #4  0x0000555a5dbc0a91 in inline_mysql_mutex_lock (that=0x555a5eaaacc0 <buf_pool>, src_file=0x555a5e2aae08 "/home/mdbe/MDEV-25297/10.6/storage/innobase/buf/buf0buf.cc", src_line=2933)
                at /home/mdbe/MDEV-25297/10.6/include/mysql/psi/mysql_thread.h:746
            #5  0x0000555a5dbc803c in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x7ff674155ed0, err=0x7ff674154c2c, allow_ibuf_merge=true)
                at /home/mdbe/MDEV-25297/10.6/storage/innobase/buf/buf0buf.cc:2933
            

            At least one thread is constantly acquiring and releasing buf_pool.mutex:

            #0  buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=15, mtr=0x7ff5489f90f0, err=0x7ff5489f7e5c, allow_ibuf_merge=true) at /home/mdbe/MDEV-25297/10.6/storage/innobase/buf/buf0buf.cc:2935
            #1  0x0000555a5dbc8804 in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=15, mtr=0x7ff5489f90f0, err=0x7ff5489f7e5c, allow_ibuf_merge=true)
                at /home/mdbe/MDEV-25297/10.6/storage/innobase/buf/buf0buf.cc:3104
            #2  0x0000555a5db8d77d in btr_cur_search_to_nth_level_func (index=0x555a61813698, level=0, tuple=0x7ff528004aa8, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x7ff5489f8e50, ahi_latch=0x0, mtr=0x7ff5489f90f0, 
                autoinc=0) at /home/mdbe/MDEV-25297/10.6/storage/innobase/btr/btr0cur.cc:1610
            #3  0x0000555a5dac4068 in btr_pcur_open_low (index=0x555a61813698, level=0, tuple=0x7ff528004aa8, mode=PAGE_CUR_LE, latch_mode=8194, cursor=0x7ff5489f8e50, autoinc=0, mtr=0x7ff5489f90f0)
                at /home/mdbe/MDEV-25297/10.6/storage/innobase/include/btr0pcur.ic:439
            #4  0x0000555a5dac8120 in row_search_index_entry (index=0x555a61813698, entry=0x7ff528004aa8, mode=8194, pcur=0x7ff5489f8e50, mtr=0x7ff5489f90f0)
                at /home/mdbe/MDEV-25297/10.6/storage/innobase/row/row0row.cc:1301
            #5  0x0000555a5dabbbbf in row_purge_remove_sec_if_poss_leaf (node=0x555a618210c8, index=0x555a61813698, entry=0x7ff528004aa8) at /home/mdbe/MDEV-25297/10.6/storage/innobase/row/row0purge.cc:473
            #6  0x0000555a5dabc0e3 in row_purge_remove_sec_if_poss (node=0x555a618210c8, index=0x555a61813698, entry=0x7ff528004aa8) at /home/mdbe/MDEV-25297/10.6/storage/innobase/row/row0purge.cc:582
            #7  0x0000555a5dabc2f7 in row_purge_del_mark (node=0x555a618210c8) at /home/mdbe/MDEV-25297/10.6/storage/innobase/row/row0purge.cc:652
            #8  0x0000555a5dabdd27 in row_purge_record_func (node=0x555a618210c8, undo_rec=0x7ff66400aea8 "", thr=0x555a61820ed8, updated_extern=false) at /home/mdbe/MDEV-25297/10.6/storage/innobase/row/row0purge.cc:1066
            

            Both are accessing the same page (334 in the system tablespace), the secondary index ID_IND of SYS_TABLES. Because of the debug setting, they are trying to evict the page from the buffer pool:

            #if defined UNIV_DEBUG || defined UNIV_IBUF_DEBUG
            re_evict:
            	if (mode != BUF_GET_IF_IN_POOL
            	    && mode != BUF_GET_IF_IN_POOL_OR_WATCH) {
            	} else if (!ibuf_debug) {
            	} else if (fil_space_t* space = fil_space_t::get(page_id.space())) {
            		/* Try to evict the block from the buffer pool, to use the
            		insert buffer (change buffer) as much as possible. */
             
            		mysql_mutex_lock(&buf_pool.mutex);
             
            		fix_block->unfix();
             
            		/* Blocks cannot be relocated or enter or exit the
            		buf_pool while we are holding the buf_pool.mutex. */
            		const bool evicted = buf_LRU_free_page(&fix_block->page, true);
            		space->release();
            …
            		if (!fix_block->page.oldest_modification()) {
            			goto re_evict;
            		}
            

            So, both threads would keep acquiring and releasing fil_system.mutex and buf_pool.mutex in a busy loop. I think that the following should fix the livelock between the two threads:

            diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
            index 93178bada97..ace267db9af 100644
            --- a/storage/innobase/buf/buf0buf.cc
            +++ b/storage/innobase/buf/buf0buf.cc
            @@ -2971,7 +2971,8 @@ buf_page_get_low(
             		buf_flush_lists(ULINT_UNDEFINED, LSN_MAX);
             		buf_flush_wait_batch_end_acquiring_mutex(false);
             
            -		if (!fix_block->page.oldest_modification()) {
            +		if (fix_block->page.buf_fix_count() == 1
            +		    && !fix_block->page.oldest_modification()) {
             			goto re_evict;
             		}
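
            As a rough illustration of why this is a livelock and how the extra condition breaks it (this is not InnoDB code; retry_evict_old(), retry_evict_patched() and their parameters are hypothetical stand-ins for the checks around the re_evict label): buf_LRU_free_page() cannot evict a block that another thread still has buffer-fixed, so when two threads each buffer-fix the same SYS_TABLES page, the pre-patch condition sends both of them back to re_evict indefinitely, whereas the patched condition only retries when the caller is the sole fixer:

            // Hypothetical stand-ins for the re_evict retry decision; not the actual buf_page_get_low() code.
            #include <cstdint>
            #include <iostream>

            // Pre-patch: retry eviction whenever the page has no pending modification,
            // regardless of how many threads currently buffer-fix the block.
            static bool retry_evict_old(uint32_t fix_count, bool dirty) {
              (void)fix_count;
              return !dirty;
            }

            // Patched: also require that the caller is the only thread fixing the block;
            // otherwise buf_LRU_free_page() cannot evict it, and retrying only spins on
            // fil_system.mutex and buf_pool.mutex.
            static bool retry_evict_patched(uint32_t fix_count, bool dirty) {
              return fix_count == 1 && !dirty;
            }

            int main() {
              // Two threads both buffer-fix the SYS_TABLES page => buf_fix_count() == 2, page clean.
              std::cout << "old:     " << retry_evict_old(2, false) << '\n';      // 1: both threads loop forever
              std::cout << "patched: " << retry_evict_patched(2, false) << '\n';  // 0: give up and use the page
              return 0;
            }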
             
            

            Another problem that is illustrated by this test case is that the internal watchdog is not displaying any warning messages before killing the server:

            2021-03-31  8:35:56 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.mutex. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
            


            Marko Mäkelä added a comment:

            Roel, are you able to repeat the memory corruption inside DBUG? I was not. I think that one should be filed as a separate ticket.

            Does the patch work for you? I think that the livelock is present already in 10.5.7 due to MDEV-23399.

            Roel Van de Paar added a comment (edited):
            • Verified that a 10.6 debug build from today (1/4/21 at revision 1bd4115841ecded24217e5d753ed4d9822b4cffd) no longer crashes on either testcase.
            • Verified with an ASAN/UBSAN build of the same revision that neither testcase produces any ASAN/UBSAN errors.

            Thus deemed fully fixed for the time being. If any related issues occur, I will log a new bug.


            People

              Assignee: Marko Mäkelä
              Reporter: Roel Van de Paar
