[MDEV-29590] Deadlock between ibuf_insert_to_index_page_low() and ibuf_delete_for_discarded_space() - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.6.10
Fix Version/s: 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1
Component/s: Storage Engine - InnoDB
Labels:
- rr-profile-analyzed

Description

origin/bb-10.6-MDEV-22718 62de4addd247fceced29674b4a29dfec72b4340b 2022-09-19T15:30:09+03:00

But per Marko the problem is in the main tree 10.6 too.

Scenario:

1. Start the DB server and generate some initial data.

2. Nine concurrent sessions run a DDL/DML mix.

After some time the DB server aborts with

# 2022-09-19T14:28:20 [237985] | [rr 240412 607243]2022-09-19 14:24:50 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/

pluto:/data/results/1663596022/TBR-788$ _RR_TRACE_DIR=./1/rr/ rr replay --mark-stdio

# git clone https://github.com/mleich1/rqg --branch experimental RQG

# GIT_SHOW: HEAD -> experimental, origin/experimental 8b78c9ad5db438b8a938faeac86dddf1e4094fdd 2022-09-19T14:29:39+02:00

# rqg.pl  : Version 4.0.6 (2022-05)

# $RQG_HOME/rqg.pl \

# --views \

# --grammar=conf/mariadb/partitions_innodb.yy \

# --redefine=conf/mariadb/alter_table.yy \

# --redefine=conf/mariadb/instant_add.yy \

# --redefine=conf/mariadb/modules/alter_table_columns.yy \

# --redefine=conf/mariadb/bulk_insert.yy \

# --redefine=conf/mariadb/modules/foreign_keys.yy \

# --redefine=conf/mariadb/modules/locks.yy \

# --redefine=conf/mariadb/modules/sql_mode.yy \

# --redefine=conf/mariadb/versioning.yy \

# --redefine=conf/mariadb/sequences.yy \

# --redefine=conf/mariadb/modules/locks-10.4-extra.yy \

# --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \

# --mysqld=--loose-idle_write_transaction_timeout=0 \

# --mysqld=--loose-idle_transaction_timeout=0 \

# --mysqld=--loose-idle_readonly_transaction_timeout=0 \

# --mysqld=--connect_timeout=60 \

# --mysqld=--interactive_timeout=28800 \

# --mysqld=--slave_net_timeout=60 \

# --mysqld=--net_read_timeout=30 \

# --mysqld=--net_write_timeout=60 \

# --mysqld=--loose-table_lock_wait_timeout=50 \

# --mysqld=--wait_timeout=28800 \

# --mysqld=--lock-wait-timeout=86400 \

# --mysqld=--innodb-lock-wait-timeout=50 \

# --no-mask \

# --queries=10000000 \

# --seed=random \

# --reporters=Backtrace \

# --reporters=ErrorLog \

# --reporters=Deadlock1 \

# --validators=None \

# --mysqld=--log_output=none \

# --mysqld=--log_bin_trust_function_creators=1 \

# --mysqld=--loose-debug_assert_on_not_freed_memory=0 \

# --engine=InnoDB \

# --restart_timeout=240 \

# --mysqld=--plugin-load-add=file_key_management.so \

# --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \

# --mysqld=--plugin-load-add=provider_lzo.so \

# --mysqld=--plugin-load-add=provider_bzip2.so \

# --mysqld=--plugin-load-add=provider_lzma.so \

# --mysqld=--plugin-load-add=provider_snappy.so \

# --mysqld=--plugin-load-add=provider_lz4.so \

# --duration=300 \

# --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \

# --mysqld=--innodb_file_per_table=1 \

# --mysqld=--loose-innodb_read_only_compressed=OFF \

# --mysqld=--loose-innodb-sync-debug \

# --mysqld=--innodb_stats_persistent=on \

# --mysqld=--innodb_adaptive_hash_index=off \

# --mysqld=--log-bin \

# --mysqld=--sync-binlog=1 \

# --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \

# --mysqld=--loose-max-statement-time=30 \

# --threads=9 \

# --mysqld=--innodb-use-native-aio=0 \

# --mysqld=--loose-gdb \

# --mysqld=--loose-debug-gdb \

# --rr=Extended \

# --rr_options=--chaos --wait \

# --mysqld=--innodb_undo_tablespaces=3 \

# --mysqld=--innodb_undo_log_truncate=ON \

# --mysqld=--transaction-isolation=READ-UNCOMMITTED \

# --mysqld=--loose_innodb_change_buffering=all \

# --mysqld=--innodb_rollback_on_timeout=OFF \

# --mysqld=--innodb_page_size=16K \

# --mysqld=--innodb-buffer-pool-size=8M \

# <local settings>

Attachments

Issue Links

relates to

MDEV-27734 Set innodb_change_buffering=none by default

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2022-09-21 12:35

The holder of exclusive dict_sys.latch and lock_sys.latch is blocked deep inside trx_t::commit() of a table-or-partition-rebuilding DDL operation that would delete an .ibd file. Specifically, ibuf_delete_for_discarded_space() is waiting for the change buffer root page latch (page 4 in the system tablespace). That latch is being held by another thread, which is unnecessarily waiting for lock_sys.latch:

#15 0x000055a2975dbfa8 in lock_sys_t::rd_lock (this=0x55a29a5fe500 <lock_sys>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/include/lock0lock.h:809

#16 0x000055a2975aceae in TMLockGuard::TMLockGuard (this=0x64000851ffc0, hash=..., id=<error reading variable: Cannot access memory at address 0x8a>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:250

#17 0x000055a2975bb8ed in lock_move_reorganize_page (block=0x7ffd3eb40e10, oblock=0x7ffd3eb3f750) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:2303

#18 0x000055a29794ac46 in btr_page_reorganize_low (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/i--Type <RET> for more, q to quit, c to continue without paging--

nnobase/btr/btr0btr.cc:1414

#19 0x000055a29794ca6a in btr_page_reorganize (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/btr/btr0btr.cc:1589

#20 0x000055a297597edb in ibuf_insert_to_index_page_low (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, offsets=0x6400085202e0, heap=0x613000499540, mtr=0x640008520890, page_cur=0x640008520360) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3677

#21 0x000055a297598e06 in ibuf_insert_to_index_page (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3836

#22 0x000055a29759d22c in ibuf_merge_or_delete_for_page (block=0x7ffd3eb40e10, page_id=..., zip_size=0) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:4365

The change buffer merge need not invoke lock_move_reorganize_page(), because no transactional locks can exist on pages for which changes have been buffered. This is guaranteed by a check in ibuf_insert_low():

		hash_cell_t* cell = lock_sys.rec_hash.cell_get(page_id.fold());

		lock_sys.rec_hash.latch(cell)->acquire();

		const lock_t* lock = lock_sys_t::get_first(*cell, page_id);

		lock_sys.rec_hash.latch(cell)->release();

		lock_sys.rd_unlock();

		if (lock) {

			goto commit_exit;

The change buffer was disabled by default in ~~MDEV-27734~~. During this run, innodb_change_buffering=all had been used.

Marko Mäkelä added a comment - 2022-09-21 12:35 The holder of exclusive dict_sys.latch and lock_sys.latch is blocked deep inside trx_t::commit() of a table-or-partition-rebuilding DDL operation that would delete an .ibd file. Specifically, ibuf_delete_for_discarded_space() is waiting for the change buffer root page latch (page 4 in the system tablespace). That latch is being held by another thread, which is unnecessarily waiting for lock_sys.latch : #15 0x000055a2975dbfa8 in lock_sys_t::rd_lock (this=0x55a29a5fe500 <lock_sys>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/include/lock0lock.h:809 #16 0x000055a2975aceae in TMLockGuard::TMLockGuard (this=0x64000851ffc0, hash=..., id=<error reading variable: Cannot access memory at address 0x8a>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:250 #17 0x000055a2975bb8ed in lock_move_reorganize_page (block=0x7ffd3eb40e10, oblock=0x7ffd3eb3f750) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:2303 #18 0x000055a29794ac46 in btr_page_reorganize_low (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/i--Type <RET> for more, q to quit, c to continue without paging-- nnobase/btr/btr0btr.cc:1414 #19 0x000055a29794ca6a in btr_page_reorganize (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/btr/btr0btr.cc:1589 #20 0x000055a297597edb in ibuf_insert_to_index_page_low (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, offsets=0x6400085202e0, heap=0x613000499540, mtr=0x640008520890, page_cur=0x640008520360) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3677 #21 0x000055a297598e06 in ibuf_insert_to_index_page (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3836 #22 0x000055a29759d22c in ibuf_merge_or_delete_for_page (block=0x7ffd3eb40e10, page_id=..., zip_size=0) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:4365 The change buffer merge need not invoke lock_move_reorganize_page() , because no transactional locks can exist on pages for which changes have been buffered. This is guaranteed by a check in ibuf_insert_low() : hash_cell_t* cell = lock_sys.rec_hash.cell_get(page_id.fold()); lock_sys.rec_hash.latch(cell)->acquire(); const lock_t* lock = lock_sys_t::get_first(*cell, page_id); lock_sys.rec_hash.latch(cell)->release(); lock_sys.rd_unlock(); if (lock) { goto commit_exit; } The change buffer was disabled by default in MDEV-27734 . During this run, innodb_change_buffering=all had been used.

Marko Mäkelä added a comment - 2022-09-21 12:55

I pushed a fix to the branch where this was observed. I decided against modifying dict_index_t::has_locking(), because all other calls to it are related to code that will never be invoked during a change buffer merge.

Marko Mäkelä added a comment - 2022-09-21 12:55 I pushed a fix to the branch where this was observed. I decided against modifying dict_index_t::has_locking() , because all other calls to it are related to code that will never be invoked during a change buffer merge.

MariaDB Server

Deadlock between ibuf_insert_to_index_page_low() and ibuf_delete_for_discarded_space()

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration