Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-29590

Deadlock between ibuf_insert_to_index_page_low() and ibuf_delete_for_discarded_space()

Details

    Description

      origin/bb-10.6-MDEV-22718 62de4addd247fceced29674b4a29dfec72b4340b 2022-09-19T15:30:09+03:00
      But per Marko the problem is in the main tree 10.6 too.
       
      Scenario:
      1. Start the DB server and generate some initial data.
      2. Nine concurrent sessions run a DDL/DML mix.
      After some time the DB server aborts with
      # 2022-09-19T14:28:20 [237985] | [rr 240412 607243]2022-09-19 14:24:50 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
       
      pluto:/data/results/1663596022/TBR-788$ _RR_TRACE_DIR=./1/rr/ rr replay --mark-stdio
       
      # git clone https://github.com/mleich1/rqg --branch experimental RQG
      #
      # GIT_SHOW: HEAD -> experimental, origin/experimental 8b78c9ad5db438b8a938faeac86dddf1e4094fdd 2022-09-19T14:29:39+02:00
      # rqg.pl  : Version 4.0.6 (2022-05)
      #
      # $RQG_HOME/rqg.pl \
      # --views \
      # --grammar=conf/mariadb/partitions_innodb.yy \
      # --redefine=conf/mariadb/alter_table.yy \
      # --redefine=conf/mariadb/instant_add.yy \
      # --redefine=conf/mariadb/modules/alter_table_columns.yy \
      # --redefine=conf/mariadb/bulk_insert.yy \
      # --redefine=conf/mariadb/modules/foreign_keys.yy \
      # --redefine=conf/mariadb/modules/locks.yy \
      # --redefine=conf/mariadb/modules/sql_mode.yy \
      # --redefine=conf/mariadb/versioning.yy \
      # --redefine=conf/mariadb/sequences.yy \
      # --redefine=conf/mariadb/modules/locks-10.4-extra.yy \
      # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
      # --mysqld=--loose-idle_write_transaction_timeout=0 \
      # --mysqld=--loose-idle_transaction_timeout=0 \
      # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      # --mysqld=--connect_timeout=60 \
      # --mysqld=--interactive_timeout=28800 \
      # --mysqld=--slave_net_timeout=60 \
      # --mysqld=--net_read_timeout=30 \
      # --mysqld=--net_write_timeout=60 \
      # --mysqld=--loose-table_lock_wait_timeout=50 \
      # --mysqld=--wait_timeout=28800 \
      # --mysqld=--lock-wait-timeout=86400 \
      # --mysqld=--innodb-lock-wait-timeout=50 \
      # --no-mask \
      # --queries=10000000 \
      # --seed=random \
      # --reporters=Backtrace \
      # --reporters=ErrorLog \
      # --reporters=Deadlock1 \
      # --validators=None \
      # --mysqld=--log_output=none \
      # --mysqld=--log_bin_trust_function_creators=1 \
      # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      # --engine=InnoDB \
      # --restart_timeout=240 \
      # --mysqld=--plugin-load-add=file_key_management.so \
      # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
      # --mysqld=--plugin-load-add=provider_lzo.so \
      # --mysqld=--plugin-load-add=provider_bzip2.so \
      # --mysqld=--plugin-load-add=provider_lzma.so \
      # --mysqld=--plugin-load-add=provider_snappy.so \
      # --mysqld=--plugin-load-add=provider_lz4.so \
      # --duration=300 \
      # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      # --mysqld=--innodb_file_per_table=1 \
      # --mysqld=--loose-innodb_read_only_compressed=OFF \
      # --mysqld=--loose-innodb-sync-debug \
      # --mysqld=--innodb_stats_persistent=on \
      # --mysqld=--innodb_adaptive_hash_index=off \
      # --mysqld=--log-bin \
      # --mysqld=--sync-binlog=1 \
      # --mysqld=--loose-innodb_evict_tables_on_commit_debug=off \
      # --mysqld=--loose-max-statement-time=30 \
      # --threads=9 \
      # --mysqld=--innodb-use-native-aio=0 \
      # --mysqld=--loose-gdb \
      # --mysqld=--loose-debug-gdb \
      # --rr=Extended \
      # --rr_options=--chaos --wait \
      # --mysqld=--innodb_undo_tablespaces=3 \
      # --mysqld=--innodb_undo_log_truncate=ON \
      # --mysqld=--transaction-isolation=READ-UNCOMMITTED \
      # --mysqld=--loose_innodb_change_buffering=all \
      # --mysqld=--innodb_rollback_on_timeout=OFF \
      # --mysqld=--innodb_page_size=16K \
      # --mysqld=--innodb-buffer-pool-size=8M \
      # <local settings>
      

      Attachments

        Issue Links

          Activity

            The holder of exclusive dict_sys.latch and lock_sys.latch is blocked deep inside trx_t::commit() of a table-or-partition-rebuilding DDL operation that would delete an .ibd file. Specifically, ibuf_delete_for_discarded_space() is waiting for the change buffer root page latch (page 4 in the system tablespace). That latch is being held by another thread, which is unnecessarily waiting for lock_sys.latch:

            #15 0x000055a2975dbfa8 in lock_sys_t::rd_lock (this=0x55a29a5fe500 <lock_sys>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/include/lock0lock.h:809
            #16 0x000055a2975aceae in TMLockGuard::TMLockGuard (this=0x64000851ffc0, hash=..., id=<error reading variable: Cannot access memory at address 0x8a>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:250
            #17 0x000055a2975bb8ed in lock_move_reorganize_page (block=0x7ffd3eb40e10, oblock=0x7ffd3eb3f750) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:2303
            #18 0x000055a29794ac46 in btr_page_reorganize_low (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/i--Type <RET> for more, q to quit, c to continue without paging--
            nnobase/btr/btr0btr.cc:1414
            #19 0x000055a29794ca6a in btr_page_reorganize (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/btr/btr0btr.cc:1589
            #20 0x000055a297597edb in ibuf_insert_to_index_page_low (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, offsets=0x6400085202e0, heap=0x613000499540, mtr=0x640008520890, page_cur=0x640008520360) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3677
            #21 0x000055a297598e06 in ibuf_insert_to_index_page (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3836
            #22 0x000055a29759d22c in ibuf_merge_or_delete_for_page (block=0x7ffd3eb40e10, page_id=..., zip_size=0) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:4365
            

            The change buffer merge need not invoke lock_move_reorganize_page(), because no transactional locks can exist on pages for which changes have been buffered. This is guaranteed by a check in ibuf_insert_low():

            		hash_cell_t* cell = lock_sys.rec_hash.cell_get(page_id.fold());
            		lock_sys.rec_hash.latch(cell)->acquire();
            		const lock_t* lock = lock_sys_t::get_first(*cell, page_id);
            		lock_sys.rec_hash.latch(cell)->release();
            		lock_sys.rd_unlock();
            		if (lock) {
            			goto commit_exit;
            		}
            

            The change buffer was disabled by default in MDEV-27734. During this run, innodb_change_buffering=all had been used.

            marko Marko Mäkelä added a comment - The holder of exclusive dict_sys.latch and lock_sys.latch is blocked deep inside trx_t::commit() of a table-or-partition-rebuilding DDL operation that would delete an .ibd file. Specifically, ibuf_delete_for_discarded_space() is waiting for the change buffer root page latch (page 4 in the system tablespace). That latch is being held by another thread, which is unnecessarily waiting for lock_sys.latch : #15 0x000055a2975dbfa8 in lock_sys_t::rd_lock (this=0x55a29a5fe500 <lock_sys>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/include/lock0lock.h:809 #16 0x000055a2975aceae in TMLockGuard::TMLockGuard (this=0x64000851ffc0, hash=..., id=<error reading variable: Cannot access memory at address 0x8a>) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:250 #17 0x000055a2975bb8ed in lock_move_reorganize_page (block=0x7ffd3eb40e10, oblock=0x7ffd3eb3f750) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/lock/lock0lock.cc:2303 #18 0x000055a29794ac46 in btr_page_reorganize_low (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/i--Type <RET> for more, q to quit, c to continue without paging-- nnobase/btr/btr0btr.cc:1414 #19 0x000055a29794ca6a in btr_page_reorganize (cursor=0x640008520360, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/btr/btr0btr.cc:1589 #20 0x000055a297597edb in ibuf_insert_to_index_page_low (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, offsets=0x6400085202e0, heap=0x613000499540, mtr=0x640008520890, page_cur=0x640008520360) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3677 #21 0x000055a297598e06 in ibuf_insert_to_index_page (entry=0x61600f64f0c0, block=0x7ffd3eb40e10, index=0x61600f64e708, mtr=0x640008520890) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:3836 #22 0x000055a29759d22c in ibuf_merge_or_delete_for_page (block=0x7ffd3eb40e10, page_id=..., zip_size=0) at /data/Server/bb-10.6-MDEV-22718/storage/innobase/ibuf/ibuf0ibuf.cc:4365 The change buffer merge need not invoke lock_move_reorganize_page() , because no transactional locks can exist on pages for which changes have been buffered. This is guaranteed by a check in ibuf_insert_low() : hash_cell_t* cell = lock_sys.rec_hash.cell_get(page_id.fold()); lock_sys.rec_hash.latch(cell)->acquire(); const lock_t* lock = lock_sys_t::get_first(*cell, page_id); lock_sys.rec_hash.latch(cell)->release(); lock_sys.rd_unlock(); if (lock) { goto commit_exit; } The change buffer was disabled by default in MDEV-27734 . During this run, innodb_change_buffering=all had been used.

            I pushed a fix to the branch where this was observed. I decided against modifying dict_index_t::has_locking(), because all other calls to it are related to code that will never be invoked during a change buffer merge.

            marko Marko Mäkelä added a comment - I pushed a fix to the branch where this was observed. I decided against modifying dict_index_t::has_locking() , because all other calls to it are related to code that will never be invoked during a change buffer merge.

            People

              marko Marko Mäkelä
              mleich Matthias Leich
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.