Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25776

Assertion `bpage->in_page_hash' in buf_pool_t::page_hash_get_low

Details

    Description

      Scenario:
      1. Start the server and generate some initial data   (row_format REDUNDANT )
      2. Nine sessions run concurrent UPDATEs (autocommit) on the same table 
      3. SIGKILL server during 2. is ongoing
      4. Restart attempt with success
      5. OPTIMIZE TABLE `test`.`tab1` harvests
      mysqld: /data/Server/10.5_old/storage/innobase/include/buf0buf.h:1613: buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint): Assertion `bpage->in_page_hash' failed.
      (rr) bt
      #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
      #1  0x00007fbcbffbd859 in __GI_abort () at abort.c:79
      #2  0x00007fbcbffbd729 in __assert_fail_base (fmt=0x7fbcc0153588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x556821eeaf60 "bpage->in_page_hash", file=0x556821eea420 "/data/Server/10.5_old/storage/innobase/include/buf0buf.h", 
          line=1613, function=<optimized out>) at assert.c:92
      #3  0x00007fbcbffcef36 in __GI___assert_fail (assertion=assertion@entry=0x556821eeaf60 "bpage->in_page_hash", file=file@entry=0x556821eea420 "/data/Server/10.5_old/storage/innobase/include/buf0buf.h", line=line@entry=1613, 
          function=function@entry=0x556821eeadc0 "buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint)") at assert.c:101
      #4  0x000055682110b9c1 in buf_pool_t::page_hash_get_low (fold=4168528640, id=..., this=<optimized out>) at /data/Server/10.5_old/storage/innobase/include/buf0buf.h:1612
      #5  buf_flush_check_neighbor (id=..., fold=fold@entry=5245779, lru=lru@entry=false) at /data/Server/10.5_old/storage/innobase/buf/buf0flu.cc:1002
      #6  0x000055682110d185 in buf_flush_check_neighbors (space=..., id=..., contiguous=contiguous@entry=true, lru=lru@entry=false) at /data/Server/10.5_old/storage/innobase/buf/buf0flu.cc:1052
      #7  0x0000556821119321 in buf_flush_try_neighbors (space=space@entry=0x615000004180, page_id=..., contiguous=contiguous@entry=true, lru=lru@entry=false, n_flushed=n_flushed@entry=95, n_to_flush=n_to_flush@entry=200)
          at /data/Server/10.5_old/storage/innobase/buf/buf0flu.cc:1142
      #8  0x000055682111dac3 in buf_do_flush_list_batch (max_n=max_n@entry=200, lsn=lsn@entry=18446744073709551615) at /data/Server/10.5_old/storage/innobase/buf/buf0flu.cc:1487
      #9  0x000055682112081f in buf_flush_lists (max_n=max_n@entry=200, lsn=lsn@entry=18446744073709551615) at /data/Server/10.5_old/storage/innobase/buf/buf0flu.cc:1602
      #10 0x000055682112310b in buf_flush_page_cleaner () at /data/Server/10.5_old/storage/innobase/buf/buf0flu.cc:2239
      #11 0x00007fbcc04e6609 in start_thread (arg=<optimized out>) at pthread_create.c:477
      #12 0x00007fbcc00ba293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      (rr)
      sdp:/data/Results/1621969282/TBR-1091/dev/shm/vardir/1621969282/45/1/rr
      _RR_TRACE_DIR="." rr replay --mark-stdio mysqld-0 # Fate of server till SIGKILL
      _RR_TRACE_DIR="." rr replay --mark-stdio mysqld-1 # Fate of server after restart
       
      GIT_SHOW: HEAD, origin/bb-10.5-monty, origin/10.5 c80cecb5e3e509d37929b4f446edf9b6c636b98f 2021-05-23T19:53:38+03:00
       
      RQG
      ====
      git clone https://github.com/mleich1/rqg --branch experimental RQG
       
      rqg.pl \
      --gendata=MDEV-25776.zz \
      --max_gd_duration=600 \
      --grammar=MDEV-25776.yy \
      --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--connect_timeout=60 \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--net_read_timeout=30 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--lock-wait-timeout=86400 \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --no-mask \
      --queries=10000000 \
      --seed=random \
      --reporters=Backtrace \
      --reporters=ErrorLog \
      --reporters=CrashRecovery1 \
      --validators=None \
      --mysqld=--log_output=none \
      --mysqld=--log-bin \
      --mysqld=--log_bin_trust_function_creators=1 \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --engine=InnoDB \
      --restart_timeout=240 \
      --duration=300 \
      --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      --mysqld=--innodb_stats_persistent=on \
      --mysqld=--innodb_adaptive_hash_index=on \
      --threads=9 \
      --mysqld=--innodb-use-native-aio=0 \
      --rr=Extended \
      --rr_options="--wait" \
      --mysqld=--innodb_page_size=4K \
      --mysqld=--innodb-buffer-pool-size=5M \
      --no_mask \
      --workdir=<local settings> \
      --vardir=<local settings> \
      --mtr-build-thread=<local settings> \
      --basedir1=<local settings> \
      --script_debug=_nix_
      

      Attachments

        1. MDEV-25776.yy
          0.2 kB
        2. MDEV-25776.zz
          2 kB
        3. 25776_100.yy
          0.2 kB
        4. 25776_100.zz
          2 kB

        Issue Links

          Activity

            mleich Matthias Leich added a comment - - edited

            Alternative scenario showing the same assert.
            1. Start the server and generate some initial data   (row_format REDUNDANT )
            2. Two sessions run concurrent UPDATEs (autocommit) on the same table.
            3. At some point of time  happens
            mysqld: /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1513: buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint): Assertion `bpage->in_page_hash' failed.
            (rr) bt
            #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
            #1  0x00003f8c3a539859 in __GI_abort () at abort.c:79
            #2  0x00003f8c3a539729 in __assert_fail_base (fmt=0x3f8c3a6cf588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55953e054920 "bpage->in_page_hash", file=0x55953e054680 "/data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h", line=1513, 
                function=<optimized out>) at assert.c:92
            #3  0x00003f8c3a54af36 in __GI___assert_fail (assertion=0x55953e054920 "bpage->in_page_hash", file=0x55953e054680 "/data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h", line=1513, 
                function=0x55953e054780 "buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint)") at assert.c:101
            #4  0x000055953cb49302 in buf_pool_t::page_hash_get_low (this=0x55953f1aea80 <buf_pool>, id=..., fold=5242928) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1512
            #5  0x000055953cfd019a in buf_flush_check_neighbor (id=..., fold=5242928, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:967
            #6  0x000055953cfd08e1 in buf_flush_check_neighbors (space=..., id=..., contiguous=true, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1017
            #7  0x000055953cfd1752 in buf_flush_try_neighbors (space=0x612000071740, page_id=..., contiguous=true, lru=false, n_flushed=13, n_to_flush=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1107
            #8  0x000055953cfd42bb in buf_do_flush_list_batch (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1444
            #9  0x000055953cfd4e4b in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1560
            #10 0x000055953cf99084 in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:2973
            #11 0x000055953cf99bcb in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:3106
            #12 0x000055953cf313b7 in btr_cur_search_to_nth_level_func (index=0x6160007a2908, level=0, tuple=0x61900109f608, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x244942d882d0, ahi_latch=0x0, mtr=0x244942d88690, autoinc=0)
                at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/btr/btr0cur.cc:1608
            #13 0x000055953cd1db0b in row_ins_sec_index_entry_low (flags=0, mode=2, index=0x6160007a2908, offsets_heap=0x619000975480, heap=0x619000972c80, entry=0x61900109f608, trx_id=0, thr=0x6210003f9228)
                at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0ins.cc:2992
            #14 0x000055953cd1f3e1 in row_ins_sec_index_entry (index=0x6160007a2908, entry=0x61900109f608, thr=0x6210003f9228, check_foreign=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0ins.cc:3315
            #15 0x000055953ce07105 in row_upd_sec_index_entry (node=0x6210003f8e28, thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:2224
            #16 0x000055953ce07730 in row_upd_sec_step (node=0x6210003f8e28, thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:2251
            #17 0x000055953ce0cc4a in row_upd (node=0x6210003f8e28, thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:2982
            #18 0x000055953ce0d40c in row_upd_step (thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:3097
            #19 0x000055953cd64c57 in row_update_for_mysql (prebuilt=0x6210003f8588) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0mysql.cc:1769
            #20 0x000055953c9f3992 in ha_innobase::update_row (this=0x61d0007d14b8, old_row=0x61a0001a6d20 "\300\blssnehno", new_row=0x61a0001a6ab8 "\300\blssnehno") at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/handler/ha_innodb.cc:8440
            #21 0x000055953c0fdd6d in handler::ha_update_row (this=0x61d0007d14b8, old_data=0x61a0001a6d20 "\300\blssnehno", new_data=0x61a0001a6ab8 "\300\blssnehno") at /data/Server/bb-10.6-MDEV-25506_K/sql/handler.cc:7293
            #22 0x000055953bc16b95 in mysql_update (thd=0x62b0000d9218, table_list=0x62b0000a83c0, fields=..., values=..., conds=0x62b0000a8e90, order_num=0, order=0x0, limit=18446744073709551615, ignore=false, found_return=0x244942d8b270, updated_return=0x244942d8b290)
                at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_update.cc:1081
            #23 0x000055953b93b44b in mysql_execute_command (thd=0x62b0000d9218) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:4399
            #24 0x000055953b953933 in mysql_parse (thd=0x62b0000d9218, rawbuf=0x62b0000a8238 "UPDATE `tab1` SET e = 'air' WHERE pk = 63 /* E_R Thread1 QNO 132 CON_ID 14 */", length=77, parser_state=0x244942d8bb20) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:8016
            #25 0x000055953b92bc76 in dispatch_command (command=COM_QUERY, thd=0x62b0000d9218, packet=0x629000bea219 "UPDATE `tab1` SET e = 'air' WHERE pk = 63 /* E_R Thread1 QNO 132 CON_ID 14 */ ", packet_length=78, blocking=true)
                at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:1897
            #26 0x000055953b929057 in do_command (thd=0x62b0000d9218, blocking=true) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:1406
            #27 0x000055953bd24ed2 in do_handle_one_connection (connect=0x608000003038, put_in_cache=true) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_connect.cc:1410
            #28 0x000055953bd24834 in handle_one_connection (arg=0x608000002e38) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_connect.cc:1312
            #29 0x000030365276b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
            #30 0x00003f8c3a636293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
            (rr)
            pluto:/data/Results/1623073203/TBR-977/dev/shm/vardir/1623073203/83/1/rr
            _RR_TRACE_DIR="." rr replay --mark-stdio
             
            perl rqg.pl \
            --gendata=MDEV-25776_100.zz \
            --max_gd_duration=300 \
            --grammar=MDEV-25776_100.yy \
            --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
            --mysqld=--loose-idle_write_transaction_timeout=0 \
            --mysqld=--loose-idle_transaction_timeout=0 \
            --mysqld=--loose-idle_readonly_transaction_timeout=0 \
            --mysqld=--connect_timeout=60 \
            --mysqld=--interactive_timeout=28800 \
            --mysqld=--slave_net_timeout=60 \
            --mysqld=--net_read_timeout=30 \
            --mysqld=--net_write_timeout=60 \
            --mysqld=--loose-table_lock_wait_timeout=50 \
            --mysqld=--wait_timeout=28800 \
            --mysqld=--lock-wait-timeout=86400 \
            --mysqld=--innodb-lock-wait-timeout=50 \
            --no-mask \
            --queries=10000000 \
            --seed=random \
            --reporters=Backtrace \
            --reporters=ErrorLog \
            --reporters=CrashRecovery1 \
            --validators=None \
            --mysqld=--log_output=none \
            --mysqld=--log-bin \
            --mysqld=--log_bin_trust_function_creators=1 \
            --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
            --engine=InnoDB \
            --restart_timeout=240 \
            --duration=300 \
            --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
            --mysqld=--innodb_stats_persistent=on \
            --mysqld=--innodb_adaptive_hash_index=on \
            --threads=2 \
            --mysqld=--innodb_change_buffering_debug=1 \
            --mysqld=--innodb_change_buffering=all \
            --mysqld=--innodb-use-native-aio=0 \
            --rr=Extended \
            --rr_options="--chaos --wait" \
            --mysqld=--innodb_page_size=4K \
            --mysqld=--innodb-buffer-pool-size=5M \
            --no_mask \
            --workdir=<local settings> \
            --vardir=<local settings> \
            --mtr-build-thread=<local settings> \
            --basedir1=<local settings> \
            --script_debug=_nix_
            
            

            mleich Matthias Leich added a comment - - edited Alternative scenario showing the same assert. 1. Start the server and generate some initial data (row_format REDUNDANT ) 2. Two sessions run concurrent UPDATEs (autocommit) on the same table. 3. At some point of time happens mysqld: /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1513: buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint): Assertion `bpage->in_page_hash' failed. (rr) bt #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #1 0x00003f8c3a539859 in __GI_abort () at abort.c:79 #2 0x00003f8c3a539729 in __assert_fail_base (fmt=0x3f8c3a6cf588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55953e054920 "bpage->in_page_hash", file=0x55953e054680 "/data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h", line=1513, function=<optimized out>) at assert.c:92 #3 0x00003f8c3a54af36 in __GI___assert_fail (assertion=0x55953e054920 "bpage->in_page_hash", file=0x55953e054680 "/data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h", line=1513, function=0x55953e054780 "buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint)") at assert.c:101 #4 0x000055953cb49302 in buf_pool_t::page_hash_get_low (this=0x55953f1aea80 <buf_pool>, id=..., fold=5242928) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1512 #5 0x000055953cfd019a in buf_flush_check_neighbor (id=..., fold=5242928, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:967 #6 0x000055953cfd08e1 in buf_flush_check_neighbors (space=..., id=..., contiguous=true, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1017 #7 0x000055953cfd1752 in buf_flush_try_neighbors (space=0x612000071740, page_id=..., contiguous=true, lru=false, n_flushed=13, n_to_flush=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1107 #8 0x000055953cfd42bb in buf_do_flush_list_batch (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1444 #9 0x000055953cfd4e4b in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1560 #10 0x000055953cf99084 in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:2973 #11 0x000055953cf99bcb in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:3106 #12 0x000055953cf313b7 in btr_cur_search_to_nth_level_func (index=0x6160007a2908, level=0, tuple=0x61900109f608, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x244942d882d0, ahi_latch=0x0, mtr=0x244942d88690, autoinc=0) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/btr/btr0cur.cc:1608 #13 0x000055953cd1db0b in row_ins_sec_index_entry_low (flags=0, mode=2, index=0x6160007a2908, offsets_heap=0x619000975480, heap=0x619000972c80, entry=0x61900109f608, trx_id=0, thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0ins.cc:2992 #14 0x000055953cd1f3e1 in row_ins_sec_index_entry (index=0x6160007a2908, entry=0x61900109f608, thr=0x6210003f9228, check_foreign=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0ins.cc:3315 #15 0x000055953ce07105 in row_upd_sec_index_entry (node=0x6210003f8e28, thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:2224 #16 0x000055953ce07730 in row_upd_sec_step (node=0x6210003f8e28, thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:2251 #17 0x000055953ce0cc4a in row_upd (node=0x6210003f8e28, thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:2982 #18 0x000055953ce0d40c in row_upd_step (thr=0x6210003f9228) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0upd.cc:3097 #19 0x000055953cd64c57 in row_update_for_mysql (prebuilt=0x6210003f8588) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0mysql.cc:1769 #20 0x000055953c9f3992 in ha_innobase::update_row (this=0x61d0007d14b8, old_row=0x61a0001a6d20 "\300\blssnehno", new_row=0x61a0001a6ab8 "\300\blssnehno") at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/handler/ha_innodb.cc:8440 #21 0x000055953c0fdd6d in handler::ha_update_row (this=0x61d0007d14b8, old_data=0x61a0001a6d20 "\300\blssnehno", new_data=0x61a0001a6ab8 "\300\blssnehno") at /data/Server/bb-10.6-MDEV-25506_K/sql/handler.cc:7293 #22 0x000055953bc16b95 in mysql_update (thd=0x62b0000d9218, table_list=0x62b0000a83c0, fields=..., values=..., conds=0x62b0000a8e90, order_num=0, order=0x0, limit=18446744073709551615, ignore=false, found_return=0x244942d8b270, updated_return=0x244942d8b290) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_update.cc:1081 #23 0x000055953b93b44b in mysql_execute_command (thd=0x62b0000d9218) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:4399 #24 0x000055953b953933 in mysql_parse (thd=0x62b0000d9218, rawbuf=0x62b0000a8238 "UPDATE `tab1` SET e = 'air' WHERE pk = 63 /* E_R Thread1 QNO 132 CON_ID 14 */", length=77, parser_state=0x244942d8bb20) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:8016 #25 0x000055953b92bc76 in dispatch_command (command=COM_QUERY, thd=0x62b0000d9218, packet=0x629000bea219 "UPDATE `tab1` SET e = 'air' WHERE pk = 63 /* E_R Thread1 QNO 132 CON_ID 14 */ ", packet_length=78, blocking=true) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:1897 #26 0x000055953b929057 in do_command (thd=0x62b0000d9218, blocking=true) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_parse.cc:1406 #27 0x000055953bd24ed2 in do_handle_one_connection (connect=0x608000003038, put_in_cache=true) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_connect.cc:1410 #28 0x000055953bd24834 in handle_one_connection (arg=0x608000002e38) at /data/Server/bb-10.6-MDEV-25506_K/sql/sql_connect.cc:1312 #29 0x000030365276b609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #30 0x00003f8c3a636293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (rr) pluto:/data/Results/1623073203/TBR-977/dev/shm/vardir/1623073203/83/1/rr _RR_TRACE_DIR="." rr replay --mark-stdio   perl rqg.pl \ --gendata=MDEV-25776_100.zz \ --max_gd_duration=300 \ --grammar=MDEV-25776_100.yy \ --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \ --mysqld=--loose-idle_write_transaction_timeout=0 \ --mysqld=--loose-idle_transaction_timeout=0 \ --mysqld=--loose-idle_readonly_transaction_timeout=0 \ --mysqld=--connect_timeout=60 \ --mysqld=--interactive_timeout=28800 \ --mysqld=--slave_net_timeout=60 \ --mysqld=--net_read_timeout=30 \ --mysqld=--net_write_timeout=60 \ --mysqld=--loose-table_lock_wait_timeout=50 \ --mysqld=--wait_timeout=28800 \ --mysqld=--lock-wait-timeout=86400 \ --mysqld=--innodb-lock-wait-timeout=50 \ --no-mask \ --queries=10000000 \ --seed=random \ --reporters=Backtrace \ --reporters=ErrorLog \ --reporters=CrashRecovery1 \ --validators=None \ --mysqld=--log_output=none \ --mysqld=--log-bin \ --mysqld=--log_bin_trust_function_creators=1 \ --mysqld=--loose-debug_assert_on_not_freed_memory=0 \ --engine=InnoDB \ --restart_timeout=240 \ --duration=300 \ --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \ --mysqld=--innodb_stats_persistent=on \ --mysqld=--innodb_adaptive_hash_index=on \ --threads=2 \ --mysqld=--innodb_change_buffering_debug=1 \ --mysqld=--innodb_change_buffering=all \ --mysqld=--innodb-use-native-aio=0 \ --rr=Extended \ --rr_options="--chaos --wait" \ --mysqld=--innodb_page_size=4K \ --mysqld=--innodb-buffer-pool-size=5M \ --no_mask \ --workdir=<local settings> \ --vardir=<local settings> \ --mtr-build-thread=<local settings> \ --basedir1=<local settings> \ --script_debug=_nix_

            The trace on sdp, is no longer available, but I checked the other trace:

            ssh pluto
            rr replay /data/Results/1623073203/TBR-977/dev/shm/vardir/1623073203/83/1/rr/latest-trace
            

            This is not from a main branch, but from an attempt to fix a problem in MDEV-25506 which in the end was fixed by another approach. Because the original failure was reported for the 10.5 branch, it does make sense to continue the analysis.

            bb-10.6-MDEV-25506_FTS 8850f7d854465bee24639ceb2148baf7b0ec1b39

            Thread 16 hit Hardware watchpoint 1: -location bpage.in_page_hash
             
            Old value = true
            New value = false
            buf_pool_t::watch_unset (this=0x55953f1aea80 <buf_pool>, id=...) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1633
            1633	      HASH_DELETE(buf_page_t, hash, &page_hash, fold, watch);
            (rr) bt
            #0  buf_pool_t::watch_unset (this=0x55953f1aea80 <buf_pool>, id=...) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1633
            #1  0x000055953cf31ddc in btr_cur_search_to_nth_level_func (index=0x616000796c08, level=0, tuple=0x619000f83f08, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x392a222f5080, ahi_latch=0x0, mtr=0x392a222f5570, 
                autoinc=0) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/btr/btr0cur.cc:1689
            #2  0x000055953cdb3f85 in btr_pcur_open_low (index=0x616000796c08, level=0, tuple=0x619000f83f08, mode=PAGE_CUR_LE, latch_mode=8194, cursor=0x392a222f5080, autoinc=0, mtr=0x392a222f5570)
                at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/btr0pcur.ic:439
            #3  0x000055953cdbaa8e in row_search_index_entry (index=0x616000796c08, entry=0x619000f83f08, mode=8194, pcur=0x392a222f5080, mtr=0x392a222f5570)
                at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0row.cc:1303
            #4  0x000055953cda367e in row_purge_remove_sec_if_poss_leaf (node=0x61a00000a308, index=0x616000796c08, entry=0x619000f83f08) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0purge.cc:546
            …
            (rr) thr a 2 bt
            #8  0x00003f8c3a62c3f1 in fdatasync (fd=7) at ../sysdeps/unix/sysv/linux/fdatasync.c:28
            …
            #14 0x000055953cb988b3 in log_write_flush_to_disk_low (lsn=5186117) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/log/log0log.cc:627
            #15 0x000055953cb99a3b in log_write_up_to (lsn=5186117, flush_to_disk=true, rotate_key=false, callback=0x0) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/log/log0log.cc:830
            #16 0x000055953cb99b8b in log_buffer_flush_to_disk (sync=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/log/log0log.cc:842
            #17 0x000055953cfd4d40 in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1540
            #18 0x000055953cf99084 in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:2973
            #19 0x000055953cf99bcb in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:3106
            #20 0x000055953cf313b7 in btr_cur_search_to_nth_level_func (index=0x6160007a2908, level=0, tuple=0x61900109f608, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x244942d882d0, ahi_latch=0x0, mtr=0x244942d88690, autoinc=0) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/btr/btr0cur.cc:1608
            

            The assertion would fail in Thread 2 a little later:

            (rr) c
            Continuing.
            mysqld: /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1513: buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint): Assertion `bpage->in_page_hash' failed.
             
            Thread 2 received signal SIGABRT, Aborted.
            [Switching to Thread 1738107.1740840]
            __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
            50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
            (rr) bt
            …
            #4  0x000055953cb49302 in buf_pool_t::page_hash_get_low (this=0x55953f1aea80 <buf_pool>, id=..., fold=5242928) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1512
            #5  0x000055953cfd019a in buf_flush_check_neighbor (id=..., fold=5242928, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:967
            #6  0x000055953cfd08e1 in buf_flush_check_neighbors (space=..., id=..., contiguous=true, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1017
            #7  0x000055953cfd1752 in buf_flush_try_neighbors (space=0x612000071740, page_id=..., contiguous=true, lru=false, n_flushed=13, n_to_flush=18446744073709551615)
                at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1107
            #8  0x000055953cfd42bb in buf_do_flush_list_batch (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1444
            #9  0x000055953cfd4e4b in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1560
            #10 0x000055953cf99084 in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true)
                at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:2973
            

            In MDEV-25801 this case of the function call buf_flush_lists() was renamed to buf_flush_list(). This code is specific to innodb_change_buffering_debug=1:

            bb-10.6-MDEV-25506_FTS 8850f7d854465bee24639ceb2148baf7b0ec1b39

            (rr) list
            2968				return(NULL);
            2969			}
            2970	
            2971			fix_block->fix();
            2972			mysql_mutex_unlock(&buf_pool.mutex);
            2973			buf_flush_lists(ULINT_UNDEFINED, LSN_MAX);
            2974			buf_flush_wait_batch_end_acquiring_mutex(false);
            2975	
            2976			if (fix_block->page.buf_fix_count() == 1
            2977			    && !fix_block->page.oldest_modification()) {
            (rr) p ibuf_debug 
            $2 = 1
            

            Even though this code is only present in debug builds, we must analyze the root cause and check if something similar might be possible in release builds. At the time of the crash, yet another thread (Thread 15, the original one was Thread 16) is executing in the same function:

            #14 0x000055953d1e3821 in tpool::waitable_task::wait (this=0x55953fe09e60 <log_flush_task>) at /data/Server/bb-10.6-MDEV-25506_K/tpool/task.cc:85
            #15 0x000055953cfd4bfc in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1533
            

            The log_flush_task, which is reducing concurrency here, was removed in MDEV-25948.

            Only one buf_flush_list() batch (using the current function name) is supposed to run at a time, and the reference counting around buf_pool.n_flush_list appears to be mostly working. buf_pool_t::watch_unset() in Thread 16 had already cleared the debug field in_page_hash before Thread 2 started to execute the next batch. But, while Thread 2 is already running, Thread 16 is still in the process of executing HASH_DELETE:

            1632	      ut_d(watch->in_page_hash= false);
            1633	      HASH_DELETE(buf_page_t, hash, &page_hash, fold, watch);
            1634	      hash_lock->write_unlock();
            

            The pointer was still reachable via

            1512	    HASH_SEARCH(hash, &page_hash, fold, buf_page_t*, bpage,
            1513	                ut_ad(bpage->in_page_hash), id == bpage->id());
            

            because we are still calculating the hash value:

            Thread 2 received signal SIGABRT, Aborted.
            [Switching to Thread 1738107.1740840]
            __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
            50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
            1: x/i $pc
            => 0x3f8c3a55a18b <__GI_raise+203>:	mov    0x108(%rsp),%rax
            (rr) thr a 16 bt
             
            Thread 16 (Thread 1738107.1739723):
            #0  0x000055953cb48057 in ut_hash_ulint (key=1657039487, table_size=2557) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/ut0rnd.ic:47
            #1  0x000055953cb49be3 in buf_pool_t::page_hash_table::hash (fold=5242928, n=2557) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1755
            #2  0x000055953cb49c0c in buf_pool_t::page_hash_table::calc_hash (fold=5242928, n_cells=2557) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1760
            #3  0x000055953cb49b7a in buf_pool_t::page_hash_table::calc_hash (this=0x55953f1aec08 <buf_pool+392>, fold=5242928) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1750
            #4  0x000055953cf5b6ca in buf_pool_t::watch_unset (this=0x55953f1aea80 <buf_pool>, id=...) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1633
            

            The latch in the cache line is still held in exclusive mode by Thread 16:

            (rr) p &buf_pool.page_hash.array[1380]
            $30 = (hash_cell_t *) 0x62a000008d20
            (rr) p *(hash_cell_t*)0x62a000008d00@8
            $31 = {{node = 0x80000000}, {node = 0x367e30dcd8d0}, {node = 0x0}, {node = 0x367e30dcc968}, {node = 0x55953f1b4010 <buf_pool+21904>}, {node = 0x0}, {node = 0x0}, {node = 0x0}}
            

            Thread 2 in buf_pool_t::page_hash_get_low() is holding buf_pool.mutex, so it is allowed to traverse buf_pool.page_hash without acquiring any of the latches inside it. The bug is in buf_pool_t::watch_unset(), which needs to acquire the mutex before any of the following steps:

            1632	      ut_d(watch->in_page_hash= false);
            1633	      HASH_DELETE(buf_page_t, hash, &page_hash, fold, watch);
            1634	      hash_lock->write_unlock();
            1635	      // Now that the watch is detached from page_hash, release it to watch[].
            1636	      mysql_mutex_lock(&mutex);
            

            Any other code than the change buffer watch is holding both buf_pool.mutex and the buf_pool.page_hash latch while modifying buf_pool.page_hash.

            While checking whether other code than the change buffer could be affected by this kind of races, I encountered the following work-around in buf_page_init_for_read():

              /* We must acquire hash_lock this early to prevent
              a race condition with buf_pool_t::watch_remove() */
              page_hash_latch *hash_lock= buf_pool.page_hash.lock_get(fold);
              hash_lock->write_lock();
            

            A simple fix would be to extend the critical section of buf_pool.mutex in buf_pool_t::watch_unset() and to adjust the above work-around in buf_page_init_for_read(). Related to that, we should also add buf_pool.mutex protection to buf_pool_t::watch_set():

                ut_ad(!w->in_page_hash);
                ut_d(w->in_page_hash= true); /* Not holding buf_pool.mutex here! */
                HASH_INSERT(buf_page_t, hash, &page_hash, fold, w);
            

            I do not think that it is safe to relax the debug assertion itself. I do not think that extending the use of buf_pool.mutex in buf_pool_t::watch_set() should cause a significant performance regression, because that code should only be invoked when a page does not exist in the buffer pool and we will attempt to use the change buffer.

            This race condition can be avoided by setting innodb_change_buffering=none.

            marko Marko Mäkelä added a comment - The trace on sdp , is no longer available, but I checked the other trace: ssh pluto rr replay /data/Results/1623073203/TBR-977/dev/shm/vardir/1623073203/83/1/rr/latest-trace This is not from a main branch, but from an attempt to fix a problem in MDEV-25506 which in the end was fixed by another approach. Because the original failure was reported for the 10.5 branch, it does make sense to continue the analysis. bb-10.6-MDEV-25506_FTS 8850f7d854465bee24639ceb2148baf7b0ec1b39 Thread 16 hit Hardware watchpoint 1: -location bpage.in_page_hash   Old value = true New value = false buf_pool_t::watch_unset (this=0x55953f1aea80 <buf_pool>, id=...) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1633 1633 HASH_DELETE(buf_page_t, hash, &page_hash, fold, watch); (rr) bt #0 buf_pool_t::watch_unset (this=0x55953f1aea80 <buf_pool>, id=...) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1633 #1 0x000055953cf31ddc in btr_cur_search_to_nth_level_func (index=0x616000796c08, level=0, tuple=0x619000f83f08, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x392a222f5080, ahi_latch=0x0, mtr=0x392a222f5570, autoinc=0) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/btr/btr0cur.cc:1689 #2 0x000055953cdb3f85 in btr_pcur_open_low (index=0x616000796c08, level=0, tuple=0x619000f83f08, mode=PAGE_CUR_LE, latch_mode=8194, cursor=0x392a222f5080, autoinc=0, mtr=0x392a222f5570) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/btr0pcur.ic:439 #3 0x000055953cdbaa8e in row_search_index_entry (index=0x616000796c08, entry=0x619000f83f08, mode=8194, pcur=0x392a222f5080, mtr=0x392a222f5570) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0row.cc:1303 #4 0x000055953cda367e in row_purge_remove_sec_if_poss_leaf (node=0x61a00000a308, index=0x616000796c08, entry=0x619000f83f08) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/row/row0purge.cc:546 … (rr) thr a 2 bt #8 0x00003f8c3a62c3f1 in fdatasync (fd=7) at ../sysdeps/unix/sysv/linux/fdatasync.c:28 … #14 0x000055953cb988b3 in log_write_flush_to_disk_low (lsn=5186117) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/log/log0log.cc:627 #15 0x000055953cb99a3b in log_write_up_to (lsn=5186117, flush_to_disk=true, rotate_key=false, callback=0x0) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/log/log0log.cc:830 #16 0x000055953cb99b8b in log_buffer_flush_to_disk (sync=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/log/log0log.cc:842 #17 0x000055953cfd4d40 in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1540 #18 0x000055953cf99084 in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:2973 #19 0x000055953cf99bcb in buf_page_get_gen (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:3106 #20 0x000055953cf313b7 in btr_cur_search_to_nth_level_func (index=0x6160007a2908, level=0, tuple=0x61900109f608, mode=PAGE_CUR_LE, latch_mode=2, cursor=0x244942d882d0, ahi_latch=0x0, mtr=0x244942d88690, autoinc=0) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/btr/btr0cur.cc:1608 The assertion would fail in Thread 2 a little later: (rr) c Continuing. mysqld: /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1513: buf_page_t* buf_pool_t::page_hash_get_low(page_id_t, ulint): Assertion `bpage->in_page_hash' failed.   Thread 2 received signal SIGABRT, Aborted. [Switching to Thread 1738107.1740840] __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. (rr) bt … #4 0x000055953cb49302 in buf_pool_t::page_hash_get_low (this=0x55953f1aea80 <buf_pool>, id=..., fold=5242928) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1512 #5 0x000055953cfd019a in buf_flush_check_neighbor (id=..., fold=5242928, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:967 #6 0x000055953cfd08e1 in buf_flush_check_neighbors (space=..., id=..., contiguous=true, lru=false) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1017 #7 0x000055953cfd1752 in buf_flush_try_neighbors (space=0x612000071740, page_id=..., contiguous=true, lru=false, n_flushed=13, n_to_flush=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1107 #8 0x000055953cfd42bb in buf_do_flush_list_batch (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1444 #9 0x000055953cfd4e4b in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1560 #10 0x000055953cf99084 in buf_page_get_low (page_id=..., zip_size=0, rw_latch=2, guess=0x0, mode=11, mtr=0x244942d88690, err=0x244942d87290, allow_ibuf_merge=true) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0buf.cc:2973 In MDEV-25801 this case of the function call buf_flush_lists() was renamed to buf_flush_list() . This code is specific to innodb_change_buffering_debug=1 : bb-10.6-MDEV-25506_FTS 8850f7d854465bee24639ceb2148baf7b0ec1b39 (rr) list 2968 return(NULL); 2969 } 2970 2971 fix_block->fix(); 2972 mysql_mutex_unlock(&buf_pool.mutex); 2973 buf_flush_lists(ULINT_UNDEFINED, LSN_MAX); 2974 buf_flush_wait_batch_end_acquiring_mutex(false); 2975 2976 if (fix_block->page.buf_fix_count() == 1 2977 && !fix_block->page.oldest_modification()) { (rr) p ibuf_debug $2 = 1 Even though this code is only present in debug builds, we must analyze the root cause and check if something similar might be possible in release builds. At the time of the crash, yet another thread (Thread 15, the original one was Thread 16) is executing in the same function: #14 0x000055953d1e3821 in tpool::waitable_task::wait (this=0x55953fe09e60 <log_flush_task>) at /data/Server/bb-10.6-MDEV-25506_K/tpool/task.cc:85 #15 0x000055953cfd4bfc in buf_flush_lists (max_n=18446744073709551615, lsn=18446744073709551615) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/buf/buf0flu.cc:1533 The log_flush_task , which is reducing concurrency here, was removed in MDEV-25948 . Only one buf_flush_list() batch (using the current function name) is supposed to run at a time, and the reference counting around buf_pool.n_flush_list appears to be mostly working. buf_pool_t::watch_unset() in Thread 16 had already cleared the debug field in_page_hash before Thread 2 started to execute the next batch. But, while Thread 2 is already running, Thread 16 is still in the process of executing HASH_DELETE : 1632 ut_d(watch->in_page_hash= false ); 1633 HASH_DELETE(buf_page_t, hash, &page_hash, fold, watch); 1634 hash_lock->write_unlock(); The pointer was still reachable via 1512 HASH_SEARCH(hash, &page_hash, fold, buf_page_t*, bpage, 1513 ut_ad(bpage->in_page_hash), id == bpage->id()); because we are still calculating the hash value: Thread 2 received signal SIGABRT, Aborted. [Switching to Thread 1738107.1740840] __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory. 1: x/i $pc => 0x3f8c3a55a18b <__GI_raise+203>: mov 0x108(%rsp),%rax (rr) thr a 16 bt   Thread 16 (Thread 1738107.1739723): #0 0x000055953cb48057 in ut_hash_ulint (key=1657039487, table_size=2557) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/ut0rnd.ic:47 #1 0x000055953cb49be3 in buf_pool_t::page_hash_table::hash (fold=5242928, n=2557) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1755 #2 0x000055953cb49c0c in buf_pool_t::page_hash_table::calc_hash (fold=5242928, n_cells=2557) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1760 #3 0x000055953cb49b7a in buf_pool_t::page_hash_table::calc_hash (this=0x55953f1aec08 <buf_pool+392>, fold=5242928) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1750 #4 0x000055953cf5b6ca in buf_pool_t::watch_unset (this=0x55953f1aea80 <buf_pool>, id=...) at /data/Server/bb-10.6-MDEV-25506_K/storage/innobase/include/buf0buf.h:1633 The latch in the cache line is still held in exclusive mode by Thread 16: (rr) p &buf_pool.page_hash.array[1380] $30 = (hash_cell_t *) 0x62a000008d20 (rr) p *(hash_cell_t*)0x62a000008d00@8 $31 = {{node = 0x80000000}, {node = 0x367e30dcd8d0}, {node = 0x0}, {node = 0x367e30dcc968}, {node = 0x55953f1b4010 <buf_pool+21904>}, {node = 0x0}, {node = 0x0}, {node = 0x0}} Thread 2 in buf_pool_t::page_hash_get_low() is holding buf_pool.mutex , so it is allowed to traverse buf_pool.page_hash without acquiring any of the latches inside it. The bug is in buf_pool_t::watch_unset() , which needs to acquire the mutex before any of the following steps: 1632 ut_d(watch->in_page_hash= false ); 1633 HASH_DELETE(buf_page_t, hash, &page_hash, fold, watch); 1634 hash_lock->write_unlock(); 1635 // Now that the watch is detached from page_hash, release it to watch[]. 1636 mysql_mutex_lock(&mutex); Any other code than the change buffer watch is holding both buf_pool.mutex and the buf_pool.page_hash latch while modifying buf_pool.page_hash . While checking whether other code than the change buffer could be affected by this kind of races, I encountered the following work-around in buf_page_init_for_read() : /* We must acquire hash_lock this early to prevent a race condition with buf_pool_t::watch_remove() */ page_hash_latch *hash_lock= buf_pool.page_hash.lock_get(fold); hash_lock->write_lock(); A simple fix would be to extend the critical section of buf_pool.mutex in buf_pool_t::watch_unset() and to adjust the above work-around in buf_page_init_for_read() . Related to that, we should also add buf_pool.mutex protection to buf_pool_t::watch_set() : ut_ad(!w->in_page_hash); ut_d(w->in_page_hash= true ); /* Not holding buf_pool.mutex here! */ HASH_INSERT(buf_page_t, hash, &page_hash, fold, w); I do not think that it is safe to relax the debug assertion itself. I do not think that extending the use of buf_pool.mutex in buf_pool_t::watch_set() should cause a significant performance regression, because that code should only be invoked when a page does not exist in the buffer pool and we will attempt to use the change buffer. This race condition can be avoided by setting innodb_change_buffering=none .

            People

              marko Marko Mäkelä
              mleich Matthias Leich
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.