[MDEV-22790] Race between btr_page_mtr_lock() dropping AHI on the same block Created: 2020-06-03  Updated: 2022-08-25  Resolved: 2020-06-05

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5.4, 10.2, 10.3, 10.4
Fix Version/s: 10.5.4, 10.2.33, 10.3.24, 10.4.14

Type: Bug Priority: Critical
Reporter: Matthias Leich Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: ASAN, corruption, race, rr-profile

Attachments: File 000650.log    
Issue Links:
Problem/Incident
causes MDEV-29384 Hangs caused by innodb_adaptive_hash_... Closed
is caused by MDEV-22456 Dropping the adaptive hash index may ... Closed

 Description   

ASAN failure hit during RQG test on development source tree
origin/bb-10.5-MDEV-15053-3 be3c9a23fd014ed80d48e2984dec78db921a7a00 2020-06-02T20:37:37+03:00
 
==100160==ERROR: AddressSanitizer: heap-use-after-free on address 0x618000140938 at pc 0x560d2e722c69 bp 0x664073915570 sp 0x664073915560
READ of size 8 at 0x618000140938 thread T19
    #0 0x560d2e722c68 in rec_is_metadata(unsigned char const*, dict_index_t const&) storage/innobase/include/rem0rec.h:734
    #1 0x560d2ec4b24a in btr_search_drop_page_hash_index(buf_block_t*) storage/innobase/btr/btr0sea.cc:1263
    #2 0x560d2ec74c04 in buf_page_mtr_lock storage/innobase/buf/buf0buf.cc:2946
    #3 0x560d2ec77240 in buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool) storage/innobase/buf/buf0buf.cc:3521
    #4 0x560d2ec778b1 in buf_page_get_gen(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool) storage/innobase/buf/buf0buf.cc:3594
    #5 0x560d2e72464a in btr_block_get_func(dict_index_t const&, unsigned long, unsigned long, bool, char const*, unsigned int, mtr_t*) storage/innobase/include/btr0btr.h:240
    #6 0x560d2ebfe77b in btr_cur_latch_leaves(buf_block_t*, unsigned long, btr_cur_t*, mtr_t*) storage/innobase/btr/btr0cur.cc:246
    #7 0x560d2ec0f844 in btr_cur_open_at_rnd_pos_func(dict_index_t*, unsigned long, btr_cur_t*, char const*, unsigned int, mtr_t*) storage/innobase/btr/btr0cur.cc:3028
    #8 0x560d2ec273a8 in btr_estimate_number_of_different_key_vals(dict_index_t*) storage/innobase/btr/btr0cur.cc:6683
    #9 0x560d2ed67790 in dict_stats_update_transient_for_index storage/innobase/dict/dict0stats.cc:887
    #10 0x560d2ed67c08 in dict_stats_update_transient storage/innobase/dict/dict0stats.cc:945
    #11 0x560d2ed73228 in dict_stats_update(dict_table_t*, dict_stats_upd_option_t) storage/innobase/dict/dict0stats.cc:3331
    #12 0x560d2e6356da in dict_stats_init storage/innobase/include/dict0stats.ic:165
    #13 0x560d2e64a3f0 in ha_innobase::open(char const*, int, unsigned int) storage/innobase/handler/ha_innodb.cc:5831
    #14 0x560d2db6e0ed in handler::ha_open(TABLE*, char const*, int, unsigned int, st_mem_root*, List<String>*) sql/handler.cc:2906
    #15 0x560d2d6d3bbb in open_table_from_share(THD*, TABLE_SHARE*, st_mysql_const_lex_string const*, unsigned int, unsigned int, unsigned int, TABLE*, bool, List<String>*) sql/table.cc:4190
    #16 0x560d2d1fdf47 in open_table(THD*, TABLE_LIST*, Open_table_context*) sql/sql_base.cc:1980
    #17 0x560d2d207b81 in open_and_process_table sql/sql_base.cc:3779
    #18 0x560d2d20a729 in open_tables(THD*, DDL_options_st const&, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) sql/sql_base.cc:4251
    #19 0x560d2d20fcc1 in open_and_lock_tables(THD*, DDL_options_st const&, TABLE_LIST*, bool, unsigned int, Prelocking_strategy*) sql/sql_base.cc:5155
    #20 0x560d2d16a1bd in open_and_lock_tables(THD*, TABLE_LIST*, bool, unsigned int) sql/sql_base.h:509
    #21 0x560d2d2d6d96 in mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*) sql/sql_insert.cc:756
    #22 0x560d2d39cb4e in mysql_execute_command(THD*) sql/sql_parse.cc:4553
    #23 0x560d2d3b4b26 in mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool) sql/sql_parse.cc:7992
    #24 0x560d2d38b67a in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool) sql/sql_parse.cc:1874
    #25 0x560d2d387ef8 in do_command(THD*) sql/sql_parse.cc:1355
    #26 0x560d2d7b1e51 in do_handle_one_connection(CONNECT*, bool) sql/sql_connect.cc:1411
    #27 0x560d2d7b17af in handle_one_connection sql/sql_connect.cc:1313
    #28 0x560d2e464eee in pfs_spawn_thread storage/perfschema/pfs.cc:2201
    #29 0x20d31b0696da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #30 0x560d2b67c88e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x12188e)
...
SUMMARY: AddressSanitizer: heap-use-after-free storage/innobase/include/rem0rec.h:734 in rec_is_metadata(unsigned char const*, dict_index_t const&)
...
Query (0x62b000093238): INSERT INTO t4 (col1,col2, col_int, col_string, col_text) VALUES /* 1 */ (1,1,1,REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), 10),REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), @fill_amount) ), (1,1,1,REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), 10),REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), @fill_amount) )
Connection ID (thread ID): 133
Status: NOT_KILLED
 
RQG
-------
git clone https://github.com/mleich1/rqg --branch experimental RQG_mleich
origin/experimental 5c63068c24fa6d687422f4d26490b067ff6535e4 2020-05-28T13:50:30+02:00
 
perl rqg.pl \
--mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
--gendata=conf/mariadb/table_stress.zz \
--gendata_sql=conf/mariadb/table_stress.sql \
--engine=Innodb \
--reporters=Deadlock1,ErrorLog,Backtrace \
--mysqld=--connect_timeout=60 \
--mysqld=--net_read_timeout=30 \
--mysqld=--net_write_timeout=60 \
--mysqld=--loose-idle_readonly_transaction_timeout=0 \
--mysqld=--loose-idle_transaction_timeout=0 \
--mysqld=--loose-idle_write_transaction_timeout=0 \
--mysqld=--interactive_timeout=28800 \
--mysqld=--lock_wait_timeout=86400 \
--mysqld=--innodb-lock-wait-timeout=50 \
--mysqld=--loose-table_lock_wait_timeout=50 \
--mysqld=--wait_timeout=28800 \
--mysqld=--slave_net_timeout=60 \
--mysqld=--loose-max-statement-time=30 \
--mysqld=--loose-debug_assert_on_not_freed_memory=0 \
--mysqld=--log-output=none \
--mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
--duration=300 \
--seed=random \
--grammar=conf/mariadb/table_stress_innodb_nocopy.yy \
--threads=9 \
--mysqld=--innodb_stats_persistent=OFF \
--mysqld=--innodb_adaptive_hash_index=ON \
--mysqld=--innodb_page_size=8K \
--mysqld=--innodb-buffer-pool-size=256M \
--duration=300 \
--no_mask \
--workdir=<local settings> \
--vardir=<local settings> \
--mtr-build-thread=<local settings> \
--basedir1=<local settings> \
--basedir2=<local settings> \
--script_debug=_nix_ \
--rr=Server \
--rr_options=--chaos



 Comments   
Comment by Marko Mäkelä [ 2020-06-03 ]

This looks like a race condition in MDEV-22456. Almost concurrently with dropping the last adaptive hash index entry (for a different block), we are executing this function. At the time of the invalid access, block->index for this block is already NULL.

The problem is that we have two threads executing btr_search_drop_page_hash_index() for the same block in buf_page_mtr_lock() at the same time. While this occurred on a MDEV-15053 development branch, I believe that the same bug should be possible starting with 10.2, because we are not necessarily holding exclusive block->lock here in buf_page_mtr_lock():

#ifdef BTR_CUR_HASH_ADAPT
  {
    dict_index_t *index= block->index;
    if (index && index->freed())
      btr_search_drop_page_hash_index(block);
  }
#endif /* BTR_CUR_HASH_ADAPT */

Both threads are with buf_page_mtr_lock(block, RW_S_LATCH, …).

In order for the call to be safe, we should acquire an exclusive latch and re-check block->index after acquiring it.

Comment by Marko Mäkelä [ 2020-06-04 ]

mleich, please test the patch on bb-10.2-marko. With that patch, we are skipping some btr_search_drop_page_hash_index(block) calls that were added in MDEV-22456. It might turn out that those calls are necessary for correctness, although I think that it is very unlikely.

Comment by Matthias Leich [ 2020-06-05 ]

The development trees with the final fix
origin/bb-10.5-MDEV-15053-3 6bdefbed4253c391cdb136fb51f3629be6bf6b32 2020-06-04T15:55:57+03:00
origin/bb-10.2-marko ed6a3608590b5bc755d19a7036f49997f31e3c9f 2020-06-04T12:29:32+03:00
performed well in RQG testing.
The ASAN failure mentioned above was not replayed again and there were also no
new unknown failures.

Comment by Manjot Singh (Inactive) [ 2020-06-22 ]

As a correction to Marko Mäkelä @ 2020-06-03

This looks like a race condition in MDEV-22546.

The correction is:

This looks like a race condition in MDEV-22456.

Generated at Thu Feb 08 09:17:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.