MariaDB Server · MDEV-22790

Race between buf_page_mtr_lock() calls dropping the AHI on the same block

Details

    Description

      ASAN failure hit during RQG test on development source tree
      origin/bb-10.5-MDEV-15053-3 be3c9a23fd014ed80d48e2984dec78db921a7a00 2020-06-02T20:37:37+03:00
       
      ==100160==ERROR: AddressSanitizer: heap-use-after-free on address 0x618000140938 at pc 0x560d2e722c69 bp 0x664073915570 sp 0x664073915560
      READ of size 8 at 0x618000140938 thread T19
          #0 0x560d2e722c68 in rec_is_metadata(unsigned char const*, dict_index_t const&) storage/innobase/include/rem0rec.h:734
          #1 0x560d2ec4b24a in btr_search_drop_page_hash_index(buf_block_t*) storage/innobase/btr/btr0sea.cc:1263
          #2 0x560d2ec74c04 in buf_page_mtr_lock storage/innobase/buf/buf0buf.cc:2946
          #3 0x560d2ec77240 in buf_page_get_low(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool) storage/innobase/buf/buf0buf.cc:3521
          #4 0x560d2ec778b1 in buf_page_get_gen(page_id_t, unsigned long, unsigned long, buf_block_t*, unsigned long, char const*, unsigned int, mtr_t*, dberr_t*, bool) storage/innobase/buf/buf0buf.cc:3594
          #5 0x560d2e72464a in btr_block_get_func(dict_index_t const&, unsigned long, unsigned long, bool, char const*, unsigned int, mtr_t*) storage/innobase/include/btr0btr.h:240
          #6 0x560d2ebfe77b in btr_cur_latch_leaves(buf_block_t*, unsigned long, btr_cur_t*, mtr_t*) storage/innobase/btr/btr0cur.cc:246
          #7 0x560d2ec0f844 in btr_cur_open_at_rnd_pos_func(dict_index_t*, unsigned long, btr_cur_t*, char const*, unsigned int, mtr_t*) storage/innobase/btr/btr0cur.cc:3028
          #8 0x560d2ec273a8 in btr_estimate_number_of_different_key_vals(dict_index_t*) storage/innobase/btr/btr0cur.cc:6683
          #9 0x560d2ed67790 in dict_stats_update_transient_for_index storage/innobase/dict/dict0stats.cc:887
          #10 0x560d2ed67c08 in dict_stats_update_transient storage/innobase/dict/dict0stats.cc:945
          #11 0x560d2ed73228 in dict_stats_update(dict_table_t*, dict_stats_upd_option_t) storage/innobase/dict/dict0stats.cc:3331
          #12 0x560d2e6356da in dict_stats_init storage/innobase/include/dict0stats.ic:165
          #13 0x560d2e64a3f0 in ha_innobase::open(char const*, int, unsigned int) storage/innobase/handler/ha_innodb.cc:5831
          #14 0x560d2db6e0ed in handler::ha_open(TABLE*, char const*, int, unsigned int, st_mem_root*, List<String>*) sql/handler.cc:2906
          #15 0x560d2d6d3bbb in open_table_from_share(THD*, TABLE_SHARE*, st_mysql_const_lex_string const*, unsigned int, unsigned int, unsigned int, TABLE*, bool, List<String>*) sql/table.cc:4190
          #16 0x560d2d1fdf47 in open_table(THD*, TABLE_LIST*, Open_table_context*) sql/sql_base.cc:1980
          #17 0x560d2d207b81 in open_and_process_table sql/sql_base.cc:3779
          #18 0x560d2d20a729 in open_tables(THD*, DDL_options_st const&, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) sql/sql_base.cc:4251
          #19 0x560d2d20fcc1 in open_and_lock_tables(THD*, DDL_options_st const&, TABLE_LIST*, bool, unsigned int, Prelocking_strategy*) sql/sql_base.cc:5155
          #20 0x560d2d16a1bd in open_and_lock_tables(THD*, TABLE_LIST*, bool, unsigned int) sql/sql_base.h:509
          #21 0x560d2d2d6d96 in mysql_insert(THD*, TABLE_LIST*, List<Item>&, List<List<Item> >&, List<Item>&, List<Item>&, enum_duplicates, bool, select_result*) sql/sql_insert.cc:756
          #22 0x560d2d39cb4e in mysql_execute_command(THD*) sql/sql_parse.cc:4553
          #23 0x560d2d3b4b26 in mysql_parse(THD*, char*, unsigned int, Parser_state*, bool, bool) sql/sql_parse.cc:7992
          #24 0x560d2d38b67a in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool, bool) sql/sql_parse.cc:1874
          #25 0x560d2d387ef8 in do_command(THD*) sql/sql_parse.cc:1355
          #26 0x560d2d7b1e51 in do_handle_one_connection(CONNECT*, bool) sql/sql_connect.cc:1411
          #27 0x560d2d7b17af in handle_one_connection sql/sql_connect.cc:1313
          #28 0x560d2e464eee in pfs_spawn_thread storage/perfschema/pfs.cc:2201
          #29 0x20d31b0696da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
          #30 0x560d2b67c88e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x12188e)
      ...
      SUMMARY: AddressSanitizer: heap-use-after-free storage/innobase/include/rem0rec.h:734 in rec_is_metadata(unsigned char const*, dict_index_t const&)
      ...
      Query (0x62b000093238): INSERT INTO t4 (col1,col2, col_int, col_string, col_text) VALUES /* 1 */ (1,1,1,REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), 10),REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), @fill_amount) ), (1,1,1,REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), 10),REPEAT(SUBSTR(CAST( 1 AS CHAR),1,1), @fill_amount) )
      Connection ID (thread ID): 133
      Status: NOT_KILLED
       
      RQG
      -------
      git clone https://github.com/mleich1/rqg --branch experimental RQG_mleich
      origin/experimental 5c63068c24fa6d687422f4d26490b067ff6535e4 2020-05-28T13:50:30+02:00
       
      perl rqg.pl \
      --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
      --gendata=conf/mariadb/table_stress.zz \
      --gendata_sql=conf/mariadb/table_stress.sql \
      --engine=Innodb \
      --reporters=Deadlock1,ErrorLog,Backtrace \
      --mysqld=--connect_timeout=60 \
      --mysqld=--net_read_timeout=30 \
      --mysqld=--net_write_timeout=60 \
      --mysqld=--loose-idle_readonly_transaction_timeout=0 \
      --mysqld=--loose-idle_transaction_timeout=0 \
      --mysqld=--loose-idle_write_transaction_timeout=0 \
      --mysqld=--interactive_timeout=28800 \
      --mysqld=--lock_wait_timeout=86400 \
      --mysqld=--innodb-lock-wait-timeout=50 \
      --mysqld=--loose-table_lock_wait_timeout=50 \
      --mysqld=--wait_timeout=28800 \
      --mysqld=--slave_net_timeout=60 \
      --mysqld=--loose-max-statement-time=30 \
      --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
      --mysqld=--log-output=none \
      --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
      --duration=300 \
      --seed=random \
      --grammar=conf/mariadb/table_stress_innodb_nocopy.yy \
      --threads=9 \
      --mysqld=--innodb_stats_persistent=OFF \
      --mysqld=--innodb_adaptive_hash_index=ON \
      --mysqld=--innodb_page_size=8K \
      --mysqld=--innodb-buffer-pool-size=256M \
      --duration=300 \
      --no_mask \
      --workdir=<local settings> \
      --vardir=<local settings> \
      --mtr-build-thread=<local settings> \
      --basedir1=<local settings> \
      --basedir2=<local settings> \
      --script_debug=_nix_ \
      --rr=Server \
      --rr_options=--chaos
      

          Activity

            marko Marko Mäkelä added a comment - - edited

            This looks like a race condition in MDEV-22456. Almost concurrently with dropping the last adaptive hash index entry (for a different block), we are executing btr_search_drop_page_hash_index() here. At the time of the invalid access, block->index for this block is already NULL.

            The problem is that we have two threads executing btr_search_drop_page_hash_index() for the same block in buf_page_mtr_lock() at the same time. While this occurred on a MDEV-15053 development branch, I believe that the same bug should be possible starting with 10.2, because we are not necessarily holding exclusive block->lock here in buf_page_mtr_lock():

            #ifdef BTR_CUR_HASH_ADAPT
              {
                dict_index_t *index= block->index;
                if (index && index->freed())
                  btr_search_drop_page_hash_index(block);
              }
            #endif /* BTR_CUR_HASH_ADAPT */
            

            Both threads invoked buf_page_mtr_lock(block, RW_S_LATCH, …).

            In order for the call to be safe, we should acquire an exclusive latch and re-check block->index after acquiring it.
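            The exclusive-latch-plus-re-check idea can be sketched outside of InnoDB. The following is a minimal stand-alone model, not the actual fix: MiniBlock, drop_page_hash_index_safe, and drop_count are hypothetical names standing in for buf_block_t, btr_search_drop_page_hash_index(), and the real AHI teardown. It shows why re-checking block->index under an exclusive latch makes the drop happen exactly once, even when several threads pass the initial cheap check concurrently:

            ```cpp
            #include <atomic>
            #include <cassert>
            #include <mutex>
            #include <shared_mutex>
            #include <thread>
            #include <vector>

            // Hypothetical stand-in for buf_block_t: block->lock and block->index.
            struct MiniBlock {
              std::shared_mutex latch;                // models block->lock
              std::atomic<const char*> index{"idx"};  // models block->index
            };

            static std::atomic<int> drop_count{0};

            // Unsafe pattern (as in the reported buf_page_mtr_lock()): check
            // block->index while holding only a shared latch, then drop. Two
            // readers can both pass the check and race inside the drop.
            //
            // Safe pattern modeled below: take the exclusive latch, then
            // re-check block->index before dropping.
            void drop_page_hash_index_safe(MiniBlock& block) {
              if (block.index.load(std::memory_order_acquire) == nullptr)
                return;                               // cheap pre-check; may be stale
              std::unique_lock<std::shared_mutex> x(block.latch);  // exclusive latch
              if (block.index.load(std::memory_order_relaxed) == nullptr)
                return;                               // another thread already dropped it
              block.index.store(nullptr, std::memory_order_release);
              ++drop_count;                           // the actual drop runs once
            }

            int main() {
              MiniBlock block;
              std::vector<std::thread> threads;
              for (int i = 0; i < 8; ++i)
                threads.emplace_back([&] { drop_page_hash_index_safe(block); });
              for (auto& t : threads) t.join();
              assert(drop_count.load() == 1);         // the re-check guarantees a single drop
              assert(block.index.load() == nullptr);
              return 0;
            }
            ```

            Without the second check under the exclusive latch, every thread that saw a non-null index under the shared latch would proceed into the drop, which is exactly the heap-use-after-free pattern in the trace above.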


            mleich, please test the patch on bb-10.2-marko. With that patch, we are skipping some btr_search_drop_page_hash_index(block) calls that were added in MDEV-22456. It might turn out that those calls are necessary for correctness, although I think that it is very unlikely.


            mleich Matthias Leich added a comment

            The development trees with the final fix
            origin/bb-10.5-MDEV-15053-3 6bdefbed4253c391cdb136fb51f3629be6bf6b32 2020-06-04T15:55:57+03:00
            origin/bb-10.2-marko ed6a3608590b5bc755d19a7036f49997f31e3c9f 2020-06-04T12:29:32+03:00
            performed well in RQG testing.
            The ASAN failure mentioned above did not reoccur, and no new unknown failures appeared.


            manjot Manjot Singh (Inactive) added a comment

            As a correction to Marko Mäkelä @ 2020-06-03:

            This looks like a race condition in MDEV-22546.

            The correction is:

            This looks like a race condition in MDEV-22456.


            People

              marko Marko Mäkelä
              mleich Matthias Leich

