Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33855

MSAN use-of-uninitialized-value in rtr_pcur_getnext_from_path()

Details

    Description

      The following failure is reported by MemorySanitizer on our CI systems every now and then, and I can reproduce it locally as well:

      10.6 73291de74e49a84700ce4e2aa2c7ec6769d884dc

      innodb_gis.rtree_compress '16k,innodb'   w1 [ 3 fail ]
              Test ended at 2024-04-08 17:14:37
       
      CURRENT_TEST: innodb_gis.rtree_compress
      mysqltest: At line 43: query 'rollback' failed: <Unknown> (2013): Lost connection to server during query
      ...
      ==932423==WARNING: MemorySanitizer: use-of-uninitialized-value
          #0 0x5641413d732e in rtr_pcur_getnext_from_path(dtuple_t const*, page_cur_mode_t, btr_cur_t*, unsigned long, unsigned long, bool, mtr_t*) /mariadb/10.6/storage/innobase/gis/gis0sea.cc:292:3
          #1 0x5641413c817e in rtr_search(dtuple_t const*, btr_latch_mode, btr_pcur_t*, mtr_t*) /mariadb/10.6/storage/innobase/gis/gis0sea.cc:1087:8
          #2 0x5641418bbc90 in row_undo_ins_remove_sec_low(btr_latch_mode, dict_index_t*, dtuple_t*, que_thr_t*) /mariadb/10.6/storage/innobase/row/row0uins.cc:283:7
          #3 0x5641418b6c14 in row_undo_ins_remove_sec(dict_index_t*, dtuple_t*, que_thr_t*) /mariadb/10.6/storage/innobase/row/row0uins.cc:353:8
          #4 0x5641418b6c14 in row_undo_ins_remove_sec_rec(undo_node_t*, que_thr_t*) /mariadb/10.6/storage/innobase/row/row0uins.cc:547:10
          #5 0x5641418b6c14 in row_undo_ins(undo_node_t*, que_thr_t*) /mariadb/10.6/storage/innobase/row/row0uins.cc:599:9
          #6 0x5641418b3429 in row_undo(undo_node_t*, que_thr_t*) /mariadb/10.6/storage/innobase/row/row0undo.cc:399:5
          #7 0x5641418b3429 in row_undo_step(que_thr_t*) /mariadb/10.6/storage/innobase/row/row0undo.cc:440:8
          #8 0x5641417021fe in que_thr_step(que_thr_t*) /mariadb/10.6/storage/innobase/que/que0que.cc:586:9
          #9 0x5641417021fe in que_run_threads_low(que_thr_t*) /mariadb/10.6/storage/innobase/que/que0que.cc:644:25
          #10 0x5641417021fe in que_run_threads(que_thr_t*) /mariadb/10.6/storage/innobase/que/que0que.cc:664:2
          #11 0x564141988b70 in trx_t::rollback_low(trx_savept_t*) /mariadb/10.6/storage/innobase/trx/trx0roll.cc:125:5
          #12 0x564141982dcd in trx_rollback_for_mysql_low(trx_t*) /mariadb/10.6/storage/innobase/trx/trx0roll.cc:196:7
          #13 0x564141982dcd in trx_rollback_for_mysql(trx_t*) /mariadb/10.6/storage/innobase/trx/trx0roll.cc
          #14 0x5641410d3d85 in innobase_rollback(handlerton*, THD*, bool) /mariadb/10.6/storage/innobase/handler/ha_innodb.cc:4697:11
          #15 0x56413f9c4faf in ha_rollback_trans(THD*, bool) /mariadb/10.6/sql/handler.cc:2237:17
          #16 0x5641408bdfae in trans_rollback(THD*) /mariadb/10.6/sql/transaction.cc:387:8
          #17 0x5641403aed19 in mysql_execute_command(THD*, bool) /mariadb/10.6/sql/sql_parse.cc:5777:27
          #18 0x564140395f52 in mysql_parse(THD*, char*, unsigned int, Parser_state*) /mariadb/10.6/sql/sql_parse.cc:8139:18
          #19 0x56414038e690 in dispatch_command(enum_server_command, THD*, char*, unsigned int, bool) /mariadb/10.6/sql/sql_parse.cc:1896:7
          #20 0x564140397014 in do_command(THD*, bool) /mariadb/10.6/sql/sql_parse.cc:1409:17
          #21 0x56414087e4f5 in do_handle_one_connection(CONNECT*, bool) /mariadb/10.6/sql/sql_connect.cc:1415:11
          #22 0x56414087df65 in handle_one_connection /mariadb/10.6/sql/sql_connect.cc:1317:5
          #23 0x564140e72c40 in pfs_spawn_thread /mariadb/10.6/storage/perfschema/pfs.cc:2201:3
          #24 0x7f586a6a645b in start_thread nptl/pthread_create.c:444:8
          #25 0x7f586a726bbb in clone3 misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
       
        Uninitialized value was stored to memory at
          #0 0x5641413d7327 in page_is_leaf(unsigned char const*) /mariadb/10.6/storage/innobase/include/page0page.h:263:10
          #1 0x5641413d7327 in rtr_pcur_getnext_from_path(dtuple_t const*, page_cur_mode_t, btr_cur_t*, unsigned long, unsigned long, bool, mtr_t*) /mariadb/10.6/storage/innobase/gis/gis0sea.cc:292:3
       
        Memory was marked as uninitialized
          #0 0x56413f8dc971 in __msan_allocated_memory (/dev/shm/10.6msan/sql/mariadbd+0xfd9971) (BuildId: 963d974e3992c2b5)
          #1 0x5641412af2fa in buf_LRU_block_free_non_file_page(buf_block_t*) /mariadb/10.6/storage/innobase/buf/buf0lru.cc:975:2
      

      A possible reason why earlier versions are not affected could be that MDEV-23484 was fixed in 10.6.

      The problem seems to be inadequate locking in rtr_pcur_getnext_from_path(). I have a patch that passes the following execution on my local system (see MDEV-20377 how to build and run tests with MemorySanitizer):

      LD_LIBRARY_PATH=~/libmsan-18 MSAN_SYMBOLIZER_PATH=~/bin/llvm-symbolizer-msan ./mtr --parallel=auto --repeat=10 innodb_gis.rtree_compress{,,,,,,,,,,}
      

      10.6 73291de74e49a84700ce4e2aa2c7ec6769d884dc with patch

      innodb_gis.rtree_compress '4k,innodb'    w14 [ 10 pass ]  17794
      innodb_gis.rtree_compress '4k,innodb'    w17 [ 10 pass ]  17397
      --------------------------------------------------------------------------
      The servers were restarted 0 times
      Spent 8710.576 of 305 seconds executing testcases
       
      Completed: All 330 tests were successful.
      

      Attachments

        Issue Links

          Activity

            marko Thanks for the change. I think the debug assert removal is correct but AFAICS latching changes are not required.

            We release all the page latches (mtr->rollback_to_savepoint(1)) when rtr_pcur_getnext_from_path() is called. This looks fine as we are either holding X (pessimistic) or SX(optimistic) latch at index level and have all the possible page IDs cached already.

            Only issue is that cursor now points to a possible invalid block "btr_cur->page_cur.block". We eventually set the cursor to the next candidate page after latching it and there doesn't seem to be any real issue other than this debug assert trying to access the block.

            The block can be freed even can be reused concurrently and MSAN looks to be catching it when it is indeed freed. We should not access the block/page at this point and It justifies the removal of the assert.

            debarun Debarun Banerjee added a comment - marko Thanks for the change. I think the debug assert removal is correct but AFAICS latching changes are not required. We release all the page latches (mtr->rollback_to_savepoint(1)) when rtr_pcur_getnext_from_path() is called. This looks fine as we are either holding X (pessimistic) or SX(optimistic) latch at index level and have all the possible page IDs cached already. Only issue is that cursor now points to a possible invalid block "btr_cur->page_cur.block". We eventually set the cursor to the next candidate page after latching it and there doesn't seem to be any real issue other than this debug assert trying to access the block. The block can be freed even can be reused concurrently and MSAN looks to be catching it when it is indeed freed. We should not access the block/page at this point and It justifies the removal of the assert.

            Thank you. I will test your suggestion of simply removing the problematic assertion. It takes some effort to reproduce this. I had no luck with 100×18 runs, but 100×36 did the trick:

            10.6 a8a75ba2d0dd639b2b14337a6c4f495fcaaff7e4

            innodb_gis.rtree_compress '4k,innodb'    w21 [ pass ]  29092
            innodb_gis.rtree_compress '16k,innodb'   w9 [ fail ]
                    Test ended at 2024-04-17 12:09:08
             
            CURRENT_TEST: innodb_gis.rtree_compress
            mysqltest: At line 43: query 'rollback' failed: <Unknown> (2013): Lost connection to server during query
            

            For the record, in 10.5 the corresponding assertion is never enabled:

            #ifdef UNIV_RTR_DEBUG
            		ut_ad(!(rw_lock_own_flagged(&btr_cur->page_cur.block->lock,
            					    RW_LOCK_FLAG_X | RW_LOCK_FLAG_S))
            			|| my_latch_mode == BTR_MODIFY_TREE
            			|| my_latch_mode == BTR_CONT_MODIFY_TREE
            			|| !page_is_leaf(buf_block_get_frame(
            					btr_cur->page_cur.block)));
            #endif /* UNIV_RTR_DEBUG */
            

            marko Marko Mäkelä added a comment - Thank you. I will test your suggestion of simply removing the problematic assertion. It takes some effort to reproduce this. I had no luck with 100×18 runs, but 100×36 did the trick: 10.6 a8a75ba2d0dd639b2b14337a6c4f495fcaaff7e4 innodb_gis.rtree_compress '4k,innodb' w21 [ pass ] 29092 innodb_gis.rtree_compress '16k,innodb' w9 [ fail ] Test ended at 2024-04-17 12:09:08   CURRENT_TEST: innodb_gis.rtree_compress mysqltest: At line 43: query 'rollback' failed: <Unknown> (2013): Lost connection to server during query For the record, in 10.5 the corresponding assertion is never enabled: #ifdef UNIV_RTR_DEBUG ut_ad(!(rw_lock_own_flagged(&btr_cur->page_cur.block->lock, RW_LOCK_FLAG_X | RW_LOCK_FLAG_S)) || my_latch_mode == BTR_MODIFY_TREE || my_latch_mode == BTR_CONT_MODIFY_TREE || !page_is_leaf(buf_block_get_frame( btr_cur->page_cur.block))); #endif /* UNIV_RTR_DEBUG */

            debarun, I tested removing only the debug assertion:

            diff --git a/storage/innobase/gis/gis0sea.cc b/storage/innobase/gis/gis0sea.cc
            index fb4fcf7bc8d..9aedd4af69c 100644
            --- a/storage/innobase/gis/gis0sea.cc
            +++ b/storage/innobase/gis/gis0sea.cc
            @@ -289,10 +289,6 @@ rtr_pcur_getnext_from_path(
             			mtr->rollback_to_savepoint(1);
             		}
             
            -		ut_ad((my_latch_mode | 4) == BTR_CONT_MODIFY_TREE
            -		      || !page_is_leaf(btr_cur_get_page(btr_cur))
            -		      || !btr_cur->page_cur.block->page.lock.have_any());
            -
             		const auto block_savepoint = mtr->get_savepoint();
             		block = buf_page_get_gen(
             			page_id_t(index->table->space_id,
            

            ./mtr --parallel=36 --repeat=10 innodb_gis.rtree_compress{,,,,,}{,}
            

            10.6 a8a75ba2d0dd639b2b14337a6c4f495fcaaff7e4 with the above patch

            innodb_gis.rtree_compress '4k,innodb'    w8 [ 10 pass ]  15098
            innodb_gis.rtree_compress '4k,innodb'    w13 [ 10 pass ]  14981
            --------------------------------------------------------------------------
            The servers were restarted 0 times
            Spent 8694.223 of 280 seconds executing testcases
             
            Completed: All 360 tests were successful.
            

            Correction: also the previous attempt (which reproduced the bug today) was 36 tests in parallel with 10 repeats, not 100. I think that we should just remove this assertion.

            marko Marko Mäkelä added a comment - debarun , I tested removing only the debug assertion: diff --git a/storage/innobase/gis/gis0sea.cc b/storage/innobase/gis/gis0sea.cc index fb4fcf7bc8d..9aedd4af69c 100644 --- a/storage/innobase/gis/gis0sea.cc +++ b/storage/innobase/gis/gis0sea.cc @@ -289,10 +289,6 @@ rtr_pcur_getnext_from_path( mtr->rollback_to_savepoint(1); } - ut_ad((my_latch_mode | 4) == BTR_CONT_MODIFY_TREE - || !page_is_leaf(btr_cur_get_page(btr_cur)) - || !btr_cur->page_cur.block->page.lock.have_any()); - const auto block_savepoint = mtr->get_savepoint(); block = buf_page_get_gen( page_id_t(index->table->space_id, . /mtr --parallel=36 --repeat=10 innodb_gis.rtree_compress{,,,,,}{,} 10.6 a8a75ba2d0dd639b2b14337a6c4f495fcaaff7e4 with the above patch innodb_gis.rtree_compress '4k,innodb' w8 [ 10 pass ] 15098 innodb_gis.rtree_compress '4k,innodb' w13 [ 10 pass ] 14981 -------------------------------------------------------------------------- The servers were restarted 0 times Spent 8694.223 of 280 seconds executing testcases   Completed: All 360 tests were successful. Correction: also the previous attempt (which reproduced the bug today) was 36 tests in parallel with 10 repeats, not 100. I think that we should just remove this assertion.

            A more extensive test with the assertion removed completed as well:

            10.6 a8a75ba2d0dd639b2b14337a6c4f495fcaaff7e4 with the above patch

            innodb_gis.rtree_compress '4k,innodb'    w16 [ 100 pass ]  14319
            innodb_gis.rtree_compress '4k,innodb'    w14 [ 100 pass ]  14230
            innodb_gis.rtree_compress '4k,innodb'    w25 [ 100 pass ]  13470
            --------------------------------------------------------------------------
            The servers were restarted 0 times
            Spent 94105.940 of 2878 seconds executing testcases
             
            Completed: All 3600 tests were successful.
            

            marko Marko Mäkelä added a comment - A more extensive test with the assertion removed completed as well: 10.6 a8a75ba2d0dd639b2b14337a6c4f495fcaaff7e4 with the above patch innodb_gis.rtree_compress '4k,innodb' w16 [ 100 pass ] 14319 innodb_gis.rtree_compress '4k,innodb' w14 [ 100 pass ] 14230 innodb_gis.rtree_compress '4k,innodb' w25 [ 100 pass ] 13470 -------------------------------------------------------------------------- The servers were restarted 0 times Spent 94105.940 of 2878 seconds executing testcases   Completed: All 3600 tests were successful.

            Before MDEV-24142, the bogus debug assertion was never enabled.

            marko Marko Mäkelä added a comment - Before MDEV-24142 , the bogus debug assertion was never enabled.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.