Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35472

Server crash in ha_storage_put_memlim upon reading from INNODB_LOCKS

Details

    Description

      Note:
      The test case is non-deterministic, please run with --repeat=N. It currently fails for me locally in 1-2 attempts, but it is strangely sensitive to a variety of conditions: it only fails on my ASAN builds, even though on different ones (different revisions, different set of features), and the failure seems too depend on nearly every symbol in the test case – I can't rename the databases or the tables, or remove extra spaces, or even change the text in the comments to different text of the same length. So I expect that reproducing it may turn problematic, please let me know if that's the case, and I'll try to set it up on some lab machine.

      --source include/have_sequence.inc
      --source include/have_innodb.inc
       
      CREATE DATABASE oltp_db;
      CREATE TABLE oltp_db.oltp4 (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB;
      INSERT INTO oltp_db.oltp4 SELECT seq FROM seq_1_to_100;
      CREATE DATABASE advanced_db;
      CREATE TABLE advanced_db.t1 (id INT AUTO_INCREMENT PRIMARY KEY) ENGINE=InnoDB;
      INSERT INTO advanced_db.t1 SELECT seq FROM seq_1_to_100;
       
      --connect (con1,localhost,root,,)
      --send
      	/* WRK-8 QNO 2425 */  UPDATE IGNORE `oltp_db`.`oltp4` SET `id` = 'w' ORDER BY `id` LIMIT 1;
      --connect (con2,localhost,root,,)
      --send
      	/* WRK-2 QNO 1613 */  UPDATE IGNORE `advanced_db`.`t1` SET `id` = NULL ORDER BY `id` LIMIT 4;
      --connection default
      	SELECT * FROM information_schema.INNODB_LOCKS;
       
      # Cleanup
      --connection con1
      --reap
      --disconnect con1
      --connection con2
      --reap
      --disconnect con2
      --connection default
      DROP DATABASE advanced_db;
      DROP DATABASE oltp_db;
      

      10.6 bcbeef677227995f329b251dbaf8d0cc40e07e79

      #3  <signal handler called>
      #4  0x0000556084053fc5 in ha_storage_put_memlim (storage=0x61a000007320, data=0x7fa8825611c0, data_len=91, memlim=16621568) at /data/bld/10.6-asan/storage/innobase/ha/ha0storage.cc:54
      #5  0x0000556083d1b7d1 in fill_trx_row (row=0x7fa87f7e78b0, trx=0x7fa88911bf40, requested_lock_row=0x0, cache=0x556087416b20 <trx_i_s_cache_static>) at /data/bld/10.6-asan/storage/innobase/trx/trx0i_s.cc:433
      #6  0x0000556083d1eed7 in fetch_data_into_cache_low (cache=0x556087416b20 <trx_i_s_cache_static>, trx=0x7fa88911bf40) at /data/bld/10.6-asan/storage/innobase/trx/trx0i_s.cc:1030
      #7  0x0000556083d1f146 in operator() (__closure=0x7fa882561850, trx=...) at /data/bld/10.6-asan/storage/innobase/trx/trx0i_s.cc:1058
      #8  0x0000556083d1fd1d in thread_safe_trx_ilist_t::for_each<fetch_data_into_cache(trx_i_s_cache_t*)::<lambda(trx_t&)> >(struct {...} &&) (this=0x5560866b3d40 <trx_sys+128>, callback=...) at /data/bld/10.6-asan/storage/innobase/include/trx0sys.h:839
      #9  0x0000556083d1f252 in fetch_data_into_cache (cache=0x556087416b20 <trx_i_s_cache_static>) at /data/bld/10.6-asan/storage/innobase/trx/trx0i_s.cc:1052
      #10 0x0000556083d1f374 in trx_i_s_possibly_fetch_data_into_cache (cache=0x556087416b20 <trx_i_s_cache_static>) at /data/bld/10.6-asan/storage/innobase/trx/trx0i_s.cc:1081
      #11 0x00005560838e3bf1 in trx_i_s_common_fill_table (thd=0x62b0000bd218, tables=0x62b0000c4998) at /data/bld/10.6-asan/storage/innobase/handler/i_s.cc:191
      #12 0x00005560838e6686 in fill_innodb_locks_from_cache (thd=0x62b0000bd218, tables=0x62b0000c4998) at /data/bld/10.6-asan/storage/innobase/handler/i_s.cc:664
      #13 0x00005560826a8234 in get_schema_tables_result (join=0x62b0000c5f70, executed_place=PROCESSED_BY_JOIN_EXEC) at /data/bld/10.6-asan/sql/sql_show.cc:9036
      #14 0x000055608256ed51 in JOIN::exec_inner (this=0x62b0000c5f70) at /data/bld/10.6-asan/sql/sql_select.cc:4893
      #15 0x000055608256cad2 in JOIN::exec (this=0x62b0000c5f70) at /data/bld/10.6-asan/sql/sql_select.cc:4714
      #16 0x0000556082570f51 in mysql_select (thd=0x62b0000bd218, tables=0x62b0000c4998, fields=..., conds=0x0, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=2684619520, result=0x62b0000c5f40, unit=0x62b0000c1608, select_lex=0x62b0000c4300) at /data/bld/10.6-asan/sql/sql_select.cc:5193
      #17 0x000055608253faae in handle_select (thd=0x62b0000bd218, lex=0x62b0000c1540, result=0x62b0000c5f40, setup_tables_done_option=0) at /data/bld/10.6-asan/sql/sql_select.cc:573
      #18 0x00005560824a10e0 in execute_sqlcom_select (thd=0x62b0000bd218, all_tables=0x62b0000c4998) at /data/bld/10.6-asan/sql/sql_parse.cc:6407
      #19 0x000055608248f239 in mysql_execute_command (thd=0x62b0000bd218, is_called_from_prepared_stmt=false) at /data/bld/10.6-asan/sql/sql_parse.cc:3999
      #20 0x00005560824abf16 in mysql_parse (thd=0x62b0000bd218, rawbuf=0x62b0000c4238 "SELECT * FROM information_schema.INNODB_LOCKS", length=45, parser_state=0x7fa882563a90) at /data/bld/10.6-asan/sql/sql_parse.cc:8194
      #21 0x0000556082481425 in dispatch_command (command=COM_QUERY, thd=0x62b0000bd218, packet=0x629000276219 "SELECT * FROM information_schema.INNODB_LOCKS", packet_length=45, blocking=true) at /data/bld/10.6-asan/sql/sql_parse.cc:1908
      #22 0x000055608247e159 in do_command (thd=0x62b0000bd218, blocking=true) at /data/bld/10.6-asan/sql/sql_parse.cc:1421
      #23 0x00005560828fe9d1 in do_handle_one_connection (connect=0x608000014738, put_in_cache=true) at /data/bld/10.6-asan/sql/sql_connect.cc:1407
      #24 0x00005560828fe530 in handle_one_connection (arg=0x6080000146b8) at /data/bld/10.6-asan/sql/sql_connect.cc:1319
      #25 0x00005560835759b4 in pfs_spawn_thread (arg=0x617000008218) at /data/bld/10.6-asan/storage/perfschema/pfs.cc:2201
      #26 0x00007fa8910a8044 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
      #27 0x00007fa89112861c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
      

      Examples of build options where the failure occurs:

      cmake . -DPLUGIN_TOKUDB=NO -DPLUGIN_COLUMNSTORE=NO -DPLUGIN_XPAND=NO -DWITH_EMBEDDED_SERVER=YES -DCMAKE_BUILD_TYPE=Debug -DWITH_ASAN=YES -DMYSQL_MAINTAINER_MODE=OFF -DCMAKE_C_FLAGS=-fno-omit-frame-pointer -DCMAKE_CXX_FLAGS=-fno-omit-frame-pointer -DWITH_SAFEMALLOC=OFF
      

      cmake . -DPLUGIN_TOKUDB=NO -DPLUGIN_COLUMNSTORE=NO -DPLUGIN_XPAND=NO -DPLUGIN_ROCKSDB=NO -DPLUGIN_SPHINX=NO -DPLUGIN_SPIDER=NO -DPLUGIN_MROONGA=NO -DPLUGIN_FEDERATEDX=NO -DPLUGIN_CONNECT=NO -DPLUGIN_FEDERATED=NO -DPLUGIN_OQGRAPH=NO -DWITH_EMBEDDED_SERVER=NO -DWITH_MARIABACKUP=OFF -DWITH_UNIT_TESTS=0 -DCMAKE_BUILD_TYPE=Debug -DWITH_ASAN=YES -DMYSQL_MAINTAINER_MODE=OFF -DCMAKE_C_FLAGS=-fno-omit-frame-pointer -DCMAKE_CXX_FLAGS=-fno-omit-frame-pointer -DWITH_SAFEMALLOC=OFF
      

      The failure started happening after this commit in 10.6:

      commit ccb6cd8053ace40d203d435941c6b609d7818cf5 (HEAD)
      Author: Marko Mäkelä
      Date:   Tue Nov 12 12:17:34 2024 +0200
       
          MDEV-35189: Updating cache for INNODB_LOCKS et al is suboptimal
      

      Attachments

        Issue Links

          Activity

            It appears that the failure stopped happening after this commit

            commit df3855a4711ac6b71afd3c8aef25e2d00fc97111
            Author: Marko Mäkelä
            Date:   Thu Nov 21 08:59:31 2024 +0200
             
                MDEV-35247: ut_hash_ulint() is a waste
            

            but I have no information so far whether it has disappeared completely or just this test case stopped failing (due to its suspicious sensitivity I suppose it can happen).

            elenst Elena Stepanova added a comment - It appears that the failure stopped happening after this commit commit df3855a4711ac6b71afd3c8aef25e2d00fc97111 Author: Marko Mäkelä Date: Thu Nov 21 08:59:31 2024 +0200   MDEV-35247: ut_hash_ulint() is a waste but I have no information so far whether it has disappeared completely or just this test case stopped failing (due to its suspicious sensitivity I suppose it can happen).

            I don’t have GCC 12.2.0 but 14.2.0, but luckily I can reproduce the problem in a debug+ASAN build.

            marko Marko Mäkelä added a comment - I don’t have GCC 12.2.0 but 14.2.0, but luckily I can reproduce the problem in a debug+ASAN build.

            The malloc() interceptor in AddressSanitizer is filling the block with 0xbe bytes. That is why we get an invalid pointer. I forgot to zero-initialize the pointer:

            diff --git a/storage/innobase/ha/ha0storage.cc b/storage/innobase/ha/ha0storage.cc
            index 4a99d5f6e69..ef7cc78a4dd 100644
            --- a/storage/innobase/ha/ha0storage.cc
            +++ b/storage/innobase/ha/ha0storage.cc
            @@ -67,6 +67,7 @@ ha_storage_put_memlim(
                 (mem_heap_alloc(storage->heap, sizeof *node + data_len));
               node->data_len= data_len;
               node->data= &node[1];
            +  node->next= nullptr;
               memcpy(const_cast<void*>(node->data), data, data_len);
               *after= node;
               return node->data;
            

            With this applied on top of a9b0a1c5d01eb299d195707a3307fbab58c8d804, the test passed 5×100 times.

            marko Marko Mäkelä added a comment - The malloc() interceptor in AddressSanitizer is filling the block with 0xbe bytes. That is why we get an invalid pointer. I forgot to zero-initialize the pointer: diff --git a/storage/innobase/ha/ha0storage.cc b/storage/innobase/ha/ha0storage.cc index 4a99d5f6e69..ef7cc78a4dd 100644 --- a/storage/innobase/ha/ha0storage.cc +++ b/storage/innobase/ha/ha0storage.cc @@ -67,6 +67,7 @@ ha_storage_put_memlim( (mem_heap_alloc(storage->heap, sizeof *node + data_len)); node->data_len= data_len; node->data= &node[1]; + node->next= nullptr; memcpy(const_cast<void*>(node->data), data, data_len); *after= node; return node->data; With this applied on top of a9b0a1c5d01eb299d195707a3307fbab58c8d804, the test passed 5×100 times.

            The fix Looks correct to me. Please go ahead.

            The initialization of the next pointer is missed when we got rid of HASH_INSERT abstraction. I think it is good to check all other places where HASH_INSERT was removed earlier and check for similar issue.

            HASH_INSERT(
            		ha_storage_node_t,	/* type used in the hash chain */
            		next,			/* node->"next" */
             
            #define HASH_INSERT(TYPE, NAME, TABLE, HASH_VALUE, DATA)                    \
              do {                                                                      \
            ...
                (DATA)->NAME = NULL; 
            

            debarun Debarun Banerjee added a comment - The fix Looks correct to me. Please go ahead. The initialization of the next pointer is missed when we got rid of HASH_INSERT abstraction. I think it is good to check all other places where HASH_INSERT was removed earlier and check for similar issue. HASH_INSERT( ha_storage_node_t, /* type used in the hash chain */ next, /* node->"next" */   #define HASH_INSERT(TYPE, NAME, TABLE, HASH_VALUE, DATA) \ do { \ ... (DATA)->NAME = NULL;

            Other previous invocations of HASH_INSERT are as follows:

            • buf_buddy_block_register(): switched to hash_cell_t::append(), which resets the pointer to next element.
            • dict_sys_t::add(): asserts ut_ad(!table->name_hash) and ut_ad(!table->id_hash), which are satisfied by dict_table_t::create(), which precedes each call to dict_table_t::add_to_cache(), which is the only invoker of this
            • dict_table_rename_in_cache(): asserts ut_ad(!table->name_hash), which is satisfied by a previous call to hash_cell_t::remove()
            • dict_table_change_id_in_cache(): moved to row_discard_tablespace(), which invokes hash_cell_t::append()
            • dict_sys_t::resize(): invokes hash_cell_t::append()
            • fil_space_t::create(): uses ut_zalloc_nokey(), in a way that is potentially prone to GCC -flifetime-dse (MDEV-18749, MDEV-34625) because a non-trivial default fil_space_t::fil_space_t() constructor may exist due to fil_space_t::latch

            The fil_space_t::create() looks like a possible time bomb, depending on the compilation options (such as SUX_LOCK_GENERIC). I had added a source code comment about that already:

            	/* FIXME: if calloc() is defined as an inline function that calls
            	memset() or bzero(), then GCC 6 -flifetime-dse can optimize it away */
            	*after = space = new (ut_zalloc_nokey(sizeof(*space))) fil_space_t;
            

            I filed MDEV-35494 for fixing that.

            marko Marko Mäkelä added a comment - Other previous invocations of HASH_INSERT are as follows: buf_buddy_block_register() : switched to hash_cell_t::append() , which resets the pointer to next element. dict_sys_t::add() : asserts ut_ad(!table->name_hash) and ut_ad(!table->id_hash) , which are satisfied by dict_table_t::create() , which precedes each call to dict_table_t::add_to_cache() , which is the only invoker of this dict_table_rename_in_cache() : asserts ut_ad(!table->name_hash) , which is satisfied by a previous call to hash_cell_t::remove() dict_table_change_id_in_cache() : moved to row_discard_tablespace() , which invokes hash_cell_t::append() dict_sys_t::resize() : invokes hash_cell_t::append() fil_space_t::create() : uses ut_zalloc_nokey() , in a way that is potentially prone to GCC -flifetime-dse ( MDEV-18749 , MDEV-34625 ) because a non-trivial default fil_space_t::fil_space_t() constructor may exist due to fil_space_t::latch The fil_space_t::create() looks like a possible time bomb, depending on the compilation options (such as SUX_LOCK_GENERIC ). I had added a source code comment about that already: /* FIXME: if calloc() is defined as an inline function that calls memset() or bzero(), then GCC 6 -flifetime-dse can optimize it away */ *after = space = new (ut_zalloc_nokey( sizeof (*space))) fil_space_t; I filed MDEV-35494 for fixing that.

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.