Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-13051

MySQL#86607 InnoDB crash after failed ADD INDEX and table_definition_cache eviction

Details

    Description

      Re-filing the upstream bug on behalf of the original reporter Jianwei Zhao.

      Sleep injection

      diff --git a/storage/innobase/handler/ha_innodb.cc b/storage/innobase/handler/ha_innodb.cc
      index 41c767a..bfd7102 100644
      --- a/storage/innobase/handler/ha_innodb.cc
      +++ b/storage/innobase/handler/ha_innodb.cc
      @@ -6779,6 +6779,7 @@ no_commit:
       
       		build_template(true);
       	}
      +	os_thread_sleep(5000000);
       
       	innobase_srv_conc_enter_innodb(prebuilt->trx);
       
      diff --git a/storage/innobase/handler/handler0alter.cc b/storage/innobase/handler/handler0alter.cc
      index e772208..dea7696 100644
      --- a/storage/innobase/handler/handler0alter.cc
      +++ b/storage/innobase/handler/handler0alter.cc
      @@ -4138,6 +4138,7 @@ rollback_inplace_alter_table(
       		(almost) nothing has been or needs to be done. */
       		goto func_exit;
       	}
      +	os_thread_sleep(2000000);
       
       	row_mysql_lock_data_dictionary(ctx->trx);
      

      Test case

      --source include/have_innodb.inc
       
      connect(con1,localhost,root,,);
      connect(con2,localhost,root,,);
      let $MYSQLD_DATADIR= `SELECT @@datadir`;
       
      connection con1;
      set global table_definition_cache=400;
      set global table_open_cache= 1024;
       
      use test;
      create table t_bug(id int primary key, col1 int) engine=innodb;
      insert into t_bug values(1, 2);
      insert into t_bug values(2, 2);
      commit;
       
      ####
      #Sleep 2s to wait INSERT statement increment table->n_ref_count.
      ####
      --send
      alter table t_bug add unique key(col1);
       
      connection con2;
      use test;
      select * from t_bug;
       
      ####
      #Sleep 5s to wait ALTER statement complete.
      ####
      insert into t_bug values(3, 2);
      commit;
       
      ####
      #So ALTER statement failed, left table->drop_abort=true;
      ####
      connection con1;
      --error 1062
      --reap
       
      disconnect con2;
       
      connection con1;
      select sleep(10);
       
      ####
      #Flush tables, so t_bug table can be evicted by master thread.
      ####
      flush tables;
       
      ####
      #Create 1000 tables, try the best to evict t_bug since table_def_size.
      ####
      --disable_query_log
      let $loop=1000;
      while($loop)
      {
        eval create table t_$loop(id int)engine=innodb;
        dec $loop;
      }
       
      select sleep(10);
       
      let $loop=1000;
      while($loop)
      {
        eval drop table t_$loop;
        dec $loop;
      }
       
      --enable_query_log
      set global table_definition_cache=default;
      set global table_open_cache= default;
      drop table t_bug;
       
      --exec grep 't_bug' $MYSQLD_DATADIR/../../log/mysqld.1.err
       
      disconnect con1;
      

      10.0 449a88e1c629b without injection

      #2  0x0000000000836793 in handle_fatal_signal (sig=11) at /data/src/10.0/sql/signal_handler.cc:285
      #3  <signal handler called>
      #4  0x00007f1d9802485b in dict_index_is_clust (index=0x0) at /data/src/10.0/storage/innobase/include/dict0dict.ic:269
      #5  0x00007f1d9802e60c in row_merge_drop_indexes (trx=0x7f1d8fc4aa78, table=0x7f1d8b8300f8, locked=1) at /data/src/10.0/storage/innobase/row/row0merge.cc:2900
      #6  0x00007f1d97ec48e8 in dict_table_remove_from_cache_low (table=0x7f1d8b8300f8, lru_evict=1) at /data/src/10.0/storage/innobase/dict/dict0dict.cc:2146
      #7  0x00007f1d97ec2749 in dict_make_room_in_cache (max_tables=400, pct_check=50) at /data/src/10.0/storage/innobase/dict/dict0dict.cc:1485
      #8  0x00007f1d98088530 in srv_master_evict_from_table_cache (pct_check=50) at /data/src/10.0/storage/innobase/srv/srv0srv.cc:2078
      #9  0x00007f1d98088966 in srv_master_do_active_tasks () at /data/src/10.0/storage/innobase/srv/srv0srv.cc:2200
      #10 0x00007f1d980891e7 in srv_master_thread (arg=0x0) at /data/src/10.0/storage/innobase/srv/srv0srv.cc:2387
      #11 0x00007f1da13c3494 in start_thread (arg=0x7f1d917fd700) at pthread_create.c:333
      #12 0x00007f1d9f50e93f in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      10.1 112b21da37d without injection

      #3  <signal handler called>
      #4  0x00007f254a3ebceb in dict_index_is_clust (index=0x0) at /data/src/10.1-bug3/storage/innobase/include/dict0dict.ic:270
      #5  0x00007f254a3f6976 in row_merge_drop_indexes (trx=0x7f2550493278, table=0x7f25496e78f8, locked=1) at /data/src/10.1-bug3/storage/innobase/row/row0merge.cc:3184
      #6  0x00007f254a2615a9 in dict_table_remove_from_cache_low (table=0x7f25496e78f8, lru_evict=1) at /data/src/10.1-bug3/storage/innobase/dict/dict0dict.cc:2170
      #7  0x00007f254a25f305 in dict_make_room_in_cache (max_tables=400, pct_check=50) at /data/src/10.1-bug3/storage/innobase/dict/dict0dict.cc:1509
      #12 0x00007f2550d7193f in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      10.2 2d8fdfbde5d5d32 without injection

      #2  0x000055a81a2abd1e in handle_fatal_signal (sig=11) at /data/src/10.2/sql/signal_handler.cc:299
      #3  <signal handler called>
      #4  0x000055a81a6e4b89 in dict_index_is_clust (index=0x0) at /data/src/10.2/storage/innobase/include/dict0dict.ic:298
      #5  0x000055a81a6f25bc in row_merge_drop_indexes (trx=0x7efdd4e17cc0, table=0x7efd7802de08, locked=1) at /data/src/10.2/storage/innobase/row/row0merge.cc:3801
      #6  0x000055a81a87f34e in dict_table_remove_from_cache_low (table=0x7efd7802de08, lru_evict=1) at /data/src/10.2/storage/innobase/dict/dict0dict.cc:2119
      #7  0x000055a81a87d151 in dict_make_room_in_cache (max_tables=400, pct_check=100) at /data/src/10.2/storage/innobase/dict/dict0dict.cc:1490
      #8  0x000055a81a76d09e in srv_master_evict_from_table_cache (pct_check=100) at /data/src/10.2/storage/innobase/srv/srv0srv.cc:2159
      #9  0x000055a81a76db2c in srv_master_do_idle_tasks () at /data/src/10.2/storage/innobase/srv/srv0srv.cc:2398
      #10 0x000055a81a76e0ca in srv_master_thread (arg=0x0) at /data/src/10.2/storage/innobase/srv/srv0srv.cc:2523
      #11 0x00007efddcdd6494 in start_thread (arg=0x7efdc27fc700) at pthread_create.c:333
      #12 0x00007efddaf4e93f in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      10.1 112b21da37d with injection

      #2  0x000055ba7b38b34a in handle_fatal_signal (sig=11) at /data/src/10.1-bug3/sql/signal_handler.cc:296
      #3  <signal handler called>
      #4  0x00007feb84febcff in dict_index_is_clust (index=0x0) at /data/src/10.1-bug3/storage/innobase/include/dict0dict.ic:270
      #5  0x00007feb84ff698a in row_merge_drop_indexes (trx=0x7feb8b093278, table=0x7feb842e78f8, locked=1) at /data/src/10.1-bug3/storage/innobase/row/row0merge.cc:3184
      #6  0x00007feb84e615a9 in dict_table_remove_from_cache_low (table=0x7feb842e78f8, lru_evict=1) at /data/src/10.1-bug3/storage/innobase/dict/dict0dict.cc:2170
      #7  0x00007feb84e5f305 in dict_make_room_in_cache (max_tables=400, pct_check=50) at /data/src/10.1-bug3/storage/innobase/dict/dict0dict.cc:1509
      #8  0x00007feb85054c7c in srv_master_evict_from_table_cache (pct_check=50) at /data/src/10.1-bug3/storage/innobase/srv/srv0srv.cc:2176
      #9  0x00007feb850550ba in srv_master_do_active_tasks () at /data/src/10.1-bug3/storage/innobase/srv/srv0srv.cc:2299
      #10 0x00007feb85055a95 in srv_master_thread (arg=0x0) at /data/src/10.1-bug3/storage/innobase/srv/srv0srv.cc:2491
      #11 0x00007feb8db21494 in start_thread (arg=0x7feb7c7fd700) at pthread_create.c:333
      #12 0x00007feb8bc6c93f in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova added a comment - - edited

            It fails for me easily with and without injection (maybe my machine is slow enough not to need an extra sleep). So, I recommend trying without injection first, and if it doesn't work, then apply it.

            On MySQL 5.7 (also without the injection) I'm getting the assertion failure:

            #5  0x00007f7ba988b3fa in abort () from /lib/x86_64-linux-gnu/libc.so.6
            #6  0x0000000001be985b in ut_dbg_assertion_failed (expr=0x21ba482 "index", file=0x21ba3f0 "/data/src/mysql-5.7/storage/innobase/include/dict0dict.ic", line=304) at /data/src/mysql-5.7/storage/innobase/ut/ut0dbg.cc:67
            #7  0x0000000001afc67f in dict_index_is_clust (index=0x0) at /data/src/mysql-5.7/storage/innobase/include/dict0dict.ic:304
            #8  0x0000000001b081cb in row_merge_drop_indexes (trx=0x7f7ba4e06150, table=0x7f7b580103c0, locked=1) at /data/src/mysql-5.7/storage/innobase/row/row0merge.cc:3495
            #9  0x0000000001c8b1d1 in dict_table_remove_from_cache_low (table=0x7f7b580103c0, lru_evict=1) at /data/src/mysql-5.7/storage/innobase/dict/dict0dict.cc:2205
            #10 0x0000000001c88eda in dict_make_room_in_cache (max_tables=400, pct_check=50) at /data/src/mysql-5.7/storage/innobase/dict/dict0dict.cc:1563
            #11 0x0000000001b83936 in srv_master_evict_from_table_cache (pct_check=50) at /data/src/mysql-5.7/storage/innobase/srv/srv0srv.cc:1973
            #12 0x0000000001b83eb4 in srv_master_do_active_tasks () at /data/src/mysql-5.7/storage/innobase/srv/srv0srv.cc:2141
            #13 0x0000000001b8479f in srv_master_thread (arg=0x0) at /data/src/mysql-5.7/storage/innobase/srv/srv0srv.cc:2369
            #14 0x00007f7baaee9494 in start_thread (arg=0x7f7b99cc1700) at pthread_create.c:333
            #15 0x00007f7ba993f93f in clone () from /lib/x86_64-linux-gnu/libc.so.6
            

            On MariaDB so far it has always been SIGSEGV.

            elenst Elena Stepanova added a comment - - edited It fails for me easily with and without injection (maybe my machine is slow enough not to need an extra sleep). So, I recommend trying without injection first, and if it doesn't work, then apply it. On MySQL 5.7 (also without the injection) I'm getting the assertion failure: #5 0x00007f7ba988b3fa in abort () from /lib/x86_64-linux-gnu/libc.so.6 #6 0x0000000001be985b in ut_dbg_assertion_failed (expr=0x21ba482 "index", file=0x21ba3f0 "/data/src/mysql-5.7/storage/innobase/include/dict0dict.ic", line=304) at /data/src/mysql-5.7/storage/innobase/ut/ut0dbg.cc:67 #7 0x0000000001afc67f in dict_index_is_clust (index=0x0) at /data/src/mysql-5.7/storage/innobase/include/dict0dict.ic:304 #8 0x0000000001b081cb in row_merge_drop_indexes (trx=0x7f7ba4e06150, table=0x7f7b580103c0, locked=1) at /data/src/mysql-5.7/storage/innobase/row/row0merge.cc:3495 #9 0x0000000001c8b1d1 in dict_table_remove_from_cache_low (table=0x7f7b580103c0, lru_evict=1) at /data/src/mysql-5.7/storage/innobase/dict/dict0dict.cc:2205 #10 0x0000000001c88eda in dict_make_room_in_cache (max_tables=400, pct_check=50) at /data/src/mysql-5.7/storage/innobase/dict/dict0dict.cc:1563 #11 0x0000000001b83936 in srv_master_evict_from_table_cache (pct_check=50) at /data/src/mysql-5.7/storage/innobase/srv/srv0srv.cc:1973 #12 0x0000000001b83eb4 in srv_master_do_active_tasks () at /data/src/mysql-5.7/storage/innobase/srv/srv0srv.cc:2141 #13 0x0000000001b8479f in srv_master_thread (arg=0x0) at /data/src/mysql-5.7/storage/innobase/srv/srv0srv.cc:2369 #14 0x00007f7baaee9494 in start_thread (arg=0x7f7b99cc1700) at pthread_create.c:333 #15 0x00007f7ba993f93f in clone () from /lib/x86_64-linux-gnu/libc.so.6 On MariaDB so far it has always been SIGSEGV.
            jianwei zhao jianwei zhao added a comment -

            Upload the fixed DIFF file from AliSQL. fix.diff

            jianwei zhao jianwei zhao added a comment - Upload the fixed DIFF file from AliSQL. fix.diff

            jianwei zhao, thank you for reporting the bug that was introduced by me in MySQL 5.6.6.
            The uploaded diff seems to revert part of that commit.

            I believe that a simpler fix would be to compare table->n_ref_count to 0 instead of 1, because we already decremented the count, and we are interested in whether the last table handle is being released.

            I will try to rewrite the test case with DEBUG_SYNC, because sleeps can make regression tests nondeterministic and unnecessarily slow.

            marko Marko Mäkelä added a comment - jianwei zhao , thank you for reporting the bug that was introduced by me in MySQL 5.6.6 . The uploaded diff seems to revert part of that commit. I believe that a simpler fix would be to compare table->n_ref_count to 0 instead of 1, because we already decremented the count, and we are interested in whether the last table handle is being released. I will try to rewrite the test case with DEBUG_SYNC, because sleeps can make regression tests nondeterministic and unnecessarily slow.
            jianwei zhao jianwei zhao added a comment -

            Marko, table->n_ref_count == 0 can solve the problem, and looks like that it make more sense.
            I have tried the DEBUG_SYNC, but it didn't work in InnoDB Engine codes, it's ok in the Server layer codes.
            I didn't find the reason, so used os_thread_sleep() instead.

            jianwei zhao jianwei zhao added a comment - Marko, table->n_ref_count == 0 can solve the problem, and looks like that it make more sense. I have tried the DEBUG_SYNC, but it didn't work in InnoDB Engine codes, it's ok in the Server layer codes. I didn't find the reason, so used os_thread_sleep() instead.

            With a patch applied, the test still occasionally fails.
            In all cases (crash or not crash), all indexes of the table (actually, only the clustered index) would be removed here:

            #1  0x0000000000e222b2 in dict_index_remove_from_cache_low (table=0x7fffb400bd18, 
                index=0x7fffb400f038, lru_evict=1) at /mariadb/10.0/storage/xtradb/dict/dict0dict.cc:2735
            #2  0x0000000000e1f8ed in dict_table_remove_from_cache_low (table=0x7fffb400bd18, lru_evict=1)
                at /mariadb/10.0/storage/xtradb/dict/dict0dict.cc:2068
            #3  0x0000000000e1f440 in dict_make_room_in_cache (max_tables=400, pct_check=50)
                at /mariadb/10.0/storage/xtradb/dict/dict0dict.cc:1453
            #4  0x0000000000d41d39 in srv_master_evict_from_table_cache (pct_check=50)
                at /mariadb/10.0/storage/xtradb/srv/srv0srv.cc:2730
            #5  0x0000000000d3efe7 in srv_master_do_active_tasks ()
                at /mariadb/10.0/storage/xtradb/srv/srv0srv.cc:2853
            #6  0x0000000000d3e757 in srv_master_thread (arg=0x0)
                at /mariadb/10.0/storage/xtradb/srv/srv0srv.cc:3062
            

            What makes the difference between crash or no crash is table->drop_aborted. If it is set, we will obviously fail later in dict_table_remove_from_cache_low() when invoking row_merge_drop_indexes() to the drop SYS_INDEXES entries for the orphan indexes. Obviously, because the indexes were just freed by the same function.

            It looks like this has always been broken, and the bug was introduced in WL#5526 online ADD INDEX by me.
            We must use some InnoDB SQL to drop all orphan indexes of the table, instead of invoking row_merge_drop_indexes(). The built-in "trigger" on SYS_INDEXES records will take care of freeing the index trees.

            marko Marko Mäkelä added a comment - With a patch applied, the test still occasionally fails. In all cases (crash or not crash), all indexes of the table (actually, only the clustered index) would be removed here: #1 0x0000000000e222b2 in dict_index_remove_from_cache_low (table=0x7fffb400bd18, index=0x7fffb400f038, lru_evict=1) at /mariadb/10.0/storage/xtradb/dict/dict0dict.cc:2735 #2 0x0000000000e1f8ed in dict_table_remove_from_cache_low (table=0x7fffb400bd18, lru_evict=1) at /mariadb/10.0/storage/xtradb/dict/dict0dict.cc:2068 #3 0x0000000000e1f440 in dict_make_room_in_cache (max_tables=400, pct_check=50) at /mariadb/10.0/storage/xtradb/dict/dict0dict.cc:1453 #4 0x0000000000d41d39 in srv_master_evict_from_table_cache (pct_check=50) at /mariadb/10.0/storage/xtradb/srv/srv0srv.cc:2730 #5 0x0000000000d3efe7 in srv_master_do_active_tasks () at /mariadb/10.0/storage/xtradb/srv/srv0srv.cc:2853 #6 0x0000000000d3e757 in srv_master_thread (arg=0x0) at /mariadb/10.0/storage/xtradb/srv/srv0srv.cc:3062 What makes the difference between crash or no crash is table->drop_aborted. If it is set, we will obviously fail later in dict_table_remove_from_cache_low() when invoking row_merge_drop_indexes() to the drop SYS_INDEXES entries for the orphan indexes. Obviously, because the indexes were just freed by the same function. It looks like this has always been broken, and the bug was introduced in WL#5526 online ADD INDEX by me. We must use some InnoDB SQL to drop all orphan indexes of the table, instead of invoking row_merge_drop_indexes(). The built-in "trigger" on SYS_INDEXES records will take care of freeing the index trees.

            I cannot find any interface where InnoDB would report the UT_LIST_GET_LEN(dict_sys->table_LRU) to the client. It is neither in SHOW ENGINE INNODB STATUS nor in SHOW VARIABLES (export_vars).

            Therefore, it will be impossible to write a deterministic test for this. Only the first part (error during ADD UNIQUE INDEX) can be deterministically triggered by using DEBUG_SYNC.

            marko Marko Mäkelä added a comment - I cannot find any interface where InnoDB would report the UT_LIST_GET_LEN(dict_sys->table_LRU) to the client. It is neither in SHOW ENGINE INNODB STATUS nor in SHOW VARIABLES (export_vars). Therefore, it will be impossible to write a deterministic test for this. Only the first part (error during ADD UNIQUE INDEX) can be deterministically triggered by using DEBUG_SYNC.
            jianwei zhao jianwei zhao added a comment -

            I think the reason of occasional crash is that it released the dict_sys->mutex after drop_aborted is judged TRUE in dict_table_close(). then master thread dict_index_remove_from_cache_low() is prior of dict_table_try_drop_aborted(), so it crashed.

            as the comment described, the indexes should be dropped when dict_table_close(), but occasionally it is dropped by master thread.

            so another fix plan is that hold the dict_sys->mutex if drop_aborted== TRUE until dict_table_try_drop_aborted() complete.

            jianwei zhao jianwei zhao added a comment - I think the reason of occasional crash is that it released the dict_sys->mutex after drop_aborted is judged TRUE in dict_table_close(). then master thread dict_index_remove_from_cache_low() is prior of dict_table_try_drop_aborted(), so it crashed. as the comment described, the indexes should be dropped when dict_table_close(), but occasionally it is dropped by master thread. so another fix plan is that hold the dict_sys->mutex if drop_aborted== TRUE until dict_table_try_drop_aborted() complete.

            jianwei zhao, sorry, I missed your last comment. It looks like I pushed my fix around the same time as you wrote it. I am now revisiting this, because I believe that MDEV-19063 was filed as a duplicate of this bug.

            I solved the second problem by modifying dict_table_remove_from_cache_low(). That part is missing from the AliSQL bug fix.

            marko Marko Mäkelä added a comment - jianwei zhao , sorry, I missed your last comment. It looks like I pushed my fix around the same time as you wrote it. I am now revisiting this, because I believe that MDEV-19063 was filed as a duplicate of this bug. I solved the second problem by modifying dict_table_remove_from_cache_low() . That part is missing from the AliSQL bug fix .

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.