Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22867

Assertion `instant.n_core_fields == n_core_fields' failed in dict_index_t::instant_add_field

Details

    Description

      Note: The test case has an obvious race condition, run with --repeat=N. It currently fails nearly every time for me, both in memory and on disk, but it can vary on different machines and builds.

      --source include/have_innodb.inc
       
      CREATE TABLE t1 (a BINARY, UNIQUE(a)) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (NULL);
      ALTER TABLE t1 ADD COLUMN b INT;
       
      --connect (con1,localhost,root,,test)
      --send
        ALTER TABLE t1 DROP a;
       
      --connection default
      DELETE FROM t1;
       
      --connection con1
      --reap
       
      # Cleanup
      --disconnect con1
      --connection default
      DROP TABLE t1;
      

      10.5 17a7bafe

      mariadbd: /data/src/10.5/storage/innobase/handler/handler0alter.cc:422: void dict_index_t::instant_add_field(const dict_index_t&): Assertion `instant.n_core_fields == n_core_fields' failed.
      200610 21:59:38 [ERROR] mysqld got signal 6 ;
       
      #7  0x00007fc6095eaf12 in __GI___assert_fail (assertion=0x55f9bb456db8 "instant.n_core_fields == n_core_fields", file=0x55f9bb456618 "/data/src/10.5/storage/innobase/handler/handler0alter.cc", line=422, function=0x55f9bb45f700 <dict_index_t::instant_add_field(dict_index_t const&)::__PRETTY_FUNCTION__> "void dict_index_t::instant_add_field(const dict_index_t&)") at assert.c:101
      #8  0x000055f9babac041 in dict_index_t::instant_add_field (this=0x7fc5d421efc8, instant=...) at /data/src/10.5/storage/innobase/handler/handler0alter.cc:422
      #9  0x000055f9babae111 in dict_table_t::instant_column (this=0x7fc5d40faf58, table=..., col_map=0x7fc5c80230f8) at /data/src/10.5/storage/innobase/handler/handler0alter.cc:604
      #10 0x000055f9babb0509 in ha_innobase_inplace_ctx::instant_column (this=0x7fc5c8013950) at /data/src/10.5/storage/innobase/handler/handler0alter.cc:1050
      #11 0x000055f9bab8f737 in innobase_instant_try (ha_alter_info=0x7fc5feb17e80, ctx=0x7fc5c8013950, altered_table=0x7fc5feb17f20, table=0x7fc5c801a8f8, trx=0x7fc5ffc02558) at /data/src/10.5/storage/innobase/handler/handler0alter.cc:5629
      #12 0x000055f9babb3d9a in commit_try_norebuild (ha_alter_info=0x7fc5feb17e80, ctx=0x7fc5c8013950, altered_table=0x7fc5feb17f20, old_table=0x7fc5c801a8f8, trx=0x7fc5ffc02558, table_name=0x7fc5d40fc245 "t1") at /data/src/10.5/storage/innobase/handler/handler0alter.cc:10179
      #13 0x000055f9baba27c3 in ha_innobase::commit_inplace_alter_table (this=0x7fc5c801bbe0, altered_table=0x7fc5feb17f20, ha_alter_info=0x7fc5feb17e80, commit=true) at /data/src/10.5/storage/innobase/handler/handler0alter.cc:10901
      #14 0x000055f9ba6fa051 in handler::ha_commit_inplace_alter_table (this=0x7fc5c801bbe0, altered_table=0x7fc5feb17f20, ha_alter_info=0x7fc5feb17e80, commit=true) at /data/src/10.5/sql/handler.cc:4745
      #15 0x000055f9ba492313 in mysql_inplace_alter_table (thd=0x7fc5c8000b18, table_list=0x7fc5c80125c8, table=0x7fc5c801a8f8, altered_table=0x7fc5feb17f20, ha_alter_info=0x7fc5feb17e80, inplace_supported=HA_ALTER_INPLACE_NOCOPY_NO_LOCK, target_mdl_request=0x7fc5feb18ce0, alter_ctx=0x7fc5feb19830) at /data/src/10.5/sql/sql_table.cc:7947
      #16 0x000055f9ba499b0c in mysql_alter_table (thd=0x7fc5c8000b18, new_db=0x7fc5c80053d8, new_name=0x7fc5c80057e0, create_info=0x7fc5feb1a430, table_list=0x7fc5c80125c8, alter_info=0x7fc5feb1a360, order_num=0, order=0x0, ignore=false, if_exists=false) at /data/src/10.5/sql/sql_table.cc:10468
      #17 0x000055f9ba53effa in Sql_cmd_alter_table::execute (this=0x7fc5c8012cc8, thd=0x7fc5c8000b18) at /data/src/10.5/sql/sql_alter.cc:538
      #18 0x000055f9ba39eef3 in mysql_execute_command (thd=0x7fc5c8000b18) at /data/src/10.5/sql/sql_parse.cc:5950
      #19 0x000055f9ba3a5227 in mysql_parse (thd=0x7fc5c8000b18, rawbuf=0x7fc5c80124f0 "ALTER TABLE t1 DROP a", length=21, parser_state=0x7fc5feb1b520, is_com_multi=false, is_next_command=false) at /data/src/10.5/sql/sql_parse.cc:7992
      #20 0x000055f9ba3915ad in dispatch_command (command=COM_QUERY, thd=0x7fc5c8000b18, packet=0x7fc5c8008939 "ALTER TABLE t1 DROP a", packet_length=21, is_com_multi=false, is_next_command=false) at /data/src/10.5/sql/sql_parse.cc:1875
      #21 0x000055f9ba38fce5 in do_command (thd=0x7fc5c8000b18) at /data/src/10.5/sql/sql_parse.cc:1356
      #22 0x000055f9ba5346cf in do_handle_one_connection (connect=0x55f9bd3abb58, put_in_cache=true) at /data/src/10.5/sql/sql_connect.cc:1411
      #23 0x000055f9ba534437 in handle_one_connection (arg=0x55f9bd3abb58) at /data/src/10.5/sql/sql_connect.cc:1313
      #24 0x000055f9baa6d60e in pfs_spawn_thread (arg=0x55f9bd7121f8) at /data/src/10.5/storage/perfschema/pfs.cc:2201
      #25 0x00007fc60b5734a4 in start_thread (arg=0x7fc5feb1c700) at pthread_create.c:456
      #26 0x00007fc6096a7d0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97
      

      No obvious immediate problem on a non-debug build.
      Couldn't reproduce on 10.4, however the failure seems to have appeared after the merge 7bcaa541aa, which featured this commit:

      commit 474290540a829edcc6a74be8c354053f330bf5de
      Author: Marko Mäkelä
      Date:   Tue May 5 14:20:47 2020 +0300
       
          MDEV-22465: DROP indexed COLUMN is wrongly claimed to be ALGORITHM=INSTANT
      

      Attachments

        Issue Links

          Activity

            This is a race condition between DELETE clearing the ‘instant ADD’ status of the table, and instant DROP COLUMN. During the DELETE, the DROP COLUMN is blocked by MDL, which it is supposed to, because the operation is not ALGORITHM=INSTANT any more. An ALGORITHM=INSTANT operation would be continuously protected by MDL_EXCLUSIVE:

            10.5 c9f262ee0d256368cab4fdb894fd31d5fb669680

            #16 0x000055eacd4b5b18 in MDL_context::upgrade_shared_lock (
                this=this@entry=0x50f538000e68, mdl_ticket=0x50f538007b50, 
                new_type=new_type@entry=MDL_EXCLUSIVE, lock_wait_timeout=86400)
                at /mariadb/10.5m/sql/mdl.cc:2583
            #17 0x000055eacd3182c2 in wait_while_table_is_used (
                thd=thd@entry=0x50f538000d48, table=table@entry=0x50f53801b1e8, 
                function=function@entry=HA_EXTRA_PREPARE_FOR_RENAME)
                at /mariadb/10.5m/sql/sql_base.cc:1308
            #18 0x000055eacd441146 in mysql_inplace_alter_table (
                target_mdl_request=0x7fe9887c9490, alter_ctx=0x7fe9887ca8c0, 
                inplace_supported=HA_ALTER_INPLACE_NOCOPY_NO_LOCK, 
                ha_alter_info=0x7fe9887c8630, altered_table=0x7fe9887c86d0, 
                table=0x50f53801b1e8, table_list=0x50f538012948, thd=0x50f538000d48)
                at /mariadb/10.5m/sql/sql_table.cc:7903
            

            During this wait, the table is converted to plain format by not only the DELETE, but also a purge worker:

            10.5 c9f262ee0d256368cab4fdb894fd31d5fb669680

            #0  dict_index_t::clear_instant_add (this=0x7ad741a1f58) at /mariadb/10.5m/storage/innobase/include/dict0mem.h:2354
            #1  0x000055eacda9c2f1 in btr_cur_optimistic_delete_func (cursor=cursor@entry=0x55eacfb72868, flags=flags@entry=0, mtr=mtr@entry=0x4cf13afbd180) at /mariadb/10.5m/storage/innobase/btr/btr0cur.cc:5482
            #2  0x000055eacd9f1828 in row_purge_remove_clust_if_poss_low (node=0x55eacfb727c8, mode=2) at /mariadb/10.5m/storage/innobase/include/btr0pcur.ic:57
            #3  0x000055eacd9f30b8 in row_purge_remove_clust_if_poss (node=0x55eacfb727c8) at /mariadb/10.5m/storage/innobase/row/row0purge.cc:647
            #4  row_purge_del_mark (node=0x55eacfb727c8) at /mariadb/10.5m/storage/innobase/row/row0purge.cc:647
            

            Based on the above, I created a deterministic test case, but it is only crashing in 10.5, not 10.4. It might be related to MDEV-16678. Nevertheless, I would rather fix this in 10.4 too.

            --source include/have_innodb.inc
            --source include/have_debug.inc
            --source include/have_debug_sync.inc
             
            --echo #
            --echo # MDEV-22867 Assertion instant.n_core_fields == n_core_fields
            --echo # in dict_index_t::instant_add_field
            --echo #
             
            SET @save_freq= @@GLOBAL.innodb_purge_rseg_truncate_frequency;
            SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
             
            CREATE TABLE t1 (a BINARY, UNIQUE(a)) ENGINE=InnoDB;
            INSERT INTO t1 VALUES (NULL);
            ALTER TABLE t1 ADD COLUMN b INT;
             
            --connect (con1,localhost,root,,test)
            SET DEBUG_SYNC='innodb_inplace_alter_table_enter SIGNAL ddl WAIT_FOR purged';
            send ALTER TABLE t1 DROP a;
             
            --connection default
            SET DEBUG_SYNC='now WAIT_FOR ddl';
            DELETE FROM t1;
            --source include/wait_all_purged.inc
            SET DEBUG_SYNC='now SIGNAL purged';
             
            --connection con1
            reap;
            --disconnect con1
            --connection default
             
            DROP TABLE t1;
            SET DEBUG_SYNC=RESET;
             
            SET GLOBAL innodb_purge_rseg_truncate_frequency=@save_freq;
            

            The fix is simple (tested on 10.5):

            diff --git a/storage/innobase/include/dict0mem.h b/storage/innobase/include/dict0mem.h
            index 731ff545685..607da60d11c 100644
            --- a/storage/innobase/include/dict0mem.h
            +++ b/storage/innobase/include/dict0mem.h
            @@ -2354,6 +2354,13 @@ inline void dict_index_t::clear_instant_add()
               DBUG_ASSERT(is_primary());
               DBUG_ASSERT(is_instant());
               DBUG_ASSERT(!table->instant);
            +
            +  /* Do not clear the "instantness" after instant ADD COLUMN if
            +  an instant DROP COLUMN, DROP INDEX is in progress. */
            +  for (dict_index_t *i= this; (i= UT_LIST_GET_NEXT(indexes, i)) != nullptr; )
            +    if (i->to_be_dropped)
            +      return;
            +
               for (unsigned i= n_core_fields; i < n_fields; i++)
                 fields[i].col->clear_instant();
               n_core_fields= n_fields;
            

            I will push it to 10.4 after running ./mtr --big-test on both 10.4 and 10.5.

            marko Marko Mäkelä added a comment - This is a race condition between DELETE clearing the ‘instant ADD’ status of the table, and instant DROP COLUMN . During the DELETE , the DROP COLUMN is blocked by MDL, which it is supposed to, because the operation is not ALGORITHM=INSTANT any more. An ALGORITHM=INSTANT operation would be continuously protected by MDL_EXCLUSIVE : 10.5 c9f262ee0d256368cab4fdb894fd31d5fb669680 #16 0x000055eacd4b5b18 in MDL_context::upgrade_shared_lock ( this=this@entry=0x50f538000e68, mdl_ticket=0x50f538007b50, new_type=new_type@entry=MDL_EXCLUSIVE, lock_wait_timeout=86400) at /mariadb/10.5m/sql/mdl.cc:2583 #17 0x000055eacd3182c2 in wait_while_table_is_used ( thd=thd@entry=0x50f538000d48, table=table@entry=0x50f53801b1e8, function=function@entry=HA_EXTRA_PREPARE_FOR_RENAME) at /mariadb/10.5m/sql/sql_base.cc:1308 #18 0x000055eacd441146 in mysql_inplace_alter_table ( target_mdl_request=0x7fe9887c9490, alter_ctx=0x7fe9887ca8c0, inplace_supported=HA_ALTER_INPLACE_NOCOPY_NO_LOCK, ha_alter_info=0x7fe9887c8630, altered_table=0x7fe9887c86d0, table=0x50f53801b1e8, table_list=0x50f538012948, thd=0x50f538000d48) at /mariadb/10.5m/sql/sql_table.cc:7903 During this wait, the table is converted to plain format by not only the DELETE , but also a purge worker: 10.5 c9f262ee0d256368cab4fdb894fd31d5fb669680 #0 dict_index_t::clear_instant_add (this=0x7ad741a1f58) at /mariadb/10.5m/storage/innobase/include/dict0mem.h:2354 #1 0x000055eacda9c2f1 in btr_cur_optimistic_delete_func (cursor=cursor@entry=0x55eacfb72868, flags=flags@entry=0, mtr=mtr@entry=0x4cf13afbd180) at /mariadb/10.5m/storage/innobase/btr/btr0cur.cc:5482 #2 0x000055eacd9f1828 in row_purge_remove_clust_if_poss_low (node=0x55eacfb727c8, mode=2) at /mariadb/10.5m/storage/innobase/include/btr0pcur.ic:57 #3 0x000055eacd9f30b8 in row_purge_remove_clust_if_poss (node=0x55eacfb727c8) at /mariadb/10.5m/storage/innobase/row/row0purge.cc:647 #4 row_purge_del_mark (node=0x55eacfb727c8) at /mariadb/10.5m/storage/innobase/row/row0purge.cc:647 Based on the above, I created a deterministic test case, but it is only crashing in 10.5, not 10.4. It might be related to MDEV-16678 . Nevertheless, I would rather fix this in 10.4 too. --source include/have_innodb.inc --source include/have_debug.inc --source include/have_debug_sync.inc   --echo # --echo # MDEV-22867 Assertion instant.n_core_fields == n_core_fields --echo # in dict_index_t::instant_add_field --echo #   SET @save_freq= @@ GLOBAL .innodb_purge_rseg_truncate_frequency; SET GLOBAL innodb_purge_rseg_truncate_frequency=1;   CREATE TABLE t1 (a BINARY , UNIQUE (a)) ENGINE=InnoDB; INSERT INTO t1 VALUES ( NULL ); ALTER TABLE t1 ADD COLUMN b INT ;   --connect (con1,localhost,root,,test) SET DEBUG_SYNC= 'innodb_inplace_alter_table_enter SIGNAL ddl WAIT_FOR purged' ; send ALTER TABLE t1 DROP a;   --connection default SET DEBUG_SYNC= 'now WAIT_FOR ddl' ; DELETE FROM t1; --source include/wait_all_purged.inc SET DEBUG_SYNC= 'now SIGNAL purged' ;   --connection con1 reap; --disconnect con1 --connection default   DROP TABLE t1; SET DEBUG_SYNC=RESET;   SET GLOBAL innodb_purge_rseg_truncate_frequency=@save_freq; The fix is simple (tested on 10.5): diff --git a/storage/innobase/include/dict0mem.h b/storage/innobase/include/dict0mem.h index 731ff545685..607da60d11c 100644 --- a/storage/innobase/include/dict0mem.h +++ b/storage/innobase/include/dict0mem.h @@ -2354,6 +2354,13 @@ inline void dict_index_t::clear_instant_add() DBUG_ASSERT(is_primary()); DBUG_ASSERT(is_instant()); DBUG_ASSERT(!table->instant); + + /* Do not clear the "instantness" after instant ADD COLUMN if + an instant DROP COLUMN, DROP INDEX is in progress. */ + for (dict_index_t *i= this; (i= UT_LIST_GET_NEXT(indexes, i)) != nullptr; ) + if (i->to_be_dropped) + return; + for (unsigned i= n_core_fields; i < n_fields; i++) fields[i].col->clear_instant(); n_core_fields= n_fields; I will push it to 10.4 after running ./mtr --big-test on both 10.4 and 10.5.
            marko Marko Mäkelä added a comment - - edited

            On a second thought, we must not let the metadata and the persistent data file get out of sync.

            We must block this at the ultimate callers of dict_index_t::clear_instant_add(). The affected callers are btr_discard_only_page_on_level(), btr_cur_optimistic_delete_func(), btr_cur_pessimistic_delete().

            marko Marko Mäkelä added a comment - - edited On a second thought, we must not let the metadata and the persistent data file get out of sync. We must block this at the ultimate callers of dict_index_t::clear_instant_add() . The affected callers are btr_discard_only_page_on_level() , btr_cur_optimistic_delete_func() , btr_cur_pessimistic_delete() .

            MDEV-16678 probably disabled the MDEV-20117 fix in this case where we have an rebuild-avoiding operation combined with ADD/DROP index.

            I was able to simplify the test so that purge is not directly involved. But, for some reason it does not crash in 10.4 for me. It could still be safest to fix this bug also in 10.4.

            --source include/have_innodb.inc
            --source include/have_debug.inc
            --source include/have_debug_sync.inc
            --echo #
            --echo # MDEV-22867 Assertion instant.n_core_fields == n_core_fields
            --echo # in dict_index_t::instant_add_field
            --echo #
             
            CREATE TABLE t1 (a INT PRIMARY KEY, b INT UNIQUE) ENGINE=InnoDB;
            INSERT INTO t1 SET a=1;
             
            connect (prevent_purge,localhost,root,,);
            START TRANSACTION WITH CONSISTENT SNAPSHOT;
            connection default;
            ALTER TABLE t1 ADD COLUMN c INT;
            DELETE FROM t1;
             
            connect (con2,localhost,root,,test);
            SET DEBUG_SYNC='innodb_inplace_alter_table_enter SIGNAL ddl WAIT_FOR emptied';
            send ALTER TABLE t1 DROP b;
             
            connection default;
            SET DEBUG_SYNC='now WAIT_FOR ddl';
            BEGIN;
            INSERT INTO t1 SET a=1;
             
            connection prevent_purge;
            COMMIT;
            disconnect prevent_purge;
             
            connection default;
            ROLLBACK;
            SELECT * FROM t1;
            SET DEBUG_SYNC='now SIGNAL emptied';
             
            connection con2;
            reap;
            disconnect con2;
            connection default;
            DROP TABLE t1;
            SET DEBUG_SYNC=RESET;
            

            Note: there is a slight race condition in the test at the COMMIT. I did not double-check that purge did not kick in at that point. But I do not think that it should be able to. We are holding the page latch when row_undo_mod_must_purge() returns true.

            If you run with innodb_force_recovery=2, then row_undo_mod_must_purge() will not hold, because the purge_sys.view will not be updated.

            marko Marko Mäkelä added a comment - MDEV-16678 probably disabled the MDEV-20117 fix in this case where we have an rebuild-avoiding operation combined with ADD/DROP index. I was able to simplify the test so that purge is not directly involved. But, for some reason it does not crash in 10.4 for me. It could still be safest to fix this bug also in 10.4. --source include/have_innodb.inc --source include/have_debug.inc --source include/have_debug_sync.inc --echo # --echo # MDEV-22867 Assertion instant.n_core_fields == n_core_fields --echo # in dict_index_t::instant_add_field --echo #   CREATE TABLE t1 (a INT PRIMARY KEY , b INT UNIQUE ) ENGINE=InnoDB; INSERT INTO t1 SET a=1;   connect (prevent_purge,localhost,root,,); START TRANSACTION WITH CONSISTENT SNAPSHOT; connection default ; ALTER TABLE t1 ADD COLUMN c INT ; DELETE FROM t1;   connect (con2,localhost,root,,test); SET DEBUG_SYNC= 'innodb_inplace_alter_table_enter SIGNAL ddl WAIT_FOR emptied' ; send ALTER TABLE t1 DROP b;   connection default ; SET DEBUG_SYNC= 'now WAIT_FOR ddl' ; BEGIN ; INSERT INTO t1 SET a=1;   connection prevent_purge; COMMIT ; disconnect prevent_purge;   connection default ; ROLLBACK ; SELECT * FROM t1; SET DEBUG_SYNC= 'now SIGNAL emptied' ;   connection con2; reap; disconnect con2; connection default ; DROP TABLE t1; SET DEBUG_SYNC=RESET; Note: there is a slight race condition in the test at the COMMIT . I did not double-check that purge did not kick in at that point. But I do not think that it should be able to. We are holding the page latch when row_undo_mod_must_purge() returns true . If you run with innodb_force_recovery=2 , then row_undo_mod_must_purge() will not hold, because the purge_sys.view will not be updated.

            It seems that in 10.4, the function instant_metadata_lock() would prevent the race condition. But, it would also hold a page latch on the leftmost leaf page of clustered index for the duration of a possible DROP INDEX operation. It would probably violate the latching order and could cause InnoDB to hang if concurrent DML is executed during the ALTER TABLE…DROP COLUMN…, DROP INDEX operation.

            According to the latching order, a secondary index leaf page latch may be held while looking up something in a clustered index. The following should be what happens in 10.4:

            1. dict_table_t::prepare_instant() acquires an S-latch on the leftmost clustered index leaf page
            2. MDL_EXCLUSIVE is released for the duration of ha_innobase::inplace_alter_table(), which does nothing in this case (there is no ADD INDEX), and upgraded back to MDL_EXCLUSIVE
            3. ha_innobase_commit_inplace_alter_table() will execute the DROP INDEX and finalize the "instant" DROP COLUMN

            So, 10.4 does not seem to be affected. The MDEV-17813 code for instant_metadata_lock() was removed from 10.5 as part of MDEV-16678. It might be simplest to restore that code.

            Restoring instant_metadata_lock() could be a bad idea, because at some point, MDEV-16282 or a related task may implement ADD INDEX in combination with ALGORITHM=NOCOPY variant of DROP COLUMN. I think that it is safest to apply my outlined fix to 10.5.

            marko Marko Mäkelä added a comment - It seems that in 10.4, the function instant_metadata_lock() would prevent the race condition. But, it would also hold a page latch on the leftmost leaf page of clustered index for the duration of a possible DROP INDEX operation. It would probably violate the latching order and could cause InnoDB to hang if concurrent DML is executed during the ALTER TABLE…DROP COLUMN…, DROP INDEX operation. According to the latching order, a secondary index leaf page latch may be held while looking up something in a clustered index. The following should be what happens in 10.4: dict_table_t::prepare_instant() acquires an S-latch on the leftmost clustered index leaf page MDL_EXCLUSIVE is released for the duration of ha_innobase::inplace_alter_table() , which does nothing in this case (there is no ADD INDEX ), and upgraded back to MDL_EXCLUSIVE ha_innobase_commit_inplace_alter_table() will execute the DROP INDEX and finalize the "instant" DROP COLUMN So, 10.4 does not seem to be affected. The MDEV-17813 code for instant_metadata_lock() was removed from 10.5 as part of MDEV-16678 . It might be simplest to restore that code. Restoring instant_metadata_lock() could be a bad idea, because at some point, MDEV-16282 or a related task may implement ADD INDEX in combination with ALGORITHM=NOCOPY variant of DROP COLUMN . I think that it is safest to apply my outlined fix to 10.5.

            People

              marko Marko Mäkelä
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.