MDEV-27798 (MariaDB Server)

SIGSEGV in dict_index_t::reconstruct_fields()

    Description

      The following crash was observed on 10.8:

      10.8 a635c40648519fd6c3729c9657872a16a0a20821

      #0  0x000055635ccef651 in dict_index_t::reconstruct_fields()::{lambda(dict_field_t const&)#1}::operator()(dict_field_t const&) const (o=
            @0x61a00000a9e0: {col = 0xbebebebebebebebe, name = {m_name = 0xbebebebebebebebe <error: Cannot access memory at address 0xbebebebebebebebe>}, prefix_len = 3774, fixed_len = 1003, descending = 0}, __closure=0x7fffda7bcbf0)
          at /mariadb/10.8/storage/innobase/dict/dict0mem.cc:1209
      1209					{ return o.col->ind == c.ind(); });
      

      One column has been instantly dropped, and some columns have been reordered. I tried to guess a test case, but I failed.

      diff --git a/mysql-test/suite/innodb/t/instant_alter_crash.test b/mysql-test/suite/innodb/t/instant_alter_crash.test
      index 43db8f619f3..75f26a80559 100644
      --- a/mysql-test/suite/innodb/t/instant_alter_crash.test
      +++ b/mysql-test/suite/innodb/t/instant_alter_crash.test
      @@ -184,6 +184,12 @@ CREATE TABLE t3(id INT PRIMARY KEY, c2 INT, v2 INT AS(c2) VIRTUAL, UNIQUE(v2))
       ENGINE=InnoDB;
       INSERT INTO t3 SET id=1,c2=1;
       
      +CREATE TABLE t4(id INT PRIMARY KEY,c2 INT,c3 INT,c4 INT,c5 INT,c6 INT,c7 INT)
      +ENGINE=InnoDB;
      +INSERT INTO t4 SET id=1;
      +ALTER TABLE t4 DROP c4, MODIFY c5 INT AFTER c2, ADD (c8 INT, c9 INT),
      +ALGORITHM=INSTANT;
      +
       SET DEBUG_SYNC='innodb_alter_inplace_before_commit SIGNAL ddl WAIT_FOR ever';
       --send
       ALTER TABLE t3 ADD COLUMN c3 TEXT NOT NULL DEFAULT 'sic transit gloria mundi';
      @@ -198,10 +204,12 @@ INSERT INTO t1 VALUES(0,0);
       disconnect ddl;
       --source include/start_mysqld.inc
       
      +SELECT * FROM t4;
       SHOW CREATE TABLE t1;
       SHOW CREATE TABLE t2;
       SHOW CREATE TABLE t3;
      -DROP TABLE t1,t2,t3;
      +SHOW CREATE TABLE t4;
      +DROP TABLE t1,t2,t3,t4;
       
       --remove_files_wildcard $MYSQLD_DATADIR/test #sql*.frm
       --list_files $MYSQLD_DATADIR/test
      

      The following should fix the failure by preventing the out-of-bounds access:

      diff --git a/storage/innobase/dict/dict0mem.cc b/storage/innobase/dict/dict0mem.cc
      index 8d53f646131..37cf7dc687f 100644
      --- a/storage/innobase/dict/dict0mem.cc
      +++ b/storage/innobase/dict/dict0mem.cc
      @@ -1177,6 +1177,8 @@ inline bool dict_index_t::reconstruct_fields()
       {
       	DBUG_ASSERT(is_primary());
       
      +	const auto old_n_fields{n_fields};
      +
       	n_fields = (n_fields + table->instant->n_dropped)
       		& dict_index_t::MAX_N_FIELDS;
       	n_def = (n_def + table->instant->n_dropped)
      @@ -1204,11 +1206,11 @@ inline bool dict_index_t::reconstruct_fields()
       		} else {
       			DBUG_ASSERT(!c.is_not_null());
       			const auto old = std::find_if(
      -				fields + n_first, fields + n_fields,
      +				fields + n_first, fields + old_n_fields,
       				[c](const dict_field_t& o)
       				{ return o.col->ind == c.ind(); });
       
      -			if (old >= fields + n_fields
      +			if (old >= fields + old_n_fields
       			    || old->prefix_len
       			    || old->col != &table->cols[c.ind()]) {
       				return true;
      

      I hope that mleich can come up with a test case for this.
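
      The idea behind the fix can be shown with a minimal, self-contained sketch. It is not the InnoDB code: Col, Field, find_old_field() and demo_lookup() are hypothetical stand-ins, and the sketch assumes (as the diff suggests) that the array being searched still holds only the fields that existed before n_fields was enlarged by n_dropped. The search therefore has to stop at the count captured before the enlargement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for dict_col_t and dict_field_t (illustration only).
struct Col { unsigned ind; };
struct Field { const Col* col; };

// Search fields[n_first, old_n_fields) for the field whose column has
// index `ind`. The upper bound is the field count captured BEFORE it was
// enlarged, so the lambda never dereferences a slot past the old array.
// Returns nullptr if no such field exists.
const Field* find_old_field(const Field* fields, std::size_t n_first,
                            std::size_t old_n_fields, unsigned ind)
{
    assert(n_first <= old_n_fields);
    const Field* const end = fields + old_n_fields;
    const Field* const old = std::find_if(
        fields + n_first, end,
        [ind](const Field& o) { return o.col->ind == ind; });
    return old == end ? nullptr : old;
}

// Builds an "old" array of 3 fields, pretends that one dropped column
// enlarged the count to 4, and looks up column index `ind`, skipping the
// first field. Returns the position found, or -1 if there is no match.
int demo_lookup(unsigned ind)
{
    const Col cols[3] = {{0}, {1}, {2}};
    const Field fields[3] = {{&cols[0]}, {&cols[1]}, {&cols[2]}};
    const std::size_t old_n_fields = 3;             // captured first
    const std::size_t n_fields = old_n_fields + 1;  // enlarged by n_dropped = 1
    (void) n_fields;  // using this as the bound would read fields[3]
    const Field* const old = find_old_field(fields, 1, old_n_fields, ind);
    return old ? static_cast<int>(old - fields) : -1;
}
```

      In this sketch, demo_lookup(2) finds the field at position 2, while a column index that is absent from the searched range yields -1 instead of reading past the end of the array.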

        Activity

          # git clone https://github.com/mleich1/rqg --branch experimental RQG
          #
          # GIT_SHOW: HEAD -> experimental, origin/experimental c1922175f532c03ac4fd9a4c0a4170bc1b0b0665 2022-02-09T18:46:35+01:00
          # rqg.pl  : Version 4.0.4 (2021-12)
          #
          # $RQG_HOME/rqg.pl \
          # --grammar=conf/mariadb/table_stress_innodb_nocopy1.yy \
          # --gendata=conf/mariadb/table_stress.zz \
          # --gendata_sql=conf/mariadb/table_stress.sql \
          # --mysqld=--loose-innodb_lock_schedule_algorithm=fcfs \
          # --mysqld=--loose-idle_write_transaction_timeout=0 \
          # --mysqld=--loose-idle_transaction_timeout=0 \
          # --mysqld=--loose-idle_readonly_transaction_timeout=0 \
          # --mysqld=--connect_timeout=60 \
          # --mysqld=--interactive_timeout=28800 \
          # --mysqld=--slave_net_timeout=60 \
          # --mysqld=--net_read_timeout=30 \
          # --mysqld=--net_write_timeout=60 \
          # --mysqld=--loose-table_lock_wait_timeout=50 \
          # --mysqld=--wait_timeout=28800 \
          # --mysqld=--lock-wait-timeout=86400 \
          # --mysqld=--innodb-lock-wait-timeout=50 \
          # --no-mask \
          # --queries=10000000 \
          # --seed=random \
          # --reporters=Backtrace \
          # --reporters=ErrorLog \
          # --reporters=Deadlock1 \
          # --validators=None \
          # --mysqld=--log_output=none \
          # --mysqld=--log_bin_trust_function_creators=1 \
          # --mysqld=--loose-debug_assert_on_not_freed_memory=0 \
          # --engine=InnoDB \
          # --restart_timeout=360 \
          # --mysqld=--plugin-load-add=file_key_management.so \
          # --mysqld=--loose-file-key-management-filename=$RQG_HOME/conf/mariadb/encryption_keys.txt \
          # --mysqld=--plugin-load-add=provider_lzo.so \
          # --mysqld=--plugin-load-add=provider_bzip2.so \
          # --mysqld=--plugin-load-add=provider_lzma \
          # --mysqld=--plugin-load-add=provider_snappy \
          # --mysqld=--plugin-load-add=provider_lz4 \
          # --duration=300 \
          # --mysqld=--loose-innodb_fatal_semaphore_wait_threshold=300 \
          # --mysqld=--loose-innodb_read_only_compressed=OFF \
          # --reporters=CrashRecovery1 \
          # --duration=100 \
          # --mysqld=--innodb_stats_persistent=on \
          # --mysqld=--innodb_adaptive_hash_index=on \
          # --mysqld=--log-bin \
          # --mysqld=--sync-binlog=1 \
          # --mysqld=--loose-innodb_evict_tables_on_commit_debug=on \
          # --mysqld=--loose-max-statement-time=30 \
          # --threads=9 \
          # --mysqld=--innodb_use_native_aio=1 \
          # --mysqld=--innodb_rollback_on_timeout=OFF \
          # --vardir_type=fast \
          # --mysqld=--innodb_page_size=8K \
          # --mysqld=--innodb-buffer-pool-size=256M \
          # <local settings>
           
          Per Marko:
          Most probably, only ADD/DROP/reorder column + ALGORITHM=INSTANT is required.
          

          mleich Matthias Leich added the comment above.

          The grammar MDEV-27798.yy replayed the problem with only one thread.
          The automatic RQG Simplifier transformed it into a logically equivalent grammar,
          but that one did not replay within 640 RQG runs.

          mleich Matthias Leich added the comment above.

          origin/10.8 1c5b099a9619c953e7510bbafca89353ad0a020c 2022-02-17T20:06:33+02:00
          with the fix above applied behaved well in the RQG test battery for broad-range functional coverage.
          

          mleich Matthias Leich added the comment above.

          I tried to write the following test to reproduce the failure, based on an rr replay trace of an RQG-based test, but it fails to crash on my system.

          --source include/have_innodb.inc
          --source include/have_debug_sync.inc
           
          CREATE TABLE t1(a INT PRIMARY KEY) ENGINE=InnoDB;
          CREATE TABLE t7 (col1 INT PRIMARY KEY, col2 INT, col_int INTEGER,
          col_string INTEGER, col_varchar VARCHAR(500), col_text TEXT) ENGINE = InnoDB;
           
          let $N=175;
          while ($N) {
          ALTER TABLE t7 DROP COLUMN col_text;
          ALTER TABLE t7 ADD COLUMN col_text_copy TEXT FIRST;
          ALTER TABLE t7 CHANGE COLUMN col_text_copy col_text TEXT;
          dec $N;
          }
           
          connect (ddl,localhost,root,,);
          SET DEBUG_SYNC='innodb_inplace_alter_table_enter SIGNAL blocked WAIT_FOR ever';
          send ALTER TABLE t7 DROP COLUMN col_text;
           
          connection default;
          SET DEBUG_SYNC='now WAIT_FOR blocked';
          INSERT INTO t1 SET a=1;
          --let $shutdown_timeout=0
          --source include/restart_mysqld.inc
           
          disconnect ddl;
          CHECK TABLE t7;
          DROP TABLE t7;
          DROP TABLE t1;
          

          The rr replay trace that I analyzed ended in a SIGKILL during the commit of DROP COLUMN (before any redo log was written) and then the failed recovery. The ALTER TABLE statements were not always run in such an optimal order, and there were about 500 such statements in total (so, some of them must have failed). In the end, the total number of instantly dropped columns would be 175, matching the above test.

          The DEBUG_SYNC and INSERT trick is there to ensure that the last DROP COLUMN gets stuck before it writes any log. Possibly the INSERT and the auxiliary table should be removed, because in the rr replay trace that I checked, the server did not write any redo log after the commit for DROP COLUMN was invoked (and quickly interrupted by the SIGKILL). I tried removing all references to the table t1, and it did not change the outcome for me.

          If we are unable to create a regression test for this, I think that we can do without one, given that the fix is so simple and that it behaved well in broad-range testing.

          marko Marko Mäkelä added the comment above.

          I pushed the fix without a test case.

          marko Marko Mäkelä added the comment above.
          mleich Matthias Leich added a comment (edited):

          IMHO pushing the fix without an MTR-based test is acceptable.
          The bug was never replayed on 10.8 + the fix.
          10.8 + the fix did not show any new problems.
          Whether the bug shows up on plain 10.8 depends heavily on the build type, the general load on the testing box, etc.
          The considerable time invested in getting a simplified RQG replay test
          also shows how heavily we depend on timing on the box.
          


          People

            marko Marko Mäkelä
            marko Marko Mäkelä