MariaDB Server
MDEV-27234

InnoDB dictionary recovery wrongly uses READ UNCOMMITTED isolation level instead of READ COMMITTED

Details

    Description

      The following patch adds an extra checkpoint in fil_rename_tablespace(). This leads
      to a failure to write a few tablespace file names during fil_names_clear().

      diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
      index 681cee32fd5..0b6e4764d98 100644
      --- a/storage/innobase/fil/fil0fil.cc
      +++ b/storage/innobase/fil/fil0fil.cc
      @@ -1918,6 +1918,7 @@ fil_rename_tablespace(
              ut_ad(strchr(new_file_name, '/'));
       
              if (!recv_recovery_is_on()) {
      +               log_make_checkpoint();
                      mysql_mutex_lock(&log_sys.mutex);
              }
      

      The following test cases were failing in 10.6:

      innodb.innodb-alter-tempfile
      CURRENT_TEST: innodb.innodb-alter-tempfile
      mysqltest: At line 45: query 'show create table t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
       
      innodb.alter_crash 'innodb'              w4 [ fail ]
              Test ended at 2021-12-12 10:22:42
       
      CURRENT_TEST: innodb.alter_crash
      mysqltest: At line 125: query 'INSERT INTO t2 VALUES (5,6),(7,8)' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
       
      innodb.instant_alter_crash 'innodb'      w5 [ fail ]
              Test ended at 2021-12-12 10:23:28
       
      CURRENT_TEST: innodb.instant_alter_crash
      mysqltest: At line 200: query 'SHOW CREATE TABLE t3' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
       
      innodb.truncate_crash 'innodb'           w4 [ fail ]
              Test ended at 2021-12-12 10:24:16
       
      CURRENT_TEST: innodb.truncate_crash
      mysqltest: At line 21: query 'SELECT COUNT(*) FROM t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
      
      

      Attachments

      Issue Links

      Activity

            let datadir= `select @@datadir`;
             
            --let $_server_id= `SELECT @@server_id`
            --let $_expect_file_name=$MYSQLTEST_VARDIR/tmp/mysqld.$_server_id.expect
             
            CREATE TABLE t1 (f1 INT NOT NULL, f2 INT NOT NULL) ENGINE=innodb;
            SET debug_dbug='+d,innodb_alter_commit_crash_before_commit';
             
            --exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
             
            --error 2013
            ALTER TABLE t1 ADD PRIMARY KEY (f2, f1);
             
             
            --source include/start_mysqld.inc
             
            show create table t1;
            
            

            Before crashing:
            ================
            InnoDB does the following:
            1) create a table t1(table id is 18)
            2) create a table #sql-alter-10b0-4.ibd(table id is 19) as part of alter operation
            3) InnoDB renames t1.ibd to #sql-ib19.ibd
            4) InnoDB renames #sql-alter-10b0-4.ibd to t1.ibd
            5) InnoDB drops #sql-ib19.ibd
            6) Kills the server before alter finishes.

            Data directory has the following:
            =================================
            '#sql-alter-10b0-4.frm'  '#sql-ib19.ibd'   t1.frm   t1.ibd
            

            InnoDB sys tables
            ===================
            2 entries for table id 18: one is test/t1, the other is test/#sql-ib19.
            Both entries are delete-marked.

            2 entries for table id 19: one is test/#sql-alter-10b0-4, the other is test/t1.
            The first one is delete-marked, the second one exists.

            Recovery:
            =========

            InnoDB tries to load the table during trx_resurrect():

            • table id 18 fails to load because its SYS_TABLES records are delete-marked.
            • table id 19 loads successfully, and its tablespace (for t1) is loaded as well.

            Undo log processing done by InnoDB:

            InnoDB removes the new entries from SYS_TABLES.

            InnoDB then undoes the rename operation (t1 to #sql-alter-10b0-4).
            While opening the table, InnoDB fails to re-load the tablespace for #sql-alter-10b0-4.ibd
            due to the following code in dict_load_tablespace():

                    /* The tablespace may already be open. */
                    table->space = fil_space_for_table_exists_in_mem(table->space_id,
                                                                     table->flags);
                    if (table->space) {
                            return;
                    }
            

            and the rename itself succeeds. Now the data directory contains
            '#sql-alter-10b0-4.frm' '#sql-ib19.ibd' t1.frm '#sql-alter-10b0-4.ibd'

            InnoDB then tries to undo the rename operation (#sql-ib19 to t1):

            • InnoDB tries to open the tablespace for t1, fails to find the file in the data
              directory, and therefore skips the rename operation. Now InnoDB no longer has the table t1.
            thiru Thirunarayanan Balathandayuthapani added a comment

            I think that the following failure might be related to this:

            atomic.rename_trigger 'innodb'           w4 [ fail ]
                    Test ended at 2022-02-07 00:56:07
            CURRENT_TEST: atomic.rename_trigger
            mysqltest: At line 159: query 'let $res=`select t1.a+t2.b+t3.c+t4.d from t1,t2,t3,t4`' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
            "engine: innodb  crash point: ddl_log_rename_after_failed_rename_trigger  position: 3"
            More results from queries before failure can be found in /buildbot/amd64-ubuntu-1804-msan/build/mysql-test/var/4/log/rename_trigger.log
             - skipping '/buildbot/amd64-ubuntu-1804-msan/build/mysql-test/var/4/log/atomic.rename_trigger-innodb/'
            ***Warnings generated in error logs during shutdown after running tests: atomic.rename_trigger
            2022-02-07  0:56:06 0 [ERROR] InnoDB: Tablespace 5 was not found at ./test/t5.ibd.
            2022-02-07  0:56:06 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace.
            2022-02-07  0:56:06 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1445] with error Tablespace not found
            

            marko Marko Mäkelä added a comment

            Another possibly related failure:

            encryption.innodb-redo-badkey 'cbc,innodb,strict_crc32' w4 [ fail ]  Found warnings/errors in server log file!
                    Test ended at 2022-02-06 22:48:09
            line
            2022-02-06 22:48:06 0 [ERROR] InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before FILE_CHECKPOINT for tablespace 5
            2022-02-06 22:48:07 0 [ERROR] InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before FILE_CHECKPOINT for tablespace 5
            

            marko Marko Mäkelä added a comment

            I tried this on 10.8 (after MDEV-14425), but it did not fail:

            diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
            index 0273fcd9a03..8f02ff51242 100644
            --- a/storage/innobase/fil/fil0fil.cc
            +++ b/storage/innobase/fil/fil0fil.cc
            @@ -1857,6 +1857,7 @@ static bool fil_rename_tablespace(uint32_t id, const char *old_path,
             	ut_ad(strchr(new_file_name, '/'));
             
             	if (!recv_recovery_is_on()) {
            +		log_make_checkpoint();
             		log_sys.latch.wr_lock(SRW_LOCK_CALL);
             	}
             
            

            10.8 e8a2a70cf869cf32a50144704646a5f34e8d7be9 (patched)

            innodb.instant_alter_crash 'innodb'      [ pass ]   2079
            innodb.alter_crash 'innodb'              [ pass ]   2489
            atomic.rename_trigger 'innodb'           [ pass ]  33901
            

            Also, the crash injection test case recovered fine. Does this still fail on the latest 10.6?

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            After I ran the crash injection test on 10.7 af87186c1d3f9eeab73bc38b1faa51ba17fa8017, the var/mysqld.1/data/test only contained db.opt, t1.frm, t1.ibd.

            Edit: With the patch, the recovery fails. (But not on 10.8 with the patch.)

            10.7 af87186c1d3f9eeab73bc38b1faa51ba17fa8017 (patched)

            mysqltest: At line 20: query 'show create table t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
            


            10.8 with MDEV-14425 but right before MDEV-27774 fails to recover:

            10.8 8c7c92adf384b2d0294502d5b38684e68d5eaf27 (patched)

            mysqltest: At line 20: query 'show create table t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
            

            marko Marko Mäkelä added a comment

            Problem exists on 10.6.2 also. The commit below is the cause of the issue:

            [~/mariarepo/10.6/10.6-sample]$git log -p -1 1bd681c8b3c
            commit 1bd681c8b3c5213ce1f7976940a7dc38b48a0d39
            Author: Marko Mäkelä <marko.makela@mariadb.com>
            Date:   Wed Jun 9 17:02:55 2021 +0300
             
                MDEV-25506 (3 of 3): Do not delete .ibd files before commit
            

            thiru Thirunarayanan Balathandayuthapani added a comment

            trx_resurrect_table_locks() fails to load the table because the sys table entries
            contain a delete-marked record for it. This leads to a failure to load tables during
            recovery.

            thiru Thirunarayanan Balathandayuthapani added a comment

            Thank you. I see that the following code in dict_load_table_on_id() makes an incorrect assumption:

            	if (page_rec_is_user_rec(rec)) {
            		/*---------------------------------------------------*/
            		/* Now we have the record in the secondary index
            		containing the table ID and NAME */
            check_rec:
            		field = rec_get_nth_field_old(
            			rec, DICT_FLD__SYS_TABLE_IDS__ID, &len);
            		ut_ad(len == 8);
             
            		/* Check if the table id in record is the one searched for */
            		if (table_id == mach_read_from_8(field)) {
            			if (rec_get_deleted_flag(rec, 0)) {
            				/* Until purge has completed, there
            				may be delete-marked duplicate records
            				for the same SYS_TABLES.ID, but different
            				SYS_TABLES.NAME. */
            				while (btr_pcur_move_to_next(&pcur, &mtr)) {
            					rec = btr_pcur_get_rec(&pcur);
             
            					if (page_rec_is_user_rec(rec)) {
            						goto check_rec;
            					}
            				}
            

            This code is essentially implementing READ UNCOMMITTED while we need READ COMMITTED here. It is wrong to filter out delete-marked secondary index records (records of SYS_TABLES.ID). The secondary index indeed only contains delete-marked records for this SYS_TABLES.ID=18: (ID,NAME)=(18,'test/#sql-ib19'),(18,'test/t1'). In the clustered index, I see a delete-marked record for the former and a non-delete-marked record for 'test/t1', with the wrong ID=19 for the most recent (uncommitted) version. Both records contain DB_TRX_ID=25.

            Implementing proper READ COMMITTED might be easiest by allocating a transaction with a read view, and adapting some logic from row_search_mvcc() for accessing the dictionary tables. The functions row_sel_sec_rec_is_for_clust_rec() and row_vers_build_for_consistent_read() have to be part of the solution.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            I think that we can implement "poor man’s MVCC" for the secondary index scan like this:

            diff --git a/storage/innobase/dict/dict0load.cc b/storage/innobase/dict/dict0load.cc
            index 71dedd48c29..88351b06201 100644
            --- a/storage/innobase/dict/dict0load.cc
            +++ b/storage/innobase/dict/dict0load.cc
            @@ -2536,11 +2536,16 @@ dict_load_table_on_id(
             
             		/* Check if the table id in record is the one searched for */
             		if (table_id == mach_read_from_8(field)) {
            -			if (rec_get_deleted_flag(rec, 0)) {
            -				/* Until purge has completed, there
            -				may be delete-marked duplicate records
            -				for the same SYS_TABLES.ID, but different
            -				SYS_TABLES.NAME. */
            +			field = rec_get_nth_field_old(rec,
            +				DICT_FLD__SYS_TABLE_IDS__NAME, &len);
            +			table = dict_sys.load_table(
            +				{reinterpret_cast<const char*>(field),
            +				 len}, ignore_err);
            +			if (table && table->id != table_id) {
            +				ut_ad(rec_get_deleted_flag(rec, 0));
            +				table = nullptr;
            +			}
            +			if (!table) {
             				while (btr_pcur_move_to_next(&pcur, &mtr)) {
             					rec = btr_pcur_get_rec(&pcur);
             
            @@ -2548,13 +2553,6 @@ dict_load_table_on_id(
             						goto check_rec;
             					}
             				}
            -			} else {
            -				/* Now we get the table name from the record */
            -				field = rec_get_nth_field_old(rec,
            -					DICT_FLD__SYS_TABLE_IDS__NAME, &len);
            -				table = dict_sys.load_table(
            -					{reinterpret_cast<const char*>(field),
            -					 len}, ignore_err);
             			}
             		}
             	}
            

            For the first matching record, dict_sys.load_table() would find nothing, because it is still performing READ UNCOMMITTED and that record is delete-marked. For the second one, it returns a table with the wrong id=19 because it fails to fetch the latest committed version.

            Edit: I was thinking to use purge_sys.view for the MVCC read, but I realized that READ COMMITTED semantics would require the newest possible read view instead of the oldest one.


            It may be much easier to roll back a recovered DDL transaction before resurrecting the table locks, by simply continuing to use READ UNCOMMITTED, which thanks to the added rollback would be equivalent to READ COMMITTED.

            If we implement READ COMMITTED, it will have to be implemented for accessing all dictionary tables. For example, if both the table and some indexed columns or base columns of virtual columns were renamed in a recovered incomplete DDL transaction, from each dictionary table we must fetch the correct names.

            marko Marko Mäkelä added a comment

            It turns out that we do have to implement the READ COMMITTED access to the data dictionary. Even if we defer trx_resurrect_table_locks() until after the trx_rollback_active(false) call that would roll back the DDL transaction, that very rollback would fail in its attempt to load the table definition.

            marko Marko Mäkelä added a comment

            Here is a test that halts the commit and then forces a checkpoint so that the FILE_ record recovery will not interfere with the processing of the undo logs and the data dictionary tables:

            --source include/have_innodb.inc
            --source include/have_debug.inc
            --source include/have_debug_sync.inc
             
            CREATE TABLE t1 (a INT NOT NULL) ENGINE=InnoDB STATS_PERSISTENT=0;
             
            connect ddl,localhost,root;
            SET DEBUG_SYNC='after_trx_committed_in_memory SIGNAL stuck WAIT_FOR ever EXECUTE 2';
            send ALTER TABLE t1 ADD PRIMARY KEY(a);
             
            connection default;
            SET DEBUG_SYNC='now WAIT_FOR stuck';
            SET DEBUG_SYNC='now SIGNAL ever';
            SET DEBUG_SYNC='now WAIT_FOR stuck';
             
            SET GLOBAL innodb_log_checkpoint_now=ON;
             
            --let $shutdown_timeout=0
            --source include/restart_mysqld.inc
             
            disconnect ddl;
             
            SHOW CREATE TABLE t1;
            SELECT * FROM t1;
            DROP TABLE t1;
            

            The test actually passes (with SHOW CREATE TABLE wrongly indicating that the table includes a PRIMARY KEY), and an orphan table will be left behind:

            10.9 35fcae10408aa4c54967d118c8ccde1c3736b85f

            MTR's internal check of the test case 'main.MDEV-27234' failed.
            --- /dev/shm/10.9/mysql-test/var/tmp/check-mysqld_1.result	2022-03-10 10:53:59.764275437 +0200
            +++ /dev/shm/10.9/mysql-test/var/tmp/check-mysqld_1.reject	2022-03-10 10:54:00.352281258 +0200
            @@ -1092,3 +1092,4 @@
             partition	1.0	DISABLED	STORAGE ENGINE	100900.0	NULL	NULL	Mikael Ronstrom, MySQL AB	Partition Storage Engine Helper	GPL	OFF	Stable	1.0
             VARIABLE_NAME	VARIABLE_VALUE
             DEBUG_SYNC	ON - current signal: ''
            +#sql-ib19.ibd
             
            mysqltest: Result length mismatch
            

            On 10.6 36a19f94ce102be87b22fdc7ad147ab7d5ed4331 the test behaves in exactly the same way.

            marko Marko Mäkelä added a comment

            The only dictionary tables that contain any secondary indexes are SYS_TABLES and SYS_FOREIGN. The function dict_load_foreigns() does access the secondary index on SYS_FOREIGN.FOR_NAME, in the READ UNCOMMITTED fashion. We must not filter out delete-marked records there, and we must make dict_load_foreign() retrieve the last committed version, or return a status that no committed record existed.

            Every access to the clustered indexes must be corrected in a similar way, so that we will load the latest committed version (READ COMMITTED).

            I expect that the function trx_t::evict_table() can be removed as part of fixing this bug.

            marko Marko Mäkelä added a comment

            data.tar.xz is a data directory with innodb_page_size=64k that might demonstrate this bug. On accessing the table t3 (on which DDL was in progress when the server was killed), the following mismatch would be reported:

            10.6 dafc5fb9c17637c6233e8a73c890f158d645a37c

            2022-03-15 16:16:19 3 [ERROR] InnoDB could not find key no 0 with name Marvão_uidx1 from dict cache for table test/t3
            

            marko Marko Mäkelä added a comment

            Some adjustment may be needed for recovering instant ALTER TABLE metadata (MDEV-11369 and MDEV-15562). Currently, the rec_get_offsets() call in btr_cur_instant_init_low() assumes that dict_index_t::n_fields corresponds to the READ UNCOMMITTED version of the table definition. An assertion would fail during the first restart of the test innodb.instant_alter_crash.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment (edited)

            The function trx_t::evict_table() is necessary for evicting table definitions on the rollback of DDL transactions, for example, after a CREATE TABLE that failed due to incorrect FOREIGN KEY constraints.

            My current work-in-progress fix still shows the index mismatch message for data.tar.xz:

            2022-03-21 17:11:23 3 [ERROR] Cannot find index Marvão_uidx1 in InnoDB index dictionary.
            2022-03-21 17:11:23 3 [ERROR] InnoDB indexes are inconsistent with what defined in .frm for table ./test/t3
            2022-03-21 17:11:23 3 [ERROR] InnoDB could not find key no 0 with name Marvão_uidx1 from dict cache for table test/t3
            2022-03-21 17:11:23 3 [ERROR] InnoDB: Table test/t3 contains 3 indexes inside InnoDB, which is different from the number of indexes 3 defined in the .frm file. See https://mariadb.com/kb/en/innodb-troubleshooting/
            


            I spent a lot of time on the test case before I confirmed with rr replay that the ADD PRIMARY KEY had actually been committed before the server was killed. The actual reason why the #sql-ib file was not dropped is that the fil_delete_tablespace() call in row_purge_remove_clust_if_poss_low() did not find the tablespace, because dict_check_sys_tables() was skipping delete-marked records of SYS_TABLES.

            My work-in-progress fix of that broke a number of mtr tests (mostly IMPORT TABLESPACE) and still fails to recover data.tar.xz correctly.
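The failure mode just described can be illustrated with a toy model (all names here are hypothetical stand-ins, not the real fil_system or dict_check_sys_tables() implementation): if the recovery scan of SYS_TABLES skips delete-marked records, the tablespace of a dropped-but-not-yet-purged table is never registered, and a later purge cannot find the file to delete.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy stand-in for the fil_system registry of open tablespaces.
struct FilSystem {
  std::map<uint64_t, std::string> spaces;  // space_id -> file name
  // Models fil_delete_tablespace(): returns false when the
  // tablespace was never registered ("Tablespace is missing").
  bool delete_tablespace(uint64_t id) { return spaces.erase(id) != 0; }
};

struct SysTablesRec {
  uint64_t space_id;
  std::string name;
  bool delete_marked;
};

// Toy stand-in for the SYS_TABLES scan in dict_check_sys_tables().
void scan_sys_tables(const SysTablesRec& rec, FilSystem& fil,
                     bool skip_delete_marked) {
  if (skip_delete_marked && rec.delete_marked)
    return;  // the buggy behaviour: the tablespace is never registered
  fil.spaces.emplace(rec.space_id, rec.name);
}
```

With skip_delete_marked set, a later delete_tablespace() for the #sql-ib file fails because the space was never registered; without it, purge can find and drop the file.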

            marko Marko Mäkelä added a comment

            The data.tar.xz case cannot be helped by this fix. The warning message mentions an unexpectedly missing index. In fact, in dict_load_index_low(), a delete-marked record of SYS_INDEXES for it was encountered and correctly filtered out in my development branch, because no active transaction for the DB_TRX_ID was found. We would need an rr replay trace of such a failure to understand why the DDL recovery apparently chose the wrong .frm file when recovering from a crash during DROP INDEX Marvão_uidx1.

            marko Marko Mäkelä added a comment

            I was able to fix the recovery of data.tar.xz. It turned out that there was a SYS_INDEXES record for an ADD INDEX stub that had been deleted by a committed transaction. The table->def_trx_id was wrongly updated to the DB_TRX_ID of that record, and InnoDB wrongly informed the DDL log recovery that the ALTER TABLE transaction had been committed inside InnoDB. Yes, we must pay attention to the DB_TRX_ID of committed delete-marked SYS_INDEXES records, but only if the SYS_INDEXES.NAME does not identify the index as incomplete (by starting with a 0xff byte).
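The 0xff rule above can be sketched as a small predicate (illustrative names only, not the actual InnoDB symbols): a SYS_INDEXES.NAME beginning with the byte 0xff marks an index stub that was never completed by ALTER TABLE, and a committed delete-marked record for such a stub must not advance table->def_trx_id.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of the rule described above. An index whose
// SYS_INDEXES.NAME starts with the byte 0xff is an incomplete
// ADD INDEX stub.
inline bool index_is_incomplete(const std::string& sys_indexes_name) {
  return !sys_indexes_name.empty() &&
         static_cast<unsigned char>(sys_indexes_name[0]) == 0xff;
}

// Pay attention to the DB_TRX_ID of a committed delete-marked
// SYS_INDEXES record only when the name identifies a complete index.
inline bool should_bump_def_trx_id(bool delete_marked,
                                   const std::string& name) {
  return delete_marked && !index_is_incomplete(name);
}
```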

            For the crash in the recovery of the test innodb.instant_alter_crash I got an idea: btr_cur_instant_init_low() should perform a READ COMMITTED of the instant ADD COLUMN metadata record. In the case of the crash, that record had been inserted by an uncommitted transaction.

            marko Marko Mäkelä added a comment
            mleich Matthias Leich added a comment (edited)

            origin/bb-10.6-MDEV-27234 13dfdd3953c076177f8b61355cb70a887a958b4b 2022-03-24T12:13:54+02:00
            behaved well in RQG testing (a battery for broad-range coverage only).


            People

              marko Marko Mäkelä
              thiru Thirunarayanan Balathandayuthapani
              Votes: 0
              Watchers: 3

