[MDEV-27234] InnoDB dictionary recovery wrongly uses READ UNCOMMITTED isolation level instead of READ COMMITTED Created: 2021-12-12  Updated: 2022-12-01  Resolved: 2022-03-28

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6
Fix Version/s: 10.6.8, 10.7.4, 10.8.3, 10.9.1

Type: Bug Priority: Major
Reporter: Thirunarayanan Balathandayuthapani Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Attachments: File data.tar.xz    
Issue Links:
Blocks
blocks MDEV-27836 Assertion `n_fields <= ulint(index->n... Closed
Problem/Incident
causes MDEV-29440 InnoDB instant ALTER TABLE recovery w... Closed
is caused by MDEV-25506 Atomic DDL: .frm file is removed and ... Closed
Relates
relates to MDEV-23721 Assertion ‘node->table->is_temporary(... Closed
relates to MDEV-26954 Failed ALTER causes table corruption ... Closed
relates to MDEV-27111 atomic.rename_table test case fails o... Closed

 Description   

The following patch tries to do extra checkpoint in fil_rename_tablespace(). This lead
to fail to write few tablespace file name during fil_names_clear().

diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
index 681cee32fd5..0b6e4764d98 100644
--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -1918,6 +1918,7 @@ fil_rename_tablespace(
        ut_ad(strchr(new_file_name, '/'));
 
        if (!recv_recovery_is_on()) {
+               log_make_checkpoint();
                mysql_mutex_lock(&log_sys.mutex);
        }

The following test case were failing in 10.6:

innodb.innodb-alter-tempfile
CURRENT_TEST: innodb.innodb-alter-tempfile
mysqltest: At line 45: query 'show create table t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
 
innodb.alter_crash 'innodb'              w4 [ fail ]
        Test ended at 2021-12-12 10:22:42
 
CURRENT_TEST: innodb.alter_crash
mysqltest: At line 125: query 'INSERT INTO t2 VALUES (5,6),(7,8)' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
 
innodb.instant_alter_crash 'innodb'      w5 [ fail ]
        Test ended at 2021-12-12 10:23:28
 
CURRENT_TEST: innodb.instant_alter_crash
mysqltest: At line 200: query 'SHOW CREATE TABLE t3' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB
 
innodb.truncate_crash 'innodb'           w4 [ fail ]
        Test ended at 2021-12-12 10:24:16
 
CURRENT_TEST: innodb.truncate_crash
mysqltest: At line 21: query 'SELECT COUNT(*) FROM t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB



 Comments   
Comment by Thirunarayanan Balathandayuthapani [ 2021-12-14 ]

let datadir= `select @@datadir`;
 
--let $_server_id= `SELECT @@server_id`
--let $_expect_file_name=$MYSQLTEST_VARDIR/tmp/mysqld.$_server_id.expect
 
CREATE TABLE t1 (f1 INT NOT NULL, f2 INT NOT NULL) ENGINE=innodb;
SET debug_dbug='+d,innodb_alter_commit_crash_before_commit';
 
--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
 
--error 2013
ALTER TABLE t1 ADD PRIMARY KEY (f2, f1);
 
 
--source include/start_mysqld.inc
 
show create table t1;

Before crashing:
================
InnoDB does the following:
1) create a table t1(table id is 18)
2) create a table #sql-alter-10b0-4.ibd(table id is 19) as part of alter operation
3) InnoDB renames t1.ibd to #sql-ib19.ibd
4) InnoDB renames #sql-alter-10b0-4.ibd to t1.ibd
5) InnoDB drops #sql-ib19.ibd
6) Kills the server before alter finishes.

Data directory has the following:
=================================
'#sql-alter-10b0-4.frm'  '#sql-ib19.ibd'   t1.frm   t1.ibd

InnoDB sys tables
===================
2 entries for table id (18). one is test/t1, other is #sql-ibd19.ibd
Both entries are deleted.

2 entries for table id (19). one is #sql-alter-10b0-4.ibd, other is test/t1.
first one is deleted, 2nd one exist.

Recovery:
=========

InnoDB tries to load the table during trx_resurrect():

  • table id 18 fails to load the table due to deleted entries in SYS_TABLES record.
  • table id 19 load was successful and loads the tablespace also. (for tablespace t1)

Undo log processing done by InnoDB:

InnoDB does removal of new entries in SYS_TABLES.

InnoDB does undo of rename operation (t1 to #sql-alter-10b0-4.ibd).
During opening of table, InnoDB fails to re-load the tablespace for #sql-alter-10b04.ibd
due to the following code in dict_load_tablespace():

        /* The tablespace may already be open. */
        table->space = fil_space_for_table_exists_in_mem(table->space_id,
                                                         table->flags);
        if (table->space) {
                return;
        }

and it does rename successfully as well. Now the data directory has
'#sql-alter-10b0-4.frm' '#sql-ib19.ibd' t1.frm '#sql-alter-10b0-4.ibd'

InnoDB tries to undo the rename operation (#sql-ib19.ibd to t1)

  • InnoDB tries to open the tablespace for t1 and it fails to load the file in data directory.
    and it leads to skipping the rename operation. Now, InnoDB doesn't have table t1.
Comment by Marko Mäkelä [ 2022-02-07 ]

I think that the following failure might be related to this:

atomic.rename_trigger 'innodb'           w4 [ fail ]
        Test ended at 2022-02-07 00:56:07
CURRENT_TEST: atomic.rename_trigger
mysqltest: At line 159: query 'let $res=`select t1.a+t2.b+t3.c+t4.d from t1,t2,t3,t4`' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
"engine: innodb  crash point: ddl_log_rename_after_failed_rename_trigger  position: 3"
More results from queries before failure can be found in /buildbot/amd64-ubuntu-1804-msan/build/mysql-test/var/4/log/rename_trigger.log
 - skipping '/buildbot/amd64-ubuntu-1804-msan/build/mysql-test/var/4/log/atomic.rename_trigger-innodb/'
***Warnings generated in error logs during shutdown after running tests: atomic.rename_trigger
2022-02-07  0:56:06 0 [ERROR] InnoDB: Tablespace 5 was not found at ./test/t5.ibd.
2022-02-07  0:56:06 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace.
2022-02-07  0:56:06 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1445] with error Tablespace not found

Comment by Marko Mäkelä [ 2022-02-07 ]

Another possibly related failure:

encryption.innodb-redo-badkey 'cbc,innodb,strict_crc32' w4 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2022-02-06 22:48:09
line
2022-02-06 22:48:06 0 [ERROR] InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before FILE_CHECKPOINT for tablespace 5
2022-02-06 22:48:07 0 [ERROR] InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before FILE_CHECKPOINT for tablespace 5

Comment by Marko Mäkelä [ 2022-03-08 ]

I tried this on 10.8 (after MDEV-14425), but it did not fail:

diff --git a/storage/innobase/fil/fil0fil.cc b/storage/innobase/fil/fil0fil.cc
index 0273fcd9a03..8f02ff51242 100644
--- a/storage/innobase/fil/fil0fil.cc
+++ b/storage/innobase/fil/fil0fil.cc
@@ -1857,6 +1857,7 @@ static bool fil_rename_tablespace(uint32_t id, const char *old_path,
 	ut_ad(strchr(new_file_name, '/'));
 
 	if (!recv_recovery_is_on()) {
+		log_make_checkpoint();
 		log_sys.latch.wr_lock(SRW_LOCK_CALL);
 	}
 

10.8 e8a2a70cf869cf32a50144704646a5f34e8d7be9 (patched)

innodb.instant_alter_crash 'innodb'      [ pass ]   2079
innodb.alter_crash 'innodb'              [ pass ]   2489
atomic.rename_trigger 'innodb'           [ pass ]  33901

Also, the crash injection test case recovered fine. Does this still fail on the latest 10.6?

Comment by Marko Mäkelä [ 2022-03-08 ]

After I ran the crash injection test on 10.7 af87186c1d3f9eeab73bc38b1faa51ba17fa8017, the var/mysqld.1/data/test only contained db.opt, t1.frm, t1.ibd.

Edit: With the patch, the recovery fails. (But not on 10.8 with the patch.)

10.7 af87186c1d3f9eeab73bc38b1faa51ba17fa8017 (patched)

mysqltest: At line 20: query 'show create table t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB

Comment by Marko Mäkelä [ 2022-03-08 ]

10.8 with MDEV-14425 but right before MDEV-27774 fails to recover:

10.8 8c7c92adf384b2d0294502d5b38684e68d5eaf27 (patched)

mysqltest: At line 20: query 'show create table t1' failed: ER_GET_ERRNO (1030): Got error 194 "Tablespace is missing for a table" from storage engine InnoDB

Comment by Thirunarayanan Balathandayuthapani [ 2022-03-08 ]

Problem exists on 10.6.2 also. The below commit is the reason for the issue:

[~/mariarepo/10.6/10.6-sample]$git log -p -1 1bd681c8b3c
commit 1bd681c8b3c5213ce1f7976940a7dc38b48a0d39
Author: Marko Mäkelä <marko.makela@mariadb.com>
Date:   Wed Jun 9 17:02:55 2021 +0300
 
    MDEV-25506 (3 of 3): Do not delete .ibd files before commit

Comment by Thirunarayanan Balathandayuthapani [ 2022-03-09 ]

trx_resurrect_table_locks() fails to load the table because sys table entries has deleted marked record for it. It leads to failure of loading of tables during
recovery.

Comment by Marko Mäkelä [ 2022-03-09 ]

Thank you. I see that the following code in dict_load_table_on_id() makes an incorrect assumption:

	if (page_rec_is_user_rec(rec)) {
		/*---------------------------------------------------*/
		/* Now we have the record in the secondary index
		containing the table ID and NAME */
check_rec:
		field = rec_get_nth_field_old(
			rec, DICT_FLD__SYS_TABLE_IDS__ID, &len);
		ut_ad(len == 8);
 
		/* Check if the table id in record is the one searched for */
		if (table_id == mach_read_from_8(field)) {
			if (rec_get_deleted_flag(rec, 0)) {
				/* Until purge has completed, there
				may be delete-marked duplicate records
				for the same SYS_TABLES.ID, but different
				SYS_TABLES.NAME. */
				while (btr_pcur_move_to_next(&pcur, &mtr)) {
					rec = btr_pcur_get_rec(&pcur);
 
					if (page_rec_is_user_rec(rec)) {
						goto check_rec;
					}
				}

This code is essentially implementing READ UNCOMMITTED while we need READ COMMITTED here. It is wrong to filter out delete-marked secondary index records (records of SYS_TABLES.ID). The secondary index indeed only contains delete-marked records for this SYS_TABLES.ID=18: (ID,NAME)=(18,'test/#sql-ib19'),(18,'test/t1'). In the clustered index, I see a delete-marked record for the former and a non-delete-marked record for 'test/t1', with the wrong ID=19 for the most recent (uncommitted) version. Both records contain DB_TRX_ID=25.

Implementing proper READ COMMITTED might be easiest by allocating a transaction with a read view, and adapting some logic from row_search_mvcc() for accessing the dictionary tables. The functions row_sel_sec_rec_is_for_clust_rec() and row_vers_build_for_consistent_read() have to be part of the solution.

Comment by Marko Mäkelä [ 2022-03-09 ]

I think that we can implement "poor man’s MVCC" for the secondary index scan like this:

diff --git a/storage/innobase/dict/dict0load.cc b/storage/innobase/dict/dict0load.cc
index 71dedd48c29..88351b06201 100644
--- a/storage/innobase/dict/dict0load.cc
+++ b/storage/innobase/dict/dict0load.cc
@@ -2536,11 +2536,16 @@ dict_load_table_on_id(
 
 		/* Check if the table id in record is the one searched for */
 		if (table_id == mach_read_from_8(field)) {
-			if (rec_get_deleted_flag(rec, 0)) {
-				/* Until purge has completed, there
-				may be delete-marked duplicate records
-				for the same SYS_TABLES.ID, but different
-				SYS_TABLES.NAME. */
+			field = rec_get_nth_field_old(rec,
+				DICT_FLD__SYS_TABLE_IDS__NAME, &len);
+			table = dict_sys.load_table(
+				{reinterpret_cast<const char*>(field),
+				 len}, ignore_err);
+			if (table && table->id != table_id) {
+				ut_ad(rec_get_deleted_flag(rec, 0));
+				table = nullptr;
+			}
+			if (!table) {
 				while (btr_pcur_move_to_next(&pcur, &mtr)) {
 					rec = btr_pcur_get_rec(&pcur);
 
@@ -2548,13 +2553,6 @@ dict_load_table_on_id(
 						goto check_rec;
 					}
 				}
-			} else {
-				/* Now we get the table name from the record */
-				field = rec_get_nth_field_old(rec,
-					DICT_FLD__SYS_TABLE_IDS__NAME, &len);
-				table = dict_sys.load_table(
-					{reinterpret_cast<const char*>(field),
-					 len}, ignore_err);
 			}
 		}
 	}

For the first matching record, dict_sys.load_table() would find nothing, because it is still performing READ UNCOMMITTED and that record is delete-marked. For the second one, it returns a table with the wrong id=19 because it fails to fetch the latest committed version.

Edit: I was thinking to use purge_sys.view for the MVCC read, but I realized that READ COMMITTED semantics would require the newest possible read view instead of the oldest one.

Comment by Marko Mäkelä [ 2022-03-09 ]

It may be much easier to roll back a recovered DDL transaction before resurrecting the table locks, by simply continuing to use READ UNCOMMITTED, which thanks to the added rollback would be equivalent to READ COMMITTED.

If we implement READ COMMITTED, it will have to be implemented for accessing all dictionary tables. For example, if both the table and some indexed columns or base columns of virtual columns were renamed in a recovered incomplete DDL transaction, from each dictionary table we must fetch the correct names.

Comment by Marko Mäkelä [ 2022-03-09 ]

It turns out that we do have to implement the READ COMMITTED access to the data dictionary. Even if we defer trx_resurrect_table_locks() after the trx_rollback_active(false) call that would roll back the DDL transaction, that very rollback would fail in its attempt to load the table definition.

Comment by Marko Mäkelä [ 2022-03-10 ]

Here is a test that halts the commit and then forces a checkpoint so that the FILE_ record recovery will not interfere with the processing of the undo logs and the data dictionary tables:

--source include/have_innodb.inc
--source include/have_debug.inc
--source include/have_debug_sync.inc
 
CREATE TABLE t1 (a INT NOT NULL) ENGINE=InnoDB STATS_PERSISTENT=0;
 
connect ddl,localhost,root;
SET DEBUG_SYNC='after_trx_committed_in_memory SIGNAL stuck WAIT_FOR ever EXECUTE 2';
send ALTER TABLE t1 ADD PRIMARY KEY(a);
 
connection default;
SET DEBUG_SYNC='now WAIT_FOR stuck';
SET DEBUG_SYNC='now SIGNAL ever';
SET DEBUG_SYNC='now WAIT_FOR stuck';
 
SET GLOBAL innodb_log_checkpoint_now=ON;
 
--let $shutdown_timeout=0
--source include/restart_mysqld.inc
 
disconnect ddl;
 
SHOW CREATE TABLE t1;
SELECT * FROM t1;
DROP TABLE t1;

The test actually passes (with SHOW CREATE TABLE wrongly indicating that the table includes a PRIMARY KEY), and an orphan table will be left behind:

10.9 35fcae10408aa4c54967d118c8ccde1c3736b85f

MTR's internal check of the test case 'main.MDEV-27234' failed.
--- /dev/shm/10.9/mysql-test/var/tmp/check-mysqld_1.result	2022-03-10 10:53:59.764275437 +0200
+++ /dev/shm/10.9/mysql-test/var/tmp/check-mysqld_1.reject	2022-03-10 10:54:00.352281258 +0200
@@ -1092,3 +1092,4 @@
 partition	1.0	DISABLED	STORAGE ENGINE	100900.0	NULL	NULL	Mikael Ronstrom, MySQL AB	Partition Storage Engine Helper	GPL	OFF	Stable	1.0
 VARIABLE_NAME	VARIABLE_VALUE
 DEBUG_SYNC	ON - current signal: ''
+#sql-ib19.ibd
 
mysqltest: Result length mismatch

On 10.6 36a19f94ce102be87b22fdc7ad147ab7d5ed4331 the test behaves in exactly the same way.

Comment by Marko Mäkelä [ 2022-03-10 ]

The only dictionary tables that contain any secondary indexes are SYS_TABLES and SYS_FOREIGN. The function dict_load_foreigns() does access the secondary index on SYS_FOREIGN.FOR_NAME, in the READ UNCOMMITTED fashion. We must not filter out delete-marked records there, and we must make dict_load_foreign() retrieve the last committed version, or return a status that no committed record existed.

Every access to the clustered indexes must be corrected in a similar way, so that we will load the latest committed version (READ COMMITTED).

I expect that the function trx_t::evict_table() can be removed as part of fixing this bug.

Comment by Marko Mäkelä [ 2022-03-15 ]

data.tar.xz is a data directory with innodb_page_size=64k that might demonstrate this bug. On accessing the table t3 (on which DDL was in progress when the server was killed), the following mismatch would be reported:

10.6 dafc5fb9c17637c6233e8a73c890f158d645a37c

2022-03-15 16:16:19 3 [ERROR] InnoDB could not find key no 0 with name Marvão_uidx1 from dict cache for table test/t3

Comment by Marko Mäkelä [ 2022-03-21 ]

Some adjustment may be needed for recovering instant ALTER TABLE metadata (MDEV-11369 and MDEV-15562). Currently, the rec_get_offsets() call in btr_cur_instant_init_low() assumes that dict_index_t::n_fields corresponds to the READ UNCOMMITTED version of the table definition. An assertion would fail during the first restart of the test innodb.instant_alter_crash.

Comment by Marko Mäkelä [ 2022-03-21 ]

The trx_t::evict_table() is necessary for evicting table definitions on the rollback of DDL transactions, for example, on a CREATE TABLE that failed due to incorrect FOREIGN KEY constraints.

My current work-in-progress fix still shows the index mismatch message for data.tar.xz:

2022-03-21 17:11:23 3 [ERROR] Cannot find index Marvão_uidx1 in InnoDB index dictionary.
2022-03-21 17:11:23 3 [ERROR] InnoDB indexes are inconsistent with what defined in .frm for table ./test/t3
2022-03-21 17:11:23 3 [ERROR] InnoDB could not find key no 0 with name Marvão_uidx1 from dict cache for table test/t3
2022-03-21 17:11:23 3 [ERROR] InnoDB: Table test/t3 contains 3 indexes inside InnoDB, which is different from the number of indexes 3 defined in the .frm file. See https://mariadb.com/kb/en/innodb-troubleshooting/

Comment by Marko Mäkelä [ 2022-03-22 ]

I spent a lot of time on the test case before I confirmed with rr replay that before the server is killed, the ADD PRIMARY KEY was actually committed. The actual reason why the #sql-ib file was not dropped is that the fil_delete_tablespace() call in row_purge_remove_clust_if_poss_low() did not find the tablespace, because dict_check_sys_tables() was skipping delete-marked records of SYS_TABLES.

My work-in-progress fix of that broke a number of mtr tests (mostly IMPORT TABLESPACE) and still fails to recover data.tar.xz correctly.

Comment by Marko Mäkelä [ 2022-03-23 ]

The data.tar.xz cannot be helped by this fix. The warning message mentions an unexpectedly missing index. In fact, in dict_load_index_low(), a delete-marked record of SYS_INDEXES for it was encountered and correctly filtered out in my development branch, because no active transaction for the DB_TRX_ID was found. We would need an rr replay trace of such a failure to understand why the DDL recovery apparently chose the wrong .frm file when recovering from a crash during DROP INDEX Marvão_uidx1.

Comment by Marko Mäkelä [ 2022-03-24 ]

I was able to fix the recovery of data.tar.xz. It turned out that there was a SYS_INDEXES record for an ADD INDEX stub that had been deleted by a committed transaction. The table->def_trx_id was wrongly updated to the DB_TRX_ID of that record, and InnoDB wrongly informed the DDL log recovery that the ALTER TABLE transaction had been committed inside InnoDB. Yes, we must pay attention to the DB_TRX_ID of committed delete-marked SYS_INDEXES records, but only if the SYS_INDEXES.NAME does not identify the index as incomplete (by starting with a 0xff byte).

For the crash in the recovery of the test innodb.instant_alter_crash I got an idea: btr_cur_instant_init_low() should perform a READ COMMITTED of the instant ADD COLUMN metadata record. In the case of the crash, that record had been inserted by an uncommitted transaction.

Comment by Matthias Leich [ 2022-03-24 ]

origin/bb-10.6-MDEV-27234 13dfdd3953c076177f8b61355cb70a887a958b4b 2022-03-24T12:13:54+02:00
behaved well in RQG testing (battery for broad range coverage only).

Generated at Thu Feb 08 09:51:22 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.