[MDEV-25852] Orphan #sql*.ibd files are occasionally left behind after killed ALTER TABLE Created: 2021-06-03  Updated: 2021-06-14  Resolved: 2021-06-09

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Storage Engine - InnoDB
Affects Version/s: 10.6
Fix Version/s: 10.6.2

Type: Bug Priority: Blocker
Reporter: Marko Mäkelä Assignee: Thirunarayanan Balathandayuthapani
Resolution: Fixed Votes: 0
Labels: recovery

Attachments: File data_copy.tar.bz2     File dd-t2.tar.gz     File dd.tar.gz    
Issue Links:
Relates
relates to MDEV-25909 Unnecessary calls to fil_ibd_load() s... Open
relates to MDEV-24626 Remove synchronous write of page0 and... Closed
relates to MDEV-25506 Atomic DDL: .frm file is removed and ... Closed

 Description   

Even with the third fix related to MDEV-25506 (to never delete .ibd files before the DDL operation has been committed), we still have occasional result difference of the following test, which is based on innodb_fts.crash_recovery,release:

--source include/have_innodb.inc
--source include/not_embedded.inc
FLUSH TABLES;
 
--connect(ddl1, localhost, root,,)
CREATE TABLE t1(a TEXT,b TEXT,INDEX(a(5))) ENGINE=InnoDB;
send ALTER TABLE t1 ADD INDEX(b(5));
--connect(ddl2, localhost, root,,)
CREATE TABLE t2(a TEXT,b TEXT,INDEX(a(5))) ENGINE=InnoDB;
send ALTER TABLE t2 DROP INDEX a, ADD INDEX(b(5)), FORCE;
--connect(ddl3, localhost, root,,)
CREATE TABLE t3(a TEXT,b TEXT,INDEX(a(5))) ENGINE=InnoDB;
send ALTER TABLE t3 DROP INDEX a, ADD INDEX(b(5)), ALGORITHM=COPY;
 
--connection default
let $shutdown_timeout=0;
--source include/shutdown_mysqld.inc
--exec tar czf $datadir/../dd.tar.gz -C $datadir .
--source include/start_mysqld.inc
 
disconnect ddl1;
disconnect ddl2;
disconnect ddl3;
 
DROP TABLE t1,t2,t3;
 
SELECT * FROM information_schema.innodb_sys_tables WHERE name LIKE 'test/%';
let $datadir=`select @@datadir`;
list_files $datadir/test *.ibd;

Occasionally, the list_files will list a file #sql-ib*.ibd or #sql-backup-*.ibd. If that last line is commented out, the test will continue, and the discrepancy will be reported by check-testcase (which does not count as a failure for mtr). Here are a couple of examples:

--- /dev/shm/10.6/mysql-test/var/13/tmp/check-mysqld_1.result	2021-06-03 16:38:44.877394351 +0300
+++ /dev/shm/10.6/mysql-test/var/13/tmp/check-mysqld_1.reject	2021-06-03 16:38:45.473401026 +0300
@@ -1097,3 +1097,4 @@
 partition	1.0	DISABLED	STORAGE ENGINE	100602.0	NULL	NULL	Mikael Ronstrom, MySQL AB	Partition Storage Engine Helper	GPL	OFF	Stable	1.0
 VARIABLE_NAME	VARIABLE_VALUE
 DEBUG_SYNC	ON - current signal: ''
+#sql-ib168.ibd
 
mysqltest: Result length mismatch

and

--- /dev/shm/10.6/mysql-test/var/25/tmp/check-mysqld_1.result	2021-06-03 16:38:31.953249791 +0300
+++ /dev/shm/10.6/mysql-test/var/25/tmp/check-mysqld_1.reject	2021-06-03 16:38:32.509256004 +0300
@@ -1097,3 +1097,4 @@
 partition	1.0	DISABLED	STORAGE ENGINE	100602.0	NULL	NULL	Mikael Ronstrom, MySQL AB	Partition Storage Engine Helper	GPL	OFF	Stable	1.0
 VARIABLE_NAME	VARIABLE_VALUE
 DEBUG_SYNC	ON - current signal: ''
+#sql-backup-387caf-a.ibd
 
mysqltest: Result length mismatch

Nothing else is failing. Attached is a copy of a data directory where we end up with an orphan file. The following patch would fix that, but as collateral damage, we would fail in a different way, by prematurely deleting a file t2.ibd during the test:

innodb_fts.cr 'innodb'                   w2 [ fail ]  Found warnings/errors in server log file!
        Test ended at 2021-06-03 17:53:36
line
2021-06-03 17:53:36 3 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2021-06-03 17:53:36 3 [ERROR] InnoDB: The error means the system cannot find the path specified.
2021-06-03 17:53:36 3 [ERROR] InnoDB: File ./test/t2.ibd: 'delete' returned OS error 71.

Here is my incorrect patch:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 12ec2e29bfb..21f9ea86acc 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -597,6 +597,8 @@ static struct
     lsn_t lsn;
     /** File name from the FILE_ record */
     std::string file_name;
+    /** whether a FILE_DELETE record was encountered */
+    mutable bool deleted;
   };
 
   using map= std::map<const uint32_t, item, std::less<const uint32_t>,
@@ -638,7 +640,7 @@ static struct
 
     char *fil_path= fil_make_filepath(nullptr, {filename, strlen(filename)},
                                       IBD, false);
-    const item defer= {lsn, fil_path};
+    const item defer= {lsn, fil_path, false};
     auto p= defers.emplace(space, defer);
     if (!p.second && p.first->second.lsn <= defer.lsn)
       p.first->second= defer;
@@ -1034,9 +1036,10 @@ fil_name_process(char* name, ulint len, ulint space_id,
 
 	if (deleted) {
 		/* Got FILE_DELETE */
-
-		deferred_spaces.remove(
-			static_cast<uint32_t>(space_id));
+		if (auto d = deferred_spaces.find(static_cast<uint32_t>(
+							  space_id))) {
+			d->deleted = true;
+		}
 		if (!p.second && f.status != file_name_t::DELETED) {
 			f.status = file_name_t::DELETED;
 			if (f.space != NULL) {



 Comments   
Comment by Marko Mäkelä [ 2021-06-03 ]

dd.tar.gz is a data directory for which the orphan file would remain if you do not apply my patch to the bb-10.6-MDEV-25506 branch.
dd-t2.tar.gz is a data directory where recovery would wrongly delete the file test/t2.ibd if you apply my patch.

Comment by Marko Mäkelä [ 2021-06-03 ]

I debugged dd-t2.tar.gz with the patch a little further.- We delete the wrong file here:

#3  fil_delete_tablespace (id=id@entry=6)
    at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:1699
#4  0x000055d4764adc9a in row_purge_remove_clust_if_poss_low (
    node=node@entry=0x55d478d32db8, mode=<optimized out>, mode@entry=2)
    at /mariadb/10.6/storage/innobase/row/row0purge.cc:188
#5  0x000055d4764a8494 in row_purge_remove_clust_if_poss (node=0x55d478d32db8)
    at /mariadb/10.6/storage/innobase/row/row0purge.cc:275
#6  row_purge_del_mark (node=0x55d478d32db8)
    at /mariadb/10.6/storage/innobase/row/row0purge.cc:734

That file name belongs to another tablespace, with id=7:

(rr) p *fil_system.spaces.array@fil_system.spaces.n_cells
$3 = {{node = 0x0} <repeats 6376 times>, {node = 0x55d478db5468}, {node = 0x0}, {node = 0x55d478d0c9e8}, {node = 0x55d478d0c7e8}, {node = 0x55d478d0c5e8}, {node = 0x0}, {node = 0x55d478d78b08}, {
    node = 0x55d478d785e8}, {node = 0x55d478d78828}, {node = 0x55d478d1faf8}, {node = 0x0} <repeats 15558 times>, {node = 0x55d478d74198}, {node = 0x0} <repeats 28822 times>}
(rr) p ((fil_space_t*)0x55d478db5468).chain.start.name
$4 = 0x55d478d0c448 "./test/t3.ibd"
(rr) p ((fil_space_t*)0x55d478d0c9e8).chain.start.name
$5 = 0x55d478d0c398 "./test/t2.ibd"
(rr) p ((fil_space_t*)0x55d478d0c9e8).id
$6 = 7

The incorrect name for tablespace 6 was assigned here:

#0  fil_space_t::add (this=0x55d478d0c7e8, name=0x55d478d0a5d8 "./test/t2.ibd", handle={m_file = -1, m_psi = <synthetic pointer>}, size=7, is_raw=false, atomic_write=<optimized out>, max_pages=4294967295)
    at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:312
#1  0x000055d4763bdb6b in $_0::create (it=<optimized out>, name=<optimized out>, flags=<optimized out>, crypt_data=<optimized out>, size=7) at /mariadb/10.6/storage/innobase/log/log0recv.cc:733
#2  recv_sys_t::recover_deferred (this=<optimized out>, p=<optimized out>, name=<optimized out>, free_block=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0recv.cc:781
#3  0x000055d4763cbb20 in $_0::reinit_all (this=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0recv.cc:700
#4  recv_recovery_from_checkpoint_start (flush_lsn=<optimized out>) at /mariadb/10.6/storage/innobase/log/log0recv.cc:4065

I think that we must make the deferred_spaces aware of FILE_RENAME records. I made an attempt at that, and managed to fix both dd.tar.gz and dd-t2.tar.gz, but the above test case would occasionally end up with orphan files, and at lest 10 existing regression tests would fail:

Failing test(s): encryption.innodb-encryption-alter encryption.innodb_encrypt_freed innodb.innodb-index innodb.innodb-alter-tempfile parts.debug_innodb_crash innodb.truncate_crash innodb.innodb-blob innodb.instant_alter_crash atomic.create_table innodb.innodb_bulk_create_index_debug

Comment by Marko Mäkelä [ 2021-06-04 ]

thiru, thank you, your refinement of my fix does the trick.

There is also a race condition in the test. We must wait for purge, so that any #sql-ib.ibd files for dropped tables will be removed before the test ends. With that fixed, a debug build of the server test failed due to an unrelated message:

innodb_fts.cr 'innodb'                   w5 [ 631 fail ]  Found warnings/errors in server log file!
        Test ended at 2021-06-04 17:12:17
line
2021-06-04 17:12:16 0 [Warning] InnoDB: Failed to load the clustered index for table `test`.`#sql-ib3120` because of TABLE_ID mismatch. Refusing to load the rest of the indexes (if any) and the whole table altogether.
2021-06-04 17:12:16 0 [Warning] InnoDB: Failed to load table `test`.`#sql-ib3120`:Data structure corruption

The release build of the same completed successfully:

innodb_fts.cr 'innodb'                   w3 [ 1000 pass ]   1480
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 55298.590 of 2408 seconds executing testcases
 
Completed: All 27000 tests were successful.

I plan to repeat another variant of the test, with FULLTEXT INDEX instead of a normal INDEX, so that more files will be created and deleted. But I think that this problem is solved now, and I will take care of pushing the fix. Thank you for finishing it!

Comment by Marko Mäkelä [ 2021-06-04 ]

The following worked too, for 27000 runs of the release build. However, if I replace either --echo with send, it would occasionally report orphan FTS_*.ibd files for the common tables (but no orphan tables). That should be fixed by MDEV-25850.

--source include/have_innodb.inc
--source include/not_embedded.inc
FLUSH TABLES;
let $datadir=`select @@datadir`;
 
--connect(ddl1, localhost, root,,)
CREATE TABLE t1(a TEXT,b TEXT,FULLTEXT INDEX(a)) ENGINE=InnoDB;
send ALTER TABLE t1 ADD INDEX(b(5));
--connect(ddl2, localhost, root,,)
CREATE TABLE t2(a TEXT,b TEXT,FULLTEXT INDEX(a)) ENGINE=InnoDB;
--echo ALTER TABLE t2 DROP INDEX a, ADD FULLTEXT INDEX(b), FORCE;
--connect(ddl3, localhost, root,,)
CREATE TABLE t3(a TEXT,b TEXT,FULLTEXT INDEX(a)) ENGINE=InnoDB;
--echo ALTER TABLE t3 DROP INDEX a, ADD FULLTEXT INDEX(b), ALGORITHM=COPY;
 
--connection default
let $shutdown_timeout=0;
--source include/shutdown_mysqld.inc
#--exec tar czf $datadir/../dd.tar.gz -C $datadir .
--source include/start_mysqld.inc
 
# Ensure that the history list length will actually be decremented by purge.
SET @saved_frequency = @@GLOBAL.innodb_purge_rseg_truncate_frequency;
SET GLOBAL innodb_purge_rseg_truncate_frequency = 1;
source ../../innodb/include/wait_all_purged.inc;
SET GLOBAL innodb_purge_rseg_truncate_frequency = @saved_frequency;
 
disconnect ddl1;
disconnect ddl2;
disconnect ddl3;
 
DROP TABLE t1,t2,t3;
 
SELECT * FROM information_schema.innodb_sys_tables WHERE name LIKE 'test/%';
list_files $datadir/test *.ibd;

Comment by Marko Mäkelä [ 2021-06-07 ]

Unfortunately, this still needs more work. data_copy.tar.bz2 fails to recover if the current fix is present:

2021-06-07 14:02:30 0 [ERROR] InnoDB: Tablespace 402 was not found at ./test/#sql-alter-26a6cc-2f.ibd.
2021-06-07 14:02:30 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace.
2021-06-07 14:02:30 0 [ERROR] InnoDB: Tablespace 404 was not found at ./test/#sql-alter-26a6cc-2f.ibd.
2021-06-07 14:02:30 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1433] with error Tablespace not found

Furthermore, I started to think that it is useless to open the same files over and over again:

		case FIL_LOAD_DEFER:
			/** Skip the deferred spaces
			when lsn is already processed */
			if (*store != store_t::STORE_IF_EXISTS) {
				deferred_spaces.add(
					static_cast<uint32_t>(space_id),
					name, lsn);
			}
			break;

We should already have filtered out duplicated name at this point. Currently, we are paying the overhead of several system calls in fil_ibd_load(). If some file t.ibd did not contain a valid tablespace ID, it will not contain it on a subsequent call either, because we are checking the consistency of the files before we are applying any changes, right?

Maybe we should have a mapping from file names to tablespace IDs. If the first page of the file was not correctly written before recovery, then we would store 0 as the tablespace ID for that file name.

Last, I do not see why Datafile::read_first_page() has to be so complicated. We could simply attempt to read innodb_page_size bytes once. If the file is too short or the data is all-zero, we will add it to deferred_spaces. If its FSP_SPACE_FLAGS do not match innodb_page_size we will reject the file (normally, refuse recovery). Finally, on checksum mismatch we might also simply add the file to deferred_spaces. Page 0 will never be reinitialized, and when it is initialized during tablespace creation, also all other pages will be reinitialized. So it should be of no help to try to find the tablespace ID in subsequent pages of the file.

Generated at Thu Feb 08 09:40:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.