After some more consideration, I do not think that it is safe to fix this in GA versions (before MariaDB 10.3).
There are a few problems, and there exists a solution that can avoid orphan #sql tables in most cases. By implementing transactional logging inside InnoDB, we will be able to guarantee that the InnoDB data dictionary and file system will be in sync. Some mechanism outside InnoDB is needed to remove mismatch between .frm files and InnoDB, especially in CREATE, DROP, and RENAME operations.
The current state of Data Definition Language Operations in InnoDB
The highlights of DDL operations in InnoDB are as follows:
- When a record is inserted in the internal SYS_INDEXES, an index tree will be created.
- When a SYS_INDEXES record is delete-marked, the correpsonding index tree will be freed, already before transaction commit. So, it is not really possible to roll back.
- The undo log header can store one trx_t::table_id. On recovery, the identified table will be dropped. This allows a crash-safe implementation of a single CREATE/DROP TABLE per transaction.
- RENAME TABLE is not crash-safe. If the rename operation was performed in the file system but the transaction was not committed before InnoDB was killed and restarted, the change to the SYS_TABLES.NAME will be rolled back, but the rename in the file system will remain in effect.
- If a transaction involves creating, dropping or renaming multiple InnoDB tables (say, a partitioned table or a table with fulltext indexes), internally there may be multiple transactions, and it is not really crash-safe.
- For ALTER TABLE…ADD INDEX…ALGORITHM=INPLACE we will create ‘index stubs’ such that SYS_INDEXES.NAME starts with the invalid UTF-8 sequence 0xff. This name prefix is also used for crash-safe DROP INDEX.
Transactional logging of Data Definition Language Operations
To implement transactional Data Manipulation Language operations, InnoDB implements transactional logging. Each transaction has a log, called the undo log. Actually, the name 'undo log' is slightly misleading, because the log not only facilitates ROLLBACK, but it also covers actions that have to be performed after COMMIT (the purge of no-longer-needed history).
There is a clear analogy to DDL operations. Similar to DELETE, DROP TABLE or TRUNCATE TABLE should not actually remove any data before the transaction is committed. Only after transaction commit, the data can be removed. (Because the data dictionary essentially follows the READ COMMITTED isolation, that is, there is no multi-versioning, the data can be removed immediately after commit.)
The goal is to allow arbitrary transactional multi-table DDL operations inside InnoDB. This goal can be achieved by introducing and writing undo log records to facilitate the following:
- CREATE file: on rollback, delete the file if it exists
- CREATE index (outside CREATE file): on rollback, free the index tree if the root page matches
- DELETE file, DELETE index: similar to CREATE, but the action is executed after commit only
- RENAME file: on rollback, rename the file back if it had already been renamed
- TRUNCATE index: after commit, truncate the index tree identified by the root page number
- BULK-LOAD index: on rollback, truncate the index tree identified by the root page number
This kind of logging allows any combination of DDL operations within an InnoDB ACID transaction. It also contains a provision to implement MDEV-515 bulk insert and MDEV-13564 backup-friendly TRUNCATE TABLE.
For now, we may keep the 'index stubs' for ADD INDEX/DROP INDEX. They must be removed when we remove the InnoDB data dictionary tables (MDEV-11655).
ALTER TABLE…ALGORITHM=INPLACE
Inside InnoDB, we make a distinction whether this operation is rebuilding the table or not. If yes, then after MDEV-14378, MariaDB will rename the old table to temporary name #sql-ib before replacing the old table.
With the above-mentioned logging in place, the #sql-ib file will not be visible after InnoDB recovery, and the operation can be observed as follows:
- Create a #sql….frm file for the altered table definition.
- Invoke prepare_inplace_alter_table(), inplace_alter_table(), commit_inplace_alter_table() inside the storage engine.
- Replace the tablename.frm with the #sql….frm file.
If the server is killed before all the above steps complete, there are two possibilities:
- A likely incomplete (corrupted) table #sql… will exist. (It is safe for InnoDB to drop the table, so let us assume that only the #sql….frm file will exist after InnoDB recovery.)
- The operation was committed inside InnoDB. #sql….frm exists, and should be renamed to tablename.frm. Otherwise, tablename.frm will be out of sync with InnoDB, and bad things (including crash or corruption) can occur if the table is accessed.
ALTER TABLE…ALGORITHM=COPY
This is divided into the following steps:
- Create table #sql…. (First create the .frm file, then create the table inside InnoDB.) COMMIT.
- Copy the data from tablename to #sql…. COMMIT.
- Rename tablename to #sql2…. COMMIT.
- Rename #sql… to tablename. COMMIT.
- Drop #sql2… COMMIT.
If we (possibly naïvely) assume that the file system operations are crash-safe (with appropriate fsync() or similar in place, also for all metadata), then it seems that we can have the following crash points:
- Table #sql… exists, and the copying was not completed. InnoDB recovery can remove it; something outside InnoDB (a log file or DBA) would cause the .frm file to be removed.
- Crash during rename to #sql2…: First we rename the .frm file, then call handler::rename_table(). On crash recovery, we can have #sql2….frm but still tablename.ibd. It is still safe to drop #sql… inside InnoDB. But something outside InnoDB will have to rename #sql2… back to tablename.
- Rename from #sql… to tablename: After crash, we can have tablename.frm and both #sql….ibd and #sql2….ibd, and no tablename.ibd.
3. Drop #sql2…: After crash, we can have #sql2….frm but no #sql2….ibd. The .frm file can be deleted outside InnoDB.
So, it is not safe for InnoDB to drop any #sql2… tables, but #sql and #sql-ib are safe to drop, provided that the renaming inside InnoDB works safely.
We should try to implement the RENAMEs as a single transaction. In that way, after InnoDB recovery we would always have a tablename.ibd. For automatic recovery, we would still need an appropriate logging and recovery mechanism for renaming the .frm files.
RENAME TABLE
Inside InnoDB, we will implement RENAME as follows:
- Write an undo log record for rolling back the rename operation.
- Write a MLOG_FILE_RENAME2 record for rolling forward the rename operation.
- Rename the file in the file system, and make the change durable.
On transaction rollback (which can be before or after InnoDB restart), we will do the following provided that the renaming was not already rolled back:
- Write a MLOG_FILE_RENAME2 record for rolling back the rename operation.
- Rename the file in the file system, and make the change durable.
Note: If executing the RENAME operations fails at rollback, we are pretty much out of luck. Hopefully it is likely that if the RENAME originally was possible, it should also be possible to execute the operation in the reverse direction. An I/O error or a change of file system permissions could break this assumption.
CREATE TABLE and DROP TABLE
For these, a recovery mechanism (drop an incompletely created table) must be provided outside InnoDB. Currently, it is possible to have a failure where a .frm file exists but the table does not exist inside the storage engine.
InnoDB will delete the files on rollback of CREATE TABLE, or after the commit of DROP TABLE. If the deletion fails for some reason (such as missing file system permissions), orphan files will be left behind.
I think that we must be careful here.
These particular names seem to originate from ALTER TABLE…ALGORITHM=COPY. The #sql tables whose names do not start with #sql-ib or #sql2 are probably safe to remove. The #sql2 prefix is what ALGORITHM=COPY is renaming the original table to, right before renaming the altered table from #sql name to the user-specified name. We do not want to risk losing both copies of the table. I will have to carefully review the code to see if it is possible that InnoDB can recover two #sql tables if ALGORITHM=COPY crashes at the right moment.
Similarly, I would be careful with tables whose names start with #sql-ib (used by table-rebuilding `ALTER TABLE…ALGORITHM=INPLACE`). Note that after
MDEV-14378(in MariaDB 10.3.3), the #sql-ib name prefix will only be used for renaming the original table, right before replacing the original table with the rebuilt one.