Status: Open (View Workflow)
Fix Version/s: None
MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic DROP DATABASE). They are not transactional yet, e.g., CREATE TABLE or ALTER TABLE may unnecessarily use multiple transactions (MDEV-25921).
The .frm files can be preserved for backward compatibility (tables could be copied by copying data and .frm files).
The data dictionary can be kept in the .frm files, which can be made crash-safe and transactional by adding some transactional logging.
We can create a ‘cache’ of the .frm files in a persistent tables table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing INFORMATION_SCHEMA queries.
To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make .ibd files self-contained and store a metadata record with the following information:
- DB_TRX_ID,DB_ROLL_PTR to facilitate rollback
- Secondary index IDs and root page numbers
- Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID
This proposal is based on the following assumptions:
- The data dictionary is stored in a crash-safe storage engine (such as ENGINE=Aria), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV-11657.
(MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
- MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
- InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
(In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index (primary key) starts at page 3.)
Any metadata that can be changed by DDL operations only.
Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (
MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.
Note 1: tables only stores persistent metadata about tables and sequences.
CREATE TEMPORARY TABLE would not write anything there, but CREATE TABLE…ENGINE=MEMORY would, if the table definition (unlike the contents) is supposed to survive server restart.
Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
can identify them with the _ prefix in tables.definition. For
example, if no explicit ROW_FORMAT is specified, InnoDB will store
_ROW_FORMAT with the current innodb_default_row_format. (If we do not
store the actual format options in tables.definition, we will
lose an opportunity to perform a consistency check between .ibd files
and tables when opening a table definition.)
Note 1.2: In accordance with MDEV-16417, tables that are referencing the current table via FOREIGN KEY constraints would be identified in the .frm file and the tables.frm column. (The referencing tables’ REFERENCES clauses would be present in both tables.description and tables.frm.)
Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in mysql.tables.definition and the .frm file.
For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
Likewise for InnoDB, INDEX DIRECTORY will be removed.
For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
For any storage engine, an invalid ROW_FORMAT would be replaced with
the actually used value.
Note 1.4: For partitioned tables, there will be a single entry in tables.
Note 1.5: For additional safety and portability, whenever a change to
a tables record is committed or when FLUSH TABLES is issued, we
will update or create the affected or related
datadir/schemaname/tablename.frm files. The files can continue
to be used to move tables to other server instances.
Note 1.6: On a table definition cache miss in handler::ha_open(), both
tables and the file system will be consulted for .frm or .par
files. If no record is found, a record will be inserted into
tables based on the .frm file contents. This allows the user to
replace tables by copying files. If the .frm file is found valid but different
from tables.frm, then the old records for the table by this name
will be deleted from tables and a new
record will be inserted into mysql.tables.
Note 1.7: It should be possible to import InnoDB tables or partitions
across server instances using the same file-copying method, with no
special IMPORT syntax. To speed up the operation, a server GUID can be
embedded in the .ibd files to distinguish ‘foreign’ files that may
need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
Note 1.8: If any table or columns attached to FOREIGN KEY…REFERENCES are renamed, the
tables.definition and tables.frm of all affected tables must be updated within
the same transaction.
In the MySQL 8.0 Global Data Dictionary, there will likely be a
special Data Dictionary Tablespace that will contain all data
dictionary tables. There is one hard-coded data dictionary table:
On startup, the server will read the contents of this table and refuse
to start up unless the stored version_id is the current
version. With a special start-up option, a future version of the
server can upgrade from a supported earlier version. In the process,
it would likely create a new data dictionary tablespace and copy and
convert all metadata from the old-version data dictionary.
In MariaDB, we would probably have to hard-code the definition of the tables table, and compare it to tables.frm if it exists. An entry for tables itself could be omitted from the tables table.
MariaDB or InnoDB should ideally not apply any ‘create if not exists’
logic on any files. A normal startup will be refused if some of the
predefined files do not exist.
On a special startup option --initialize, MariaDB will create a new
instance. In this mode, the specified --datadir must be empty.
We can initialize a new server instance in the following order:
- the redo log(s) (possibly including dummy files for preventing startup on older versions)
- (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
- (the tables table)
- fsync() all files
A startup attempt on an older MariaDB would fail because the (possibly
dummy) redo log or binlog files are in incompatible format.
We must have enough code to support SHOW CREATE TABLE of all old tables.
InnoDB may refuse startup for upgrade in the following cases:
- the redo log is not clean
- any tables exist in the InnoDB system tablespace
- any tables contain secondary indexes not from page number 4 onwards
- the change buffer is not empty
- the undo logs are not empty
In these cases, the user would have to address the problem in the older
On upgrade, once MDEV-11633 is completed, MariaDB would abandon
(optionally destroy) the InnoDB system tablespace files and any undo
tablespace files. We would no longer create such files.
Upgrade would create and populate the tables tables based on the old data dictionary contents.
If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:
- the upgrade will be refused due to a failed consistency or compatibility check
- the upgrade completes
- the server is killed and must be restarted with upgrade
Do we need to check if InnoDB data files use the same ROW_FORMAT as is stored in the .frm file or in the mysql.tables entry? Probably not; we should only require that the .ibd file is self-consistent.
If we implement tables, how to ensure consistency with .frm files?
- If users add files? Silently copy from .frm to tables. Alternative: Require an extra IMPORT command to be executed, to facilitate auditing.
- If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until DROP TABLE has been executed.
- If users replace files? Detect when loading a table definition to the cache:
Optionally, compare stat("name.frm") to create_time, LENGTH(frm).
UPDATE tables SET frm=… from the file name.frm if the file looks valid.
FLUSH TABLES will force the next handler::open() to check .frm
Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)*(child|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.
How to find the FOREIGN KEY children? REFERENCES clauses in SHOW CREATE TABLE point to parents only. As noted in MDEV-16417, we can extend the frm information to additionally store the names of referencing tables.
Certain operations must know the FOREIGN KEY child/parent graph.
RENAME TABLE, DROP TABLE, ALTER TABLE, TRUNCATE TABLE must either be prohibited, or the renaming of columns or tables must be cascaded.
INSERT, UPDATE, DELETE, REPLACE must be prohibited or cascaded based on the parent and child tables.
SELECT, ROLLBACK, COMMIT can continue to ignore FOREIGN KEY constraints.
MDEV-22361 would implement native, cross-engine FOREIGN KEY processing in the SQL layer. This would make ON…CASCADE and ON…SET NULL] fire TRIGGER, at last. This would also make FOREIGN KEY work on partitioned tables (MDEV-12483).
MySQL 8.0 aims to replace .frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.
The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:
- SQL statement (SHOW CREATE TABLE)
- Metadata decomposed into tables,columns,indexes,index_column_usage,…
- Not fully decomposed: some SQL expressions are stored as text
- Not fully decomposed: dd::Properties is a string of key-value pairs
- JSON-based Serialized Dictionary Information (SDI)
- SDI in .ibd files for import/export & disaster recovery (WL#7053, WL#7066)
- Additional InnoDB files for import/export; may be replaced by the above two
The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See Faster INFORMATION_SCHEMA.
The decomposition has the drawback of overhead. Using the terms popularized by Mark Callaghan, it is causing read amplification, storage amplification, and write amplification.
Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a .frm file.
The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)
For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the .SDI file and the WL#7053 embedded metadata inside the .ibd files.
In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.