Details

    • New Feature
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None

    Description

      With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic DROP DATABASE, MDEV-25292 Atomic CREATE OR REPLACE, MDEV-27180 Atomic partitioning). They are not transactional yet, e.g., CREATE TABLE or ALTER TABLE may unnecessarily use multiple transactions (MDEV-25921).

      Proposal for Global Data Dictionary in MariaDB

      The .frm files can be preserved for backward compatibility (tables could be copied by copying data and .frm files).
      The data dictionary can be kept in the .frm files, which can be made crash-safe and transactional by adding some transactional logging.

      We can create a ‘cache’ of the .frm files in a persistent tables table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing INFORMATION_SCHEMA queries.

      To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make .ibd files self-contained and store a metadata record with the following information:

      • DB_TRX_ID,DB_ROLL_PTR to facilitate rollback
      • Secondary index IDs and root page numbers
      • Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

      Prerequisites and Features

      This proposal is based on the following assumptions:

      1. The data dictionary is stored in a crash-safe storage engine (such as ENGINE=Aria), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV-11657.
        (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
      2. MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
      3. InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
        (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index (primary key) starts at page 3.)

      Definitions

      Static metadata

      Any metadata that can be changed by DDL operations only.

      Dynamic metadata:

      Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

      Data dictionary table definition

      /* Table definitions. */
      CREATE TABLE tables
      (
        table_schema VARCHAR(64) NOT NULL,
        table_name VARCHAR(64) NOT NULL,
        create_time DATETIME NOT NULL DEFAULT now,
        definition TEXT NOT NULL, /* without table or sequence name */
        frm BLOB NOT NULL,
        PRIMARY KEY(table_schema, table_name)
      );
       
      /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
      INSERT INTO tables (table_schema, table_name, definition, frm)
      VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
              <frm file contents>);
      

      Note 1: tables only stores persistent metadata about tables and sequences.
      CREATE TEMPORARY TABLE would not write anything there, but CREATE TABLE…ENGINE=MEMORY would, if the table definition (unlike the contents) is supposed to survive server restart.

      Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
      can identify them with the _ prefix in tables.definition. For
      example, if no explicit ROW_FORMAT is specified, InnoDB will store
      _ROW_FORMAT with the current innodb_default_row_format. (If we do not
      store the actual format options in tables.definition, we will
      lose an opportunity to perform a consistency check between .ibd files
      and tables when opening a table definition.)

      Note 1.2: In accordance with MDEV-16417, tables that are referencing the current table via FOREIGN KEY constraints would be identified in the .frm file and the tables.frm column. (The referencing tables’ REFERENCES clauses would be present in both tables.description and tables.frm.)

      Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in mysql.tables.definition and the .frm file.
      For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
      Likewise for InnoDB, INDEX DIRECTORY will be removed.
      For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
      unless ROW_FORMAT=COMPRESSED.
      For any storage engine, an invalid ROW_FORMAT would be replaced with
      the actually used value.

      Note 1.4: For partitioned tables, there will be a single entry in tables.

      Note 1.5: For additional safety and portability, whenever a change to
      a tables record is committed or when FLUSH TABLES is issued, we
      will update or create the affected or related
      datadir/schemaname/tablename.frm files. The files can continue
      to be used to move tables to other server instances.

      Note 1.6: On a table definition cache miss in handler::ha_open(), both
      tables and the file system will be consulted for .frm or .par
      files. If no record is found, a record will be inserted into
      tables based on the .frm file contents. This allows the user to
      replace tables by copying files. If the .frm file is found valid but different
      from tables.frm, then the old records for the table by this name
      will be deleted from tables and a new
      record will be inserted into mysql.tables.

      Note 1.7: It should be possible to import InnoDB tables or partitions
      across server instances using the same file-copying method, with no
      special IMPORT syntax. To speed up the operation, a server GUID can be
      embedded in the .ibd files to distinguish ‘foreign’ files that may
      need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
      delete-marked records).

      Note 1.8: If any table or columns attached to FOREIGN KEY…REFERENCES are renamed, the
      tables.definition and tables.frm of all affected tables must be updated within
      the same transaction.

      Bootstrap, Startup and Upgrading

      In the MySQL 8.0 Global Data Dictionary, there will likely be a
      special Data Dictionary Tablespace that will contain all data
      dictionary tables. There is one hard-coded data dictionary table:

      CREATE TABLE mysql.version SELECT 1 version_id;
      

      On startup, the server will read the contents of this table and refuse
      to start up unless the stored version_id is the current
      version. With a special start-up option, a future version of the
      server can upgrade from a supported earlier version. In the process,
      it would likely create a new data dictionary tablespace and copy and
      convert all metadata from the old-version data dictionary.

      In MariaDB, we would probably have to hard-code the definition of the tables table, and compare it to tables.frm if it exists. An entry for tables itself could be omitted from the tables table.

      Bootstrap (--initialize)

      MariaDB or InnoDB should ideally not apply any ‘create if not exists’
      logic on any files. A normal startup will be refused if some of the
      predefined files do not exist.

      On a special startup option --initialize, MariaDB will create a new
      instance. In this mode, the specified --datadir must be empty.

      We can initialize a new server instance in the following order:

      1. the redo log(s) (possibly including dummy files for preventing startup on older versions)
      2. (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
      3. (the tables table)
      4. fsync() all files

      A startup attempt on an older MariaDB would fail because the (possibly
      dummy) redo log or binlog files are in incompatible format.

      Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

      We must have enough code to support SHOW CREATE TABLE of all old tables.
      InnoDB may refuse startup for upgrade in the following cases:

      * the redo log is not clean
      * any tables exist in the InnoDB system tablespace
      * any tables contain secondary indexes not from page number 4 onwards
      * the change buffer is not empty
      * the undo logs are not empty

      In these cases, the user would have to address the problem in the older
      server version:

      SET GLOBAL innodb_file_per_table=1;
      OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
      SET GLOBAL innodb_fast_shutdown=0;
      /* shut down the server */
      

      On upgrade, once MDEV-11633 is completed, MariaDB would abandon
      (optionally destroy) the InnoDB system tablespace files and any undo
      tablespace files. We would no longer create such files.

      Upgrade would create and populate the tables tables based on the old data dictionary contents.

      If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

      • the upgrade will be refused due to a failed consistency or compatibility check
      • the upgrade completes
      • the server is killed and must be restarted with upgrade

      Notes on Consistency Checks

      Do we need to check if InnoDB data files use the same ROW_FORMAT as is stored in the .frm file or in the mysql.tables entry? Probably not; we should only require that the .ibd file is self-consistent.

      If we implement tables, how to ensure consistency with .frm files?

      1. If users add files? Silently copy from .frm to tables. Alternative: Require an extra IMPORT command to be executed, to facilitate auditing.
      2. If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until DROP TABLE has been executed.
      3. If users replace files? Detect when loading a table definition to the cache:
        Optionally, compare stat("name.frm") to create_time, LENGTH(frm).
        UPDATE tables SET frm=… from the file name.frm if the file looks valid.
        FLUSH TABLES will force the next handler::open() to check .frm

      Notes on FOREIGN KEY

      Discovering the FOREIGN KEY Relationships

      Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)*(child|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

      How to find the FOREIGN KEY children? REFERENCES clauses in SHOW CREATE TABLE point to parents only. As noted in MDEV-16417, we can extend the frm information to additionally store the names of referencing tables.

      Cleaning up the FOREIGN KEY implementation

      Certain operations must know the FOREIGN KEY child/parent graph.
      RENAME TABLE, DROP TABLE, ALTER TABLE, TRUNCATE TABLE must either be prohibited, or the renaming of columns or tables must be cascaded.

      INSERT, UPDATE, DELETE, REPLACE must be prohibited or cascaded based on the parent and child tables.

      SELECT, ROLLBACK, COMMIT can continue to ignore FOREIGN KEY constraints.

      MDEV-22361 would implement native, cross-engine FOREIGN KEY processing in the SQL layer. This would make ON…CASCADE and ON…SET NULL] fire TRIGGER, at last. This would also make FOREIGN KEY work on partitioned tables (MDEV-12483).

      Comparison to the Global Data Dictionary in MySQL 8.0

      MySQL 8.0 aims to replace .frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

      The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

      1. SQL statement (SHOW CREATE TABLE)
      2. Metadata decomposed into tables,columns,indexes,index_column_usage,…
      • Not fully decomposed: some SQL expressions are stored as text
      • Not fully decomposed: dd::Properties is a string of key-value pairs
      1. JSON-based Serialized Dictionary Information (SDI)
      2. SDI in .ibd files for import/export & disaster recovery (WL#7053, WL#7066)
      3. Additional InnoDB files for import/export; may be replaced by the above two

      Normalized and serialized storage

      The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See Faster INFORMATION_SCHEMA.

      The decomposition has the drawback of overhead. Using the terms popularized by Mark Callaghan, it is causing read amplification, storage amplification, and write amplification.

      Read amplification

      Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a .frm file.

      Storage amplification

      The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

      Write amplification

      For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the .SDI file and the WL#7053 embedded metadata inside the .ibd files.

      In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            Description InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (*.frm files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace *.frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

            In MySQL 8.0.0, the *.frm files have been replaced, but the InnoDB data dictionary was retained. DDL is not yet crash-safe, because the changes to the Global Data Dictionary tables are committed separately from the changes to the InnoDB data dictionary tables.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: dd::Properties is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in *.ibd files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a *.frm file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the *.SDI file and the WL#7053 embedded metadata inside the *.ibd files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of awkward binary file formats related to data dictionary and undo logging. There already is a well-defined serialized representation of metadata: the CREATE TABLE statement in SQL, or the output of SHOW CREATE TABLE. The *.frm files can be preserved for backward compatibility (tables could be copied by copying data and *.frm files).

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # There will be a common redo log for and all crash-safe storage engines. Alternatively, mysql.tables will have to be created in InnoDB, and crash-safe DDL for other tables than InnoDB is only possible by using the XA 2PC mechanism via the binlog.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index is at page 3.)

            h2. Definitions

            Let us introduce a CREATE TABLE attribute ROLLBACK=NO for disabling undo logging on a persistent (crash-safe) table. The semantics is that any modification to the table data *may* be persisted between the time the modification is issued and the transaction is committed. Pending modifications may be explicitly persisted by the storage engine. Upon transaction commit, all modifications will have been persisted.

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:SQL}
            CREATE TABLE mysql.engines
            (
              engine_id TINYINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
              name VARCHAR(64) NOT NULL UNIQUE
            );
            {code}
            The table mysql.engines would be initialized when the server is initialized or upgraded. Because it will be very seldomly modified, it could be created in any persistent storage engine. Refer to the section on initialization, startup and upgrade below.

            {code:SQL}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_id SERIAL,
              engine_id TINYINT UNSIGNED NOT NULL REFERENCES engines(engine_id),
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NULL,
              UNIQUE INDEX(table_schema, table_name)
            );

            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_table_id BIGINT UNSIGNED NOT NULL REFERENCES tables(table_id),
              child_table_id BIGINT UNSIGNED NOT NULL REFERENCES tables(table_id),
              PRIMARY KEY(parent_table_id, child_table_id)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables VALUES (3, 2, 'test', 't1',
            'TABLE(a INT) _ROW_FORMAT=DYNAMIC', <frm file contents>);
            {code}

            Note 1: mysql.tables only stores persistent metadata about tables
            and sequences.
            CREATE TEMPORARY TABLE would not write anything there, but CREATE
            TABLE…ENGINE=MEMORY would, if the table definition (but not necessarily
            the contents) is supposed to survive server restart.

            Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between *.ibd files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will
            be removed or replaced both in tables.definition, tables.frm and in .frm files.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in
            mysql.tables. For backward compatibility and easy import/export,
            multiple *.frm files will continue to be created, one for each
            partition or subpartition.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            datadir/schemaname/tablename*.frm files. The *.frm files can continue
            to be used to move tables or partitions to other server instances.

            Note 1.6: On a table definition cache miss in handler::ha_open(), both
            mysql.tables and the file system will be consulted for .frm or .par
            files. If no record is found, a record will be inserted into
            mysql.tables based on the .frm file contents. This allows the user to
            replace tables by copying files. Because .frm or .par files contain no
            FOREIGN KEY metadata, nothing will be inserted into
            mysql.tables_children. If the .frm file is found valid but different
            from mysql.tables.frm, then the old records for the table by this name
            will be deleted from mysql.tables and mysql.tables_children and a new
            record will be inserted into mysql.tables.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the *.ibd files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:SQL}
            /* Crash recovery information. (For InnoDB: Which tables or indexes to
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED,
              engine_id TINYINT UNSIGNED REFERENCES engines(engine_id),
              recovery_data BLOB NOT NULL,
              PRIMARY KEY(sequence_number, engine_id)
            );
            {code}

            The mysql.ddl_log.recovery_data could be decomposed further
            into type,filename1,filename2,blob.
            In InnoDB, the required operations are:
            * (type=DELETE_FILE, space_id, filename)
            * (type=ROLLBACK_RENAME_FILE, space_id, old_name, new_name)
            * (type=DROP_INDEX, space_id, page_number, index_id)
            The attributes space_id, page_number, index_id, are private to InnoDB,
            and they would be stored in the blob in an InnoDB-defined format.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB will no longer apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # the mysql.engines table using a temporary name (say, mysql.engines0)
            # the mysql.tables table (note: not mysql.engines yet!)
            # the mysql.tables_children table
            # the mysql.ddl_log table
            # the MDEV-XXXXX mysql.transactions and mysql.transaction_logs tables
            # fsync() all files
            # finally, rename the file(s) for mysql.engines (no undo or redo logging!)

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            If the server is killed before the renaming of the data file for
            mysql.engines is persisted, a subsequent server startup without
            --initialize would be refused, because the data file does not exist.
            A subsequent --initialize could detect this situation (by ensuring
            that the --datadir does not contain any other files than those that
            --initialize is suposed to create) and restart after deleting all
            files.

            h3. Normal startup in MariaDB 10.4

            MariaDB probably should not simply assume that the mysql.tables
            always has the same format in all subsequent versions of MariaDB. In
            the above proposal, mysql.tables would refer to an mysql.engines
            table. The proposal is to store a version identifier pseudo-record:
            {code:SQL}
            INSERT INTO engines VALUES (0,'MariaDB 10.4');
            {code}
            On startup, we would hard-code the table definition without actually
            creating a table, and then try to open the table. Any supported
            storage engine would be valid for storing mysql.engines. Startup
            would be refused if files from multiple storage engines are found,
            say, $datadir/mysql/engines.MYI and $datadir/mysql/engines.ibd.

            If the record for mysql.engines.engine_id identifies a supported
            signature, we will continue startup. Based on the signature, we will
            hard-code the definition of mysql.tables (or some other table that
            might replace it in a future version of MariaDB). And again, we would
            ensure that files for exactly one storage engine are found for this table.

            In summary, MariaDB 10.4 would only support metadata for the tag
            'MariaDB 10.4', and it would be hard-coded as follows:
            {code:SQL}
            /* Hard-coded like this in all MariaDB versions starting from 10.4 */
            CREATE TABLE engines
            (
              engine_id TINYINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
              name VARCHAR(64) NOT NULL UNIQUE
            );

            /* Hard-coded like this if 'MariaDB 10.4' is returned by
            SELECT name FROM mysql.engines WHERE engine_id=0 */
            CREATE TABLE tables
            (
              table_id SERIAL,
              engine_id TINYINT UNSIGNED NOT NULL REFERENCES engines(engine_id),
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL,
              dynamic_metadata TEXT NULL,
              UNIQUE INDEX(table_schema, table_name)
            );
            {code}
            The metadata for all other persistent tables would be stored in
            mysql.tables (or whatever set of tables might replace it in some
            future MariaDB version). In a later version, we are free to change the
            definition of mysql.tables or use a different set of tables instead of
            it, given that we introduce a new tag in mysql.engines.

            For the sake of completeness, we might consider storing the metadata
            for mysql.engines and mysql.tables in mysql.tables as well. But
            this kind of redundancy could lead to inconsistencies if some code
            actually expects to find the metadata in the tables, and the metadata
            has somehow been updated to be inconsistent with the hard-coded
            definitions.

            h3. After Global Data Dictionary (from MariaDB 10.4)

            For upgrade between Global Data Dictionary versions, there should be
            an explicit startup option for upgrading. For example, if MariaDB 10.5
            changes the format of mysql.tables, it would normally refuse startup
            if it encounters a 'MariaDB 10.4' tag in mysql.engines.

            Only if a special --upgrade option is specified and if the server
            version recognizes the tagged data dictionary version (for example,
            'MariaDB 10.4'), an upgrade would be performed. This upgrade would
            implement the functionality of the currently external upgrade
            scripts. In this way, we can guarantee that an upgrade is always
            performed correctly.

            If the version that we are upgrading to supports multi-table DDL
            transactions, the upgrade should be atomic and crash-safe out of the
            box (assuming that the undo log format does not change). Otherwise, we
            may need some special logic to implement the upgrade in an idempotent
            way (ALTER TABLE if not already altered).

            Once the data dictionary and the system tables have been upgraded to a
            newer version, there is no easy way to go back.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support SHOW CREATE TABLE of all old
            tables. InnoDB would refuse startup for upgrade in the following
            cases:

             * the redo log is not clean (use the WL#8845 code from MySQL 8.0.0)
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:SQL}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the mysql.engines and
            mysql.tables tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be theoretically
            possible to downgrade to an older MariaDB, provided that there were no
            InnoDB DDL operations and there is a slow shutdown (the undo and redo
            logs are empty). But, perhaps we should not even try to allow this,
            because in subsequent upgrades, it could be more tricky to downgrade.
            Maybe we should just educate users that the new special option
            --upgrade is a ‘point of no return’. After invoking --upgrade, the
            only supported outcomes are the following:

            * the --upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with --upgrade
            InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (*.frm files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace *.frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

            In MySQL 8.0.0, the *.frm files have been replaced, but the InnoDB data dictionary was retained. DDL is not yet crash-safe, because the changes to the Global Data Dictionary tables are committed separately from the changes to the InnoDB data dictionary tables.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: dd::Properties is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in *.ibd files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a *.frm file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the *.SDI file and the WL#7053 embedded metadata inside the *.ibd files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of awkward binary file formats related to data dictionary and undo logging. There already is a well-defined serialized representation of metadata: the CREATE TABLE statement in SQL, or the output of SHOW CREATE TABLE. The *.frm files can be preserved for backward compatibility (tables could be copied by copying data and *.frm files).

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # There will be a common redo log for and all crash-safe storage engines. Alternatively, mysql.tables will have to be created in InnoDB, and crash-safe DDL for other tables than InnoDB is only possible by using the XA 2PC mechanism via the binlog.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index is at page 3.)

            h2. Definitions

            Let us introduce a CREATE TABLE attribute ROLLBACK=NO for disabling undo logging on a persistent (crash-safe) table. The semantics is that any modification to the table data *may* be persisted between the time the modification is issued and the transaction is committed. Pending modifications may be explicitly persisted by the storage engine. Upon transaction commit, all modifications will have been persisted.

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:SQL}
            CREATE TABLE mysql.engines
            (
              engine_id TINYINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
              name VARCHAR(64) NOT NULL UNIQUE
            );
            {code}
            The table mysql.engines would be initialized when the server is initialized or upgraded. Because it will be very seldomly modified, it could be created in any persistent storage engine. Refer to the section on initialization, startup and upgrade below.

            {code:SQL}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_id SERIAL,
              engine_id TINYINT UNSIGNED NOT NULL REFERENCES engines(engine_id),
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NULL,
              UNIQUE INDEX(table_schema, table_name)
            );

            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_table_id BIGINT UNSIGNED NOT NULL REFERENCES tables(table_id),
              child_table_id BIGINT UNSIGNED NOT NULL REFERENCES tables(table_id),
              PRIMARY KEY(parent_table_id, child_table_id)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables VALUES (3, 2, 'test', 't1',
            'TABLE(a INT) _ROW_FORMAT=DYNAMIC', <frm file contents>);
            {code}

            Note 1: mysql.tables only stores persistent metadata about tables
            and sequences.
            CREATE TEMPORARY TABLE would not write anything there, but CREATE
            TABLE…ENGINE=MEMORY would, if the table definition (but not necessarily
            the contents) is supposed to survive server restart.

            Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between *.ibd files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will
            be removed or replaced both in tables.definition, tables.frm and in .frm files.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in
            mysql.tables. For backward compatibility and easy import/export,
            multiple *.frm files will continue to be created, one for each
            partition or subpartition.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            datadir/schemaname/tablename*.frm files. The *.frm files can continue
            to be used to move tables or partitions to other server instances.

            Note 1.6: On a table definition cache miss in handler::ha_open(), both
            mysql.tables and the file system will be consulted for .frm or .par
            files. If no record is found, a record will be inserted into
            mysql.tables based on the .frm file contents. This allows the user to
            replace tables by copying files. Because .frm or .par files contain no
            FOREIGN KEY metadata, nothing will be inserted into
            mysql.tables_children. If the .frm file is found valid but different
            from mysql.tables.frm, then the old records for the table by this name
            will be deleted from mysql.tables and mysql.tables_children and a new
            record will be inserted into mysql.tables.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the *.ibd files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:SQL}
            /* Crash recovery information. (For InnoDB: Which tables or indexes to
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED,
              engine_id TINYINT UNSIGNED REFERENCES engines(engine_id),
              recovery_data BLOB NOT NULL,
              PRIMARY KEY(sequence_number, engine_id)
            );
            {code}

            The mysql.ddl_log.recovery_data could be decomposed further
            into type,filename1,filename2,blob.
            In InnoDB, the required operations are:
            * (type=DELETE_FILE, space_id, filename)
            * (type=ROLLBACK_RENAME_FILE, space_id, old_name, new_name)
            * (type=DROP_INDEX, space_id, page_number, index_id)
            The attributes space_id, page_number, index_id, are private to InnoDB,
            and they would be stored in the blob in an InnoDB-defined format.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB will no longer apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # the mysql.engines table using a temporary name (say, mysql.engines0)
            # the mysql.tables table (note: not mysql.engines yet!)
            # the mysql.tables_children table
            # the mysql.ddl_log table
            # the MDEV-11657 mysql.transactions and mysql.transaction_logs tables
            # fsync() all files
            # finally, rename the file(s) for mysql.engines (no undo or redo logging!)

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            If the server is killed before the renaming of the data file for
            mysql.engines is persisted, a subsequent server startup without
            --initialize would be refused, because the data file does not exist.
            A subsequent --initialize could detect this situation (by ensuring
            that the --datadir does not contain any other files than those that
            --initialize is suposed to create) and restart after deleting all
            files.

            h3. Normal startup in MariaDB 10.4

            MariaDB probably should not simply assume that the mysql.tables
            always has the same format in all subsequent versions of MariaDB. In
            the above proposal, mysql.tables would refer to an mysql.engines
            table. The proposal is to store a version identifier pseudo-record:
            {code:SQL}
            INSERT INTO engines VALUES (0,'MariaDB 10.4');
            {code}
            On startup, we would hard-code the table definition without actually
            creating a table, and then try to open the table. Any supported
            storage engine would be valid for storing mysql.engines. Startup
            would be refused if files from multiple storage engines are found,
            say, $datadir/mysql/engines.MYI and $datadir/mysql/engines.ibd.

            If the record for mysql.engines.engine_id identifies a supported
            signature, we will continue startup. Based on the signature, we will
            hard-code the definition of mysql.tables (or some other table that
            might replace it in a future version of MariaDB). And again, we would
            ensure that files for exactly one storage engine are found for this table.

            In summary, MariaDB 10.4 would only support metadata for the tag
            'MariaDB 10.4', and it would be hard-coded as follows:
            {code:SQL}
            /* Hard-coded like this in all MariaDB versions starting from 10.4 */
            CREATE TABLE engines
            (
              engine_id TINYINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
              name VARCHAR(64) NOT NULL UNIQUE
            );

            /* Hard-coded like this if 'MariaDB 10.4' is returned by
            SELECT name FROM mysql.engines WHERE engine_id=0 */
            CREATE TABLE tables
            (
              table_id SERIAL,
              engine_id TINYINT UNSIGNED NOT NULL REFERENCES engines(engine_id),
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL,
              dynamic_metadata TEXT NULL,
              UNIQUE INDEX(table_schema, table_name)
            );
            {code}
            The metadata for all other persistent tables would be stored in
            mysql.tables (or whatever set of tables might replace it in some
            future MariaDB version). In a later version, we are free to change the
            definition of mysql.tables or use a different set of tables instead of
            it, given that we introduce a new tag in mysql.engines.

            For the sake of completeness, we might consider storing the metadata
            for mysql.engines and mysql.tables in mysql.tables as well. But
            this kind of redundancy could lead to inconsistencies if some code
            actually expects to find the metadata in the tables, and the metadata
            has somehow been updated to be inconsistent with the hard-coded
            definitions.

            h3. After Global Data Dictionary (from MariaDB 10.4)

            For upgrade between Global Data Dictionary versions, there should be
            an explicit startup option for upgrading. For example, if MariaDB 10.5
            changes the format of mysql.tables, it would normally refuse startup
            if it encounters a 'MariaDB 10.4' tag in mysql.engines.

            Only if a special --upgrade option is specified and if the server
            version recognizes the tagged data dictionary version (for example,
            'MariaDB 10.4'), an upgrade would be performed. This upgrade would
            implement the functionality of the currently external upgrade
            scripts. In this way, we can guarantee that an upgrade is always
            performed correctly.

            If the version that we are upgrading to supports multi-table DDL
            transactions, the upgrade should be atomic and crash-safe out of the
            box (assuming that the undo log format does not change). Otherwise, we
            may need some special logic to implement the upgrade in an idempotent
            way (ALTER TABLE if not already altered).

            Once the data dictionary and the system tables have been upgraded to a
            newer version, there is no easy way to go back.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support SHOW CREATE TABLE of all old
            tables. InnoDB would refuse startup for upgrade in the following
            cases:

             * the redo log is not clean (use the WL#8845 code from MySQL 8.0.0)
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:SQL}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the mysql.engines and
            mysql.tables tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be theoretically
            possible to downgrade to an older MariaDB, provided that there were no
            InnoDB DDL operations and there is a slow shutdown (the undo and redo
            logs are empty). But, perhaps we should not even try to allow this,
            because in subsequent upgrades, it could be more tricky to downgrade.
            Maybe we should just educate users that the new special option
            --upgrade is a ‘point of no return’. After invoking --upgrade, the
            only supported outcomes are the following:

            * the --upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with --upgrade
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            will.bryant Will Bryant added a comment -

            Storing the entire table definition in serialized format is an interesting idea.

            It would be good to think ahead to how it would support making DDL changes not only transactional but also deferred (as opposed to rewriting the table completely).

            For example, both PostgreSQL and Amazon's Aurora fork of MySQL/InnoDB support adding a nullable column at the end of the table without rewriting the entire table, which is a very useful special case because this is the most common DDL change for many large production environments and this saves a huge amount of IO and migration time.

            Comparing it to the route that MySQL and PostgreSQL went, because they have one row per table column it is possibly easier to implement extensions such as adding a table definition version number to each row (ie. the first version of the table structure that had this column added), which they could then compare to a number stored on each row or each page.

            How would you approach storing this metadata if a similar feature was added to MariaDB after 10.4? Would you introduce new attributes with _ prefixes?

            will.bryant Will Bryant added a comment - Storing the entire table definition in serialized format is an interesting idea. It would be good to think ahead to how it would support making DDL changes not only transactional but also deferred (as opposed to rewriting the table completely). For example, both PostgreSQL and Amazon's Aurora fork of MySQL/InnoDB support adding a nullable column at the end of the table without rewriting the entire table, which is a very useful special case because this is the most common DDL change for many large production environments and this saves a huge amount of IO and migration time. Comparing it to the route that MySQL and PostgreSQL went, because they have one row per table column it is possibly easier to implement extensions such as adding a table definition version number to each row (ie. the first version of the table structure that had this column added), which they could then compare to a number stored on each row or each page. How would you approach storing this metadata if a similar feature was added to MariaDB after 10.4? Would you introduce new attributes with _ prefixes?

            will.bryant, MDEV-11369 supports instant ADD COLUMN (even with BLOB NOT NULL DEFAULT 'some value' or TIMESTAMP NOT NULL DEFAULT NOW) in MariaDB 10.3.
            That did not involve any change of the data dictionary. The default values (as they were evaluated during ALTER TABLE) are stored in a special 'default row' record inside the clustered index.
            I think that also the follow-up task MDEV-11424 is technically doable without changing the format of the data dictionary, by storing the necessary metadata inside the data file itself. Yes, that would involve storing some kind of dictionary version inside the clustered index. I do not think that we should complicate the high-level dictionary with that.

            marko Marko Mäkelä added a comment - will.bryant , MDEV-11369 supports instant ADD COLUMN (even with BLOB NOT NULL DEFAULT 'some value' or TIMESTAMP NOT NULL DEFAULT NOW) in MariaDB 10.3. That did not involve any change of the data dictionary. The default values (as they were evaluated during ALTER TABLE) are stored in a special 'default row' record inside the clustered index. I think that also the follow-up task MDEV-11424 is technically doable without changing the format of the data dictionary, by storing the necessary metadata inside the data file itself. Yes, that would involve storing some kind of dictionary version inside the clustered index. I do not think that we should complicate the high-level dictionary with that.
            marko Marko Mäkelä made changes -
            Description InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (*.frm files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace *.frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

            In MySQL 8.0.0, the *.frm files have been replaced, but the InnoDB data dictionary was retained. DDL is not yet crash-safe, because the changes to the Global Data Dictionary tables are committed separately from the changes to the InnoDB data dictionary tables.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: dd::Properties is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in *.ibd files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a *.frm file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the *.SDI file and the WL#7053 embedded metadata inside the *.ibd files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of awkward binary file formats related to data dictionary and undo logging. There already is a well-defined serialized representation of metadata: the CREATE TABLE statement in SQL, or the output of SHOW CREATE TABLE. The *.frm files can be preserved for backward compatibility (tables could be copied by copying data and *.frm files).

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # There will be a common redo log for and all crash-safe storage engines. Alternatively, mysql.tables will have to be created in InnoDB, and crash-safe DDL for other tables than InnoDB is only possible by using the XA 2PC mechanism via the binlog.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index is at page 3.)

            h2. Definitions

            Let us introduce a CREATE TABLE attribute ROLLBACK=NO for disabling undo logging on a persistent (crash-safe) table. The semantics is that any modification to the table data *may* be persisted between the time the modification is issued and the transaction is committed. Pending modifications may be explicitly persisted by the storage engine. Upon transaction commit, all modifications will have been persisted.

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:SQL}
            CREATE TABLE mysql.engines
            (
              engine_id TINYINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
              name VARCHAR(64) NOT NULL UNIQUE
            );
            {code}
            The table mysql.engines would be initialized when the server is initialized or upgraded. Because it will be very seldomly modified, it could be created in any persistent storage engine. Refer to the section on initialization, startup and upgrade below.

            {code:SQL}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_id SERIAL,
              engine_id TINYINT UNSIGNED NOT NULL REFERENCES engines(engine_id),
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NULL,
              UNIQUE INDEX(table_schema, table_name)
            );

            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_table_id BIGINT UNSIGNED NOT NULL REFERENCES tables(table_id),
              child_table_id BIGINT UNSIGNED NOT NULL REFERENCES tables(table_id),
              PRIMARY KEY(parent_table_id, child_table_id)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables VALUES (3, 2, 'test', 't1',
            'TABLE(a INT) _ROW_FORMAT=DYNAMIC', <frm file contents>);
            {code}

            Note 1: mysql.tables only stores persistent metadata about tables
            and sequences.
            CREATE TEMPORARY TABLE would not write anything there, but CREATE
            TABLE…ENGINE=MEMORY would, if the table definition (but not necessarily
            the contents) is supposed to survive server restart.

            Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between *.ibd files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will
            be removed or replaced both in tables.definition, tables.frm and in .frm files.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in
            mysql.tables. For backward compatibility and easy import/export,
            multiple *.frm files will continue to be created, one for each
            partition or subpartition.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            datadir/schemaname/tablename*.frm files. The *.frm files can continue
            to be used to move tables or partitions to other server instances.

            Note 1.6: On a table definition cache miss in handler::ha_open(), both
            mysql.tables and the file system will be consulted for .frm or .par
            files. If no record is found, a record will be inserted into
            mysql.tables based on the .frm file contents. This allows the user to
            replace tables by copying files. Because .frm or .par files contain no
            FOREIGN KEY metadata, nothing will be inserted into
            mysql.tables_children. If the .frm file is found valid but different
            from mysql.tables.frm, then the old records for the table by this name
            will be deleted from mysql.tables and mysql.tables_children and a new
            record will be inserted into mysql.tables.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the *.ibd files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:SQL}
            /* Crash recovery information. (For InnoDB: Which tables or indexes to
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED,
              engine_id TINYINT UNSIGNED REFERENCES engines(engine_id),
              recovery_data BLOB NOT NULL,
              PRIMARY KEY(sequence_number, engine_id)
            );
            {code}

            The mysql.ddl_log.recovery_data could be decomposed further
            into type,filename1,filename2,blob.
            In InnoDB, the required operations are:
            * (type=DELETE_FILE, space_id, filename)
            * (type=ROLLBACK_RENAME_FILE, space_id, old_name, new_name)
            * (type=DROP_INDEX, space_id, page_number, index_id)
            The attributes space_id, page_number, index_id, are private to InnoDB,
            and they would be stored in the blob in an InnoDB-defined format.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB will no longer apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # the mysql.engines table using a temporary name (say, mysql.engines0)
            # the mysql.tables table (note: not mysql.engines yet!)
            # the mysql.tables_children table
            # the mysql.ddl_log table
            # the MDEV-11657 mysql.transactions and mysql.transaction_logs tables
            # fsync() all files
            # finally, rename the file(s) for mysql.engines (no undo or redo logging!)

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            If the server is killed before the renaming of the data file for
            mysql.engines is persisted, a subsequent server startup without
            --initialize would be refused, because the data file does not exist.
            A subsequent --initialize could detect this situation (by ensuring
            that the --datadir does not contain any other files than those that
            --initialize is suposed to create) and restart after deleting all
            files.

            h3. Normal startup in MariaDB 10.4

            MariaDB probably should not simply assume that the mysql.tables
            always has the same format in all subsequent versions of MariaDB. In
            the above proposal, mysql.tables would refer to an mysql.engines
            table. The proposal is to store a version identifier pseudo-record:
            {code:SQL}
            INSERT INTO engines VALUES (0,'MariaDB 10.4');
            {code}
            On startup, we would hard-code the table definition without actually
            creating a table, and then try to open the table. Any supported
            storage engine would be valid for storing mysql.engines. Startup
            would be refused if files from multiple storage engines are found,
            say, $datadir/mysql/engines.MYI and $datadir/mysql/engines.ibd.

            If the record for mysql.engines.engine_id identifies a supported
            signature, we will continue startup. Based on the signature, we will
            hard-code the definition of mysql.tables (or some other table that
            might replace it in a future version of MariaDB). And again, we would
            ensure that files for exactly one storage engine are found for this table.

            In summary, MariaDB 10.4 would only support metadata for the tag
            'MariaDB 10.4', and it would be hard-coded as follows:
            {code:SQL}
            /* Hard-coded like this in all MariaDB versions starting from 10.4 */
            CREATE TABLE engines
            (
              engine_id TINYINT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
              name VARCHAR(64) NOT NULL UNIQUE
            );

            /* Hard-coded like this if 'MariaDB 10.4' is returned by
            SELECT name FROM mysql.engines WHERE engine_id=0 */
            CREATE TABLE tables
            (
              table_id SERIAL,
              engine_id TINYINT UNSIGNED NOT NULL REFERENCES engines(engine_id),
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL,
              dynamic_metadata TEXT NULL,
              UNIQUE INDEX(table_schema, table_name)
            );
            {code}
            The metadata for all other persistent tables would be stored in
            mysql.tables (or whatever set of tables might replace it in some
            future MariaDB version). In a later version, we are free to change the
            definition of mysql.tables or use a different set of tables instead of
            it, given that we introduce a new tag in mysql.engines.

            For the sake of completeness, we might consider storing the metadata
            for mysql.engines and mysql.tables in mysql.tables as well. But
            this kind of redundancy could lead to inconsistencies if some code
            actually expects to find the metadata in the tables, and the metadata
            has somehow been updated to be inconsistent with the hard-coded
            definitions.

            h3. After Global Data Dictionary (from MariaDB 10.4)

            For upgrade between Global Data Dictionary versions, there should be
            an explicit startup option for upgrading. For example, if MariaDB 10.5
            changes the format of mysql.tables, it would normally refuse startup
            if it encounters a 'MariaDB 10.4' tag in mysql.engines.

            Only if a special --upgrade option is specified and if the server
            version recognizes the tagged data dictionary version (for example,
            'MariaDB 10.4'), an upgrade would be performed. This upgrade would
            implement the functionality of the currently external upgrade
            scripts. In this way, we can guarantee that an upgrade is always
            performed correctly.

            If the version that we are upgrading to supports multi-table DDL
            transactions, the upgrade should be atomic and crash-safe out of the
            box (assuming that the undo log format does not change). Otherwise, we
            may need some special logic to implement the upgrade in an idempotent
            way (ALTER TABLE if not already altered).

            Once the data dictionary and the system tables have been upgraded to a
            newer version, there is no easy way to go back.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support SHOW CREATE TABLE of all old
            tables. InnoDB would refuse startup for upgrade in the following
            cases:

             * the redo log is not clean (use the WL#8845 code from MySQL 8.0.0)
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:SQL}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the mysql.engines and
            mysql.tables tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be theoretically
            possible to downgrade to an older MariaDB, provided that there were no
            InnoDB DDL operations and there is a slow shutdown (the undo and redo
            logs are empty). But, perhaps we should not even try to allow this,
            because in subsequent upgrades, it could be more tricky to downgrade.
            Maybe we should just educate users that the new special option
            --upgrade is a ‘point of no return’. After invoking --upgrade, the
            only supported outcomes are the following:

            * the --upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with --upgrade
            InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (*.frm files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace *.frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

            In MySQL 8.0.0, the *.frm files have been replaced, but the InnoDB data dictionary was retained. DDL is not yet crash-safe, because the changes to the Global Data Dictionary tables are committed separately from the changes to the InnoDB data dictionary tables.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: dd::Properties is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in *.ibd files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a *.frm file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the *.SDI file and the WL#7053 embedded metadata inside the *.ibd files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of some InnoDB binary file subformats related to data dictionary and undo logging.
            The *.frm files can be preserved for backward compatibility (tables could be copied by copying data and *.frm files).
            The data dictionary can be kept in the *.frm files, which can be made crash-safe and transactional by adding some transactional logging.

            Optionally, we can create a ‘cache’ of the *.frm files in a persistent table.

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # There will be a common redo log for and all crash-safe storage engines. Alternatively, mysql.tables will have to be created in InnoDB, and crash-safe DDL for other tables than InnoDB is only possible by using the XA 2PC mechanism via the binlog.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index is at page 3.)

            h2. Definitions

            Let us introduce a CREATE TABLE attribute ROLLBACK=NO for disabling undo logging on a persistent (crash-safe) table. The semantics is that any modification to the table data *may* be persisted between the time the modification is issued and the transaction is committed. Pending modifications may be explicitly persisted by the storage engine. Upon transaction commit, all modifications will have been persisted.

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:sql}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables VALUES (3, 2, 'test', 't1',
            'TABLE(a INT) _ROW_FORMAT=DYNAMIC', <frm file contents>);
            {code}

            Note 1: mysql.tables only stores persistent metadata about tables
            and sequences.
            CREATE TEMPORARY TABLE would not write anything there, but CREATE
            TABLE…ENGINE=MEMORY would, if the table definition (but not necessarily
            the contents) is supposed to survive server restart.

            Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between *.ibd files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will
            be removed or replaced both in tables.definition, tables.frm and in .frm files.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in
            mysql.tables. For backward compatibility and easy import/export,
            multiple *.frm files will continue to be created, one for each
            partition or subpartition.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            datadir/schemaname/tablename*.frm files. The *.frm files can continue
            to be used to move tables or partitions to other server instances.

            Note 1.6: On a table definition cache miss in handler::ha_open(), both
            mysql.tables and the file system will be consulted for .frm or .par
            files. If no record is found, a record will be inserted into
            mysql.tables based on the .frm file contents. This allows the user to
            replace tables by copying files. Because .frm or .par files contain no
            FOREIGN KEY metadata, nothing will be inserted into
            mysql.tables_children. If the .frm file is found valid but different
            from mysql.tables.frm, then the old records for the table by this name
            will be deleted from mysql.tables and mysql.tables_children and a new
            record will be inserted into mysql.tables.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the *.ibd files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            {code:sql}
            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_schema VARCHAR(64) NOT NULL,
              parent_name VARCHAR(64) NOT NULL,
              child_schema VARCHAR(64) NOT NULL,
              child_name VARCHAR(64) NOT NULL,
              PRIMARY KEY(parent_table_id, child_table_id, child_schema, child_name)
            );
            {code}

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:sql}
            /* Crash recovery information. (For InnoDB: Which tables or indexes to
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED PRIMARY KEY,
              recovery_data BLOB NOT NULL
            );
            {code}

            The mysql.ddl_log.recovery_data could be decomposed further
            into type,filename1,filename2,blob.
            In InnoDB, the required operations are:
            * (type=DELETE_FILE, space_id, filename)
            * (type=ROLLBACK_RENAME_FILE, space_id, old_name, new_name)
            The attribute space_id is private to InnoDB,
            and it would be appended to the blob in an InnoDB-defined format.

            Some operations, such as DROP INDEX (rollback of ADD INDEX or roll-forward of DROP INDEX) can be logged directly in the private transaction log of the storage engine.
            Likewise, TRUNCATE can be logged directly in the storage engine, with no changes to the data dictionary. Such TRUNCATE logging would also be useful for improving the performance of inserting into an empty table or partition (MDEV-515).
            Note: MySQL 8.0 logs TRUNCATE as a combination of RENAME, CREATE and DROP. The [WL#9536 design|https://dev.mysql.com/worklog/task/?id=9536] looks like an (almost) verbatim copy of my WL#7016, which remains flagged as private.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB will no longer apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # the MDEV-11657 mysql.transactions table using a temporary name (say, mysql.transactions0)
            # (the mysql.tables, tables_children tables)
            # the mysql.ddl_log table
            # fsync() all files
            # finally, rename the file(s) for mysql.transactions (no undo or redo logging!)

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            If the server is killed before the renaming of the data file for
            mysql.transactions is persisted, a subsequent server startup without
            --initialize would be refused, because the data file does not exist.
            A subsequent --initialize could detect this situation (by ensuring
            that the --datadir does not contain any other files than those that
            --initialize is supposed to create) and restart after deleting all
            files.

            h3. Normal startup in MariaDB 10.4

            MariaDB probably can assume that the mysql.transactions always has the same format in all subsequent versions of MariaDB.
            If the mysql.transactions table is in InnoDB format, instant ADD COLUMN (MDEV-11369) can be used to extend the table.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support SHOW CREATE TABLE of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the mysql.engines and
            mysql.tables tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade
            marko Marko Mäkelä made changes -
            Description InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (*.frm files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace *.frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

            In MySQL 8.0.0, the *.frm files have been replaced, but the InnoDB data dictionary was retained. DDL is not yet crash-safe, because the changes to the Global Data Dictionary tables are committed separately from the changes to the InnoDB data dictionary tables.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: dd::Properties is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in *.ibd files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a *.frm file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the *.SDI file and the WL#7053 embedded metadata inside the *.ibd files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of some InnoDB binary file subformats related to data dictionary and undo logging.
            The *.frm files can be preserved for backward compatibility (tables could be copied by copying data and *.frm files).
            The data dictionary can be kept in the *.frm files, which can be made crash-safe and transactional by adding some transactional logging.

            Optionally, we can create a ‘cache’ of the *.frm files in a persistent table.

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # There will be a common redo log for and all crash-safe storage engines. Alternatively, mysql.tables will have to be created in InnoDB, and crash-safe DDL for other tables than InnoDB is only possible by using the XA 2PC mechanism via the binlog.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index is at page 3.)

            h2. Definitions

            Let us introduce a CREATE TABLE attribute ROLLBACK=NO for disabling undo logging on a persistent (crash-safe) table. The semantics is that any modification to the table data *may* be persisted between the time the modification is issued and the transaction is committed. Pending modifications may be explicitly persisted by the storage engine. Upon transaction commit, all modifications will have been persisted.

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:sql}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables VALUES (3, 2, 'test', 't1',
            'TABLE(a INT) _ROW_FORMAT=DYNAMIC', <frm file contents>);
            {code}

            Note 1: mysql.tables only stores persistent metadata about tables
            and sequences.
            CREATE TEMPORARY TABLE would not write anything there, but CREATE
            TABLE…ENGINE=MEMORY would, if the table definition (but not necessarily
            the contents) is supposed to survive server restart.

            Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between *.ibd files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will
            be removed or replaced both in tables.definition, tables.frm and in .frm files.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in
            mysql.tables. For backward compatibility and easy import/export,
            multiple *.frm files will continue to be created, one for each
            partition or subpartition.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            datadir/schemaname/tablename*.frm files. The *.frm files can continue
            to be used to move tables or partitions to other server instances.

            Note 1.6: On a table definition cache miss in handler::ha_open(), both
            mysql.tables and the file system will be consulted for .frm or .par
            files. If no record is found, a record will be inserted into
            mysql.tables based on the .frm file contents. This allows the user to
            replace tables by copying files. Because .frm or .par files contain no
            FOREIGN KEY metadata, nothing will be inserted into
            mysql.tables_children. If the .frm file is found valid but different
            from mysql.tables.frm, then the old records for the table by this name
            will be deleted from mysql.tables and mysql.tables_children and a new
            record will be inserted into mysql.tables.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the *.ibd files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            {code:sql}
            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_schema VARCHAR(64) NOT NULL,
              parent_name VARCHAR(64) NOT NULL,
              child_schema VARCHAR(64) NOT NULL,
              child_name VARCHAR(64) NOT NULL,
              PRIMARY KEY(parent_table_id, child_table_id, child_schema, child_name)
            );
            {code}

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:sql}
            /* Crash recovery information. (For InnoDB: Which tables or indexes to
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED PRIMARY KEY,
              recovery_data BLOB NOT NULL
            );
            {code}

            The mysql.ddl_log.recovery_data could be decomposed further
            into type,filename1,filename2,blob.
            In InnoDB, the required operations are:
            * (type=DELETE_FILE, space_id, filename)
            * (type=ROLLBACK_RENAME_FILE, space_id, old_name, new_name)
            The attribute space_id is private to InnoDB,
            and it would be appended to the blob in an InnoDB-defined format.

            Some operations, such as DROP INDEX (rollback of ADD INDEX or roll-forward of DROP INDEX) can be logged directly in the private transaction log of the storage engine.
            Likewise, TRUNCATE can be logged directly in the storage engine, with no changes to the data dictionary. Such TRUNCATE logging would also be useful for improving the performance of inserting into an empty table or partition (MDEV-515).
            Note: MySQL 8.0 logs TRUNCATE as a combination of RENAME, CREATE and DROP. The [WL#9536 design|https://dev.mysql.com/worklog/task/?id=9536] looks like an (almost) verbatim copy of my WL#7016, which remains flagged as private.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB will no longer apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # the MDEV-11657 mysql.transactions table using a temporary name (say, mysql.transactions0)
            # (the mysql.tables, tables_children tables)
            # the mysql.ddl_log table
            # fsync() all files
            # finally, rename the file(s) for mysql.transactions (no undo or redo logging!)

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            If the server is killed before the renaming of the data file for
            mysql.transactions is persisted, a subsequent server startup without
            --initialize would be refused, because the data file does not exist.
            A subsequent --initialize could detect this situation (by ensuring
            that the --datadir does not contain any other files than those that
            --initialize is supposed to create) and restart after deleting all
            files.

            h3. Normal startup in MariaDB 10.4

            MariaDB probably can assume that the mysql.transactions always has the same format in all subsequent versions of MariaDB.
            If the mysql.transactions table is in InnoDB format, instant ADD COLUMN (MDEV-11369) can be used to extend the table.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support SHOW CREATE TABLE of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the mysql.engines and
            mysql.tables tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade
            InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (*.frm files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace *.frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

            In MySQL 8.0.0, the *.frm files have been replaced, but the InnoDB data dictionary was retained. DDL is not yet crash-safe, because the changes to the Global Data Dictionary tables are committed separately from the changes to the InnoDB data dictionary tables.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: dd::Properties is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in *.ibd files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a *.frm file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the *.SDI file and the WL#7053 embedded metadata inside the *.ibd files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of some InnoDB binary file subformats related to data dictionary and undo logging.
            The *.frm files can be preserved for backward compatibility (tables could be copied by copying data and *.frm files).
            The data dictionary can be kept in the *.frm files, which can be made crash-safe and transactional by adding some transactional logging.

            Optionally, we can create a ‘cache’ of the *.frm files in a persistent table.

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # There will be a common redo log for and all crash-safe storage engines. Alternatively, mysql.tables will have to be created in InnoDB, and crash-safe DDL for other tables than InnoDB is only possible by using the XA 2PC mechanism via the binlog.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index is at page 3.)

            h2. Definitions

            Let us introduce a CREATE TABLE attribute ROLLBACK=NO for disabling undo logging on a persistent (crash-safe) table. The semantics is that any modification to the table data *may* be persisted between the time the modification is issued and the transaction is committed. Pending modifications may be explicitly persisted by the storage engine. Upon transaction commit, all modifications will have been persisted.

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:sql}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables VALUES (3, 2, 'test', 't1',
            'TABLE(a INT) _ROW_FORMAT=DYNAMIC', <frm file contents>);
            {code}

            Note 1: mysql.tables only stores persistent metadata about tables
            and sequences.
            CREATE TEMPORARY TABLE would not write anything there, but CREATE
            TABLE…ENGINE=MEMORY would, if the table definition (but not necessarily
            the contents) is supposed to survive server restart.

            Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between *.ibd files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will
            be removed or replaced both in tables.definition, tables.frm and in .frm files.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in
            mysql.tables. For backward compatibility and easy import/export,
            multiple *.frm files will continue to be created, one for each
            partition or subpartition.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            datadir/schemaname/tablename*.frm files. The *.frm files can continue
            to be used to move tables or partitions to other server instances.

            Note 1.6: On a table definition cache miss in handler::ha_open(), both
            mysql.tables and the file system will be consulted for .frm or .par
            files. If no record is found, a record will be inserted into
            mysql.tables based on the .frm file contents. This allows the user to
            replace tables by copying files. Because .frm or .par files contain no
            FOREIGN KEY metadata, nothing will be inserted into
            mysql.tables_children. If the .frm file is found valid but different
            from mysql.tables.frm, then the old records for the table by this name
            will be deleted from mysql.tables and mysql.tables_children and a new
            record will be inserted into mysql.tables.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the *.ibd files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            {code:sql}
            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_schema VARCHAR(64) NOT NULL,
              parent_name VARCHAR(64) NOT NULL,
              child_schema VARCHAR(64) NOT NULL,
              child_name VARCHAR(64) NOT NULL,
              PRIMARY KEY(parent_table_id, child_table_id, child_schema, child_name)
            );
            {code}

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:sql}
            /* Crash recovery information. (For InnoDB: Which tables or indexes to
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED PRIMARY KEY,
              recovery_data BLOB NOT NULL
            );
            {code}

            The mysql.ddl_log.recovery_data could be decomposed further
            into type,filename1,filename2,blob.
            In InnoDB, the required operations are:
            * (type=DELETE_FILE, space_id, filename)
            * (type=ROLLBACK_RENAME_FILE, space_id, old_name, new_name)
            The attribute space_id is private to InnoDB,
            and it would be appended to the blob in an InnoDB-defined format.

            Some operations, such as DROP INDEX (rollback of ADD INDEX or roll-forward of DROP INDEX) can be logged directly in the private transaction log of the storage engine.
            Likewise, TRUNCATE can be logged directly in the storage engine, with no changes to the data dictionary. Such TRUNCATE logging would also be useful for improving the performance of inserting into an empty table or partition (MDEV-515).
            Note: MySQL 8.0 logs TRUNCATE as a combination of RENAME, CREATE and DROP. The [WL#9536 design|https://dev.mysql.com/worklog/task/?id=9536] looks like an (almost) verbatim copy of my WL#7016, which remains flagged as private.

            h3. Example: Transactional RENAME TABLE
            RENAME TABLE differs from CREATE and DROP operations by modifying persistent metadata in the file system before the transaction is committed.
            CREATE operations would write 'DELETE_FILE' or similar fixup records that would be deleted at the DDL transaction commit. DROP operations would write 'DELETE_FILE' or similar records as part of the DDL transaction commit, and a post-commit fixup step would take care of actually deleting files.

            RENAME TABLE involves renaming persistent files before the transaction is committed. So, the fixup records must be of the form 'ROLLBACK_RENAME_FILE'. If any of the rename operations fail (for example due to insufficient file system permissions), then the transaction will be rolled back, and any renames that succeeded will be compensated by another rename operation. If the rollback-of-rename fails, then we are out of luck and should probably block access to the table.

            If no symbolic links are involved (there is no DATA DIRECTORY or INDEX DIRECTORY attribute), the file system operation can probably be a simple matter of
            {code:c}
            #ifdef _WIN32
            MoveFileEx(from, to, MOVEFILE_WRITE_THROUGH);
            #else
            rename(from, to);
            // TODO: persist the rename() by calling syncfs(fd), fsync(directory), or fsync(file)
            #endif
            {code}
            With symbolic links, there are more failure modes, and it is challenging to guarantee a reasonably reliable crash recovery, especially when multiple file systems are allowed. Perhaps an approach with hard links and a duplicated symbolic link would work:
            {code:sh}
            ln /some/where/a.ibd /some/where/b.ibd
            sync
            ln -s /some/where/b.ibd b.ibd
            sync
            rm /some/where/a.ibd a.ibd
            sync
            {code}
            On recovery, when seeing that we have to rename b back to a, we should necessarily find at least one of the symbolic links, and also at least one of the link targets, and could perform a normalization step that would leave us with exactly one symbolic link and one link target. If that normalization step leaves us with b.ibd, we would perform the above rename step in the opposite direction, to crash-safely rollback the rename back to a.ibd. Only after the file system is in a stable state where only the symbolic link and data file for a.ibd exist, we would delete the fixup record.

            A single RENAME TABLE statement can operate on multiple tables. For simplicity, the above example covers a single-table rename only.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB will no longer apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # the MDEV-11657 mysql.transactions table using a temporary name (say, mysql.transactions0)
            # (the mysql.tables, tables_children tables)
            # the mysql.ddl_log table
            # fsync() all files
            # finally, rename the file(s) for mysql.transactions (no undo or redo logging!)

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            If the server is killed before the renaming of the data file for
            mysql.transactions is persisted, a subsequent server startup without
            --initialize would be refused, because the data file does not exist.
            A subsequent --initialize could detect this situation (by ensuring
            that the --datadir does not contain any other files than those that
            --initialize is supposed to create) and restart after deleting all
            files.

            h3. Normal startup in MariaDB 10.4

            MariaDB probably can assume that the mysql.transactions always has the same format in all subsequent versions of MariaDB.
            If the mysql.transactions table is in InnoDB format, instant ADD COLUMN (MDEV-11369) can be used to extend the table.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support SHOW CREATE TABLE of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the mysql.engines and
            mysql.tables tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            manjot Manjot Singh (Inactive) made changes -
            marko Marko Mäkelä added a comment - - edited

            The mysql.ddl_log table should not be used for engine-specific recovery, but only for rolling back or forward changes to .frm files or equivalent, if needed.

            Inside InnoDB, DDL operations can be recovered by writing undo log records, as hinted in MDEV-14585:

            • In MariaDB 10.3 (MDEV-14717), RENAME TABLE is already crash-safe inside InnoDB.
            • For CREATE TABLE, we would need an undo log record that says "drop on rollback".
              During MDEV-13564/MDEV-17158 debugging, it turned out that after a crash during CREATE TABLE, recovery would not remove .ibd files. The new undo log record would fix this.
            • For insert into an empty table or partition (MDEV-515), we would write an undo log reord "truncate on rollback".
            • For DROP TABLE we would need an undo log record "drop on commit". We would need a new kind of "synchronous purge" operation.
            • Since MDEV-13564 TRUNCATE TABLE is internally mapped to RENAME TABLE, CREATE TABLE and DROP TABLE inside InnoDB.
            marko Marko Mäkelä added a comment - - edited The mysql.ddl_log table should not be used for engine-specific recovery, but only for rolling back or forward changes to .frm files or equivalent, if needed. Inside InnoDB, DDL operations can be recovered by writing undo log records, as hinted in MDEV-14585 : In MariaDB 10.3 ( MDEV-14717 ), RENAME TABLE is already crash-safe inside InnoDB. For CREATE TABLE , we would need an undo log record that says "drop on rollback". During MDEV-13564 / MDEV-17158 debugging, it turned out that after a crash during CREATE TABLE , recovery would not remove .ibd files. The new undo log record would fix this. For insert into an empty table or partition ( MDEV-515 ), we would write an undo log reord "truncate on rollback". For DROP TABLE we would need an undo log record "drop on commit". We would need a new kind of "synchronous purge" operation. Since MDEV-13564 TRUNCATE TABLE is internally mapped to RENAME TABLE , CREATE TABLE and DROP TABLE inside InnoDB.
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Description InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (*.frm files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace *.frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

            In MySQL 8.0.0, the *.frm files have been replaced, but the InnoDB data dictionary was retained. DDL is not yet crash-safe, because the changes to the Global Data Dictionary tables are committed separately from the changes to the InnoDB data dictionary tables.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: dd::Properties is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in *.ibd files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a *.frm file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the *.SDI file and the WL#7053 embedded metadata inside the *.ibd files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of some InnoDB binary file subformats related to data dictionary and undo logging.
            The *.frm files can be preserved for backward compatibility (tables could be copied by copying data and *.frm files).
            The data dictionary can be kept in the *.frm files, which can be made crash-safe and transactional by adding some transactional logging.

            Optionally, we can create a ‘cache’ of the *.frm files in a persistent table.

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # There will be a common redo log for and all crash-safe storage engines. Alternatively, mysql.tables will have to be created in InnoDB, and crash-safe DDL for other tables than InnoDB is only possible by using the XA 2PC mechanism via the binlog.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index is at page 3.)

            h2. Definitions

            Let us introduce a CREATE TABLE attribute ROLLBACK=NO for disabling undo logging on a persistent (crash-safe) table. The semantics is that any modification to the table data *may* be persisted between the time the modification is issued and the transaction is committed. Pending modifications may be explicitly persisted by the storage engine. Upon transaction commit, all modifications will have been persisted.

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:sql}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables VALUES (3, 2, 'test', 't1',
            'TABLE(a INT) _ROW_FORMAT=DYNAMIC', <frm file contents>);
            {code}

            Note 1: mysql.tables only stores persistent metadata about tables
            and sequences.
            CREATE TEMPORARY TABLE would not write anything there, but CREATE
            TABLE…ENGINE=MEMORY would, if the table definition (but not necessarily
            the contents) is supposed to survive server restart.

            Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between *.ibd files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will
            be removed or replaced both in tables.definition, tables.frm and in .frm files.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in
            mysql.tables. For backward compatibility and easy import/export,
            multiple *.frm files will continue to be created, one for each
            partition or subpartition.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            datadir/schemaname/tablename*.frm files. The *.frm files can continue
            to be used to move tables or partitions to other server instances.

            Note 1.6: On a table definition cache miss in handler::ha_open(), both
            mysql.tables and the file system will be consulted for .frm or .par
            files. If no record is found, a record will be inserted into
            mysql.tables based on the .frm file contents. This allows the user to
            replace tables by copying files. Because .frm or .par files contain no
            FOREIGN KEY metadata, nothing will be inserted into
            mysql.tables_children. If the .frm file is found valid but different
            from mysql.tables.frm, then the old records for the table by this name
            will be deleted from mysql.tables and mysql.tables_children and a new
            record will be inserted into mysql.tables.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the *.ibd files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            {code:sql}
            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_schema VARCHAR(64) NOT NULL,
              parent_name VARCHAR(64) NOT NULL,
              child_schema VARCHAR(64) NOT NULL,
              child_name VARCHAR(64) NOT NULL,
              PRIMARY KEY(parent_table_id, child_table_id, child_schema, child_name)
            );
            {code}

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:sql}
            /* Crash recovery information. (For InnoDB: Which tables or indexes to
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED PRIMARY KEY,
              recovery_data BLOB NOT NULL
            );
            {code}

            The mysql.ddl_log.recovery_data could be decomposed further
            into type,filename1,filename2,blob.
            In InnoDB, the required operations are:
            * (type=DELETE_FILE, space_id, filename)
            * (type=ROLLBACK_RENAME_FILE, space_id, old_name, new_name)
            The attribute space_id is private to InnoDB,
            and it would be appended to the blob in an InnoDB-defined format.

            Some operations, such as DROP INDEX (rollback of ADD INDEX or roll-forward of DROP INDEX) can be logged directly in the private transaction log of the storage engine.
            Likewise, TRUNCATE can be logged directly in the storage engine, with no changes to the data dictionary. Such TRUNCATE logging would also be useful for improving the performance of inserting into an empty table or partition (MDEV-515).
            Note: MySQL 8.0 logs TRUNCATE as a combination of RENAME, CREATE and DROP. The [WL#9536 design|https://dev.mysql.com/worklog/task/?id=9536] looks like an (almost) verbatim copy of my WL#7016, which remains flagged as private.

            h3. Example: Transactional RENAME TABLE
            RENAME TABLE differs from CREATE and DROP operations by modifying persistent metadata in the file system before the transaction is committed.
            CREATE operations would write 'DELETE_FILE' or similar fixup records that would be deleted at the DDL transaction commit. DROP operations would write 'DELETE_FILE' or similar records as part of the DDL transaction commit, and a post-commit fixup step would take care of actually deleting files.

            RENAME TABLE involves renaming persistent files before the transaction is committed. So, the fixup records must be of the form 'ROLLBACK_RENAME_FILE'. If any of the rename operations fail (for example due to insufficient file system permissions), then the transaction will be rolled back, and any renames that succeeded will be compensated by another rename operation. If the rollback-of-rename fails, then we are out of luck and should probably block access to the table.

            If no symbolic links are involved (there is no DATA DIRECTORY or INDEX DIRECTORY attribute), the file system operation can probably be a simple matter of
            {code:c}
            #ifdef _WIN32
            MoveFileEx(from, to, MOVEFILE_WRITE_THROUGH);
            #else
            rename(from, to);
            // TODO: persist the rename() by calling syncfs(fd), fsync(directory), or fsync(file)
            #endif
            {code}
            With symbolic links, there are more failure modes, and it is challenging to guarantee a reasonably reliable crash recovery, especially when multiple file systems are allowed. Perhaps an approach with hard links and a duplicated symbolic link would work:
            {code:sh}
            ln /some/where/a.ibd /some/where/b.ibd
            sync
            ln -s /some/where/b.ibd b.ibd
            sync
            rm /some/where/a.ibd a.ibd
            sync
            {code}
            On recovery, when seeing that we have to rename b back to a, we should necessarily find at least one of the symbolic links, and also at least one of the link targets, and could perform a normalization step that would leave us with exactly one symbolic link and one link target. If that normalization step leaves us with b.ibd, we would perform the above rename step in the opposite direction, to crash-safely rollback the rename back to a.ibd. Only after the file system is in a stable state where only the symbolic link and data file for a.ibd exist, we would delete the fixup record.

            A single RENAME TABLE statement can operate on multiple tables. For simplicity, the above example covers a single-table rename only.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB will no longer apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # the MDEV-11657 mysql.transactions table using a temporary name (say, mysql.transactions0)
            # (the mysql.tables, tables_children tables)
            # the mysql.ddl_log table
            # fsync() all files
            # finally, rename the file(s) for mysql.transactions (no undo or redo logging!)

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            If the server is killed before the renaming of the data file for
            mysql.transactions is persisted, a subsequent server startup without
            --initialize would be refused, because the data file does not exist.
            A subsequent --initialize could detect this situation (by ensuring
            that the --datadir does not contain any other files than those that
            --initialize is supposed to create) and restart after deleting all
            files.

            h3. Normal startup in MariaDB 10.4

            MariaDB probably can assume that the mysql.transactions always has the same format in all subsequent versions of MariaDB.
            If the mysql.transactions table is in InnoDB format, instant ADD COLUMN (MDEV-11369) can be used to extend the table.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support SHOW CREATE TABLE of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the mysql.engines and
            mysql.tables tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade
            InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries ({{.frm}} files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of some InnoDB binary file subformats related to data dictionary.
            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash-safe and transactional by adding some transactional logging.

            Optionally, we can create a ‘cache’ of the {{.frm}} files in one or two persistent tables: {{mysql.tables}} and {{mysql.tables_children}} (for {{FOREIGN KEY}} relationships). Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:sql}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{mysql.tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in mysql.tables.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{mysql.tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{mysql.tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. Because {{.frm}} or {{.par}} files contain no
            {{FOREIGN KEY}} metadata, nothing will be inserted into
            {{mysql.tables_children}}. If the {{.frm}} file is found valid but different
            from {{mysql.tables.frm}}, then the old records for the table by this name
            will be deleted from {{mysql.tables}} and {{mysql.tables_children}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            {code:sql}
            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_schema VARCHAR(64) NOT NULL,
              parent_name VARCHAR(64) NOT NULL,
              child_schema VARCHAR(64) NOT NULL,
              child_name VARCHAR(64) NOT NULL,
              PRIMARY KEY(parent_table_id, child_table_id, child_schema, child_name)
            );
            {code}

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:sql}
            /* Crash recovery information. (Which .frm files to remove
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED PRIMARY KEY,
              type ENUM('DELETE_FILE','ROLLBACK_RENAME_FILE') NOT NULL,
              filename VARCHAR(512) NOT NULL,
              old_name VARCHAR(512)
            );
            {code}

            Some operations, such as DROP INDEX (rollback of ADD INDEX or roll-forward of DROP INDEX) can be logged directly in the private transaction log of the storage engine.
            Likewise, TRUNCATE can be logged directly in the storage engine, with no changes to the data dictionary.
            Note: MySQL 8.0 logs TRUNCATE as a combination of RENAME, CREATE and DROP. The [WL#9536 design|https://dev.mysql.com/worklog/task/?id=9536] looks like an (almost) verbatim copy of my WL#7016, which remains flagged as private.

            h3. Example: Transactional RENAME TABLE
            RENAME TABLE differs from CREATE and DROP operations by modifying persistent metadata in the file system before the transaction is committed.
            CREATE operations would write 'DELETE_FILE' or similar fixup records that would be deleted at the DDL transaction commit. DROP operations would write 'DELETE_FILE' or similar records as part of the DDL transaction commit, and a post-commit fixup step would take care of actually deleting files.

            RENAME TABLE involves renaming persistent files before the transaction is committed. So, the fixup records must be of the form 'ROLLBACK_RENAME_FILE'. If any of the rename operations fail (for example due to insufficient file system permissions), then the transaction will be rolled back, and any renames that succeeded will be compensated by another rename operation. If the rollback-of-rename fails, then we are out of luck and should probably block access to the table.

            If no symbolic links are involved (there is no DATA DIRECTORY or INDEX DIRECTORY attribute), the file system operation can probably be a simple matter of
            {code:c}
            #ifdef _WIN32
            MoveFileEx(from, to, MOVEFILE_WRITE_THROUGH);
            #else
            rename(from, to);
            // TODO: persist the rename() by calling syncfs(fd), fsync(directory), or fsync(file)
            #endif
            {code}
            With symbolic links, there are more failure modes, and it is challenging to guarantee a reasonably reliable crash recovery, especially when multiple file systems are allowed. Perhaps an approach with hard links and a duplicated symbolic link would work:
            {code:sh}
            ln /some/where/a.ibd /some/where/b.ibd
            sync
            ln -s /some/where/b.ibd b.ibd
            sync
            rm /some/where/a.ibd a.ibd
            sync
            {code}
            On recovery, when seeing that we have to rename {{b}} back to {{a}}, we should necessarily find at least one of the symbolic links, and also at least one of the link targets, and could perform a normalization step that would leave us with exactly one symbolic link and one link target. If that normalization step leaves us with {{b.ibd}}, we would perform the above rename step in the opposite direction, to crash-safely rollback the rename back to {{a.ibd}}. Only after the file system is in a stable state where only the symbolic link and data file for {{a.ibd}} exist, we would delete the fixup record.

            We could also write the operations on the symbolic links to the InnoDB redo log. With MDEV-14717, {{RENAME TABLE}} (without {{DATA DIRECTORY}}) is crash-safe in InnoDB.

            A single {{RENAME TABLE}} statement can operate on multiple tables. For simplicity, the above example covers a single-table rename only.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # (the mysql.tables, tables_children tables)
            # the mysql.ddl_log table
            # fsync() all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{mysql.tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks
            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self-consistent.

            If we implement {{mysql.tables}} and {{mysql.tables_children}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{mysql.tables}}.
            # If users delete files? Delete orphan records and report “table not found”.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            Update {{mysql.tables.frm}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)*(child|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only.

            What if we included the {{REFERENCES}} clauses in {{.frm}} files?
            Read all {{.frm}} at startup (slow!) and remember any {{FOREIGN KEY}}?

            We could {{CREATE TABLE mysql.tables_children}} and optionally {{name.frk}} files (all such files would be read on startup).

            h2. Cleaning up the {{FOREIGN KEY}} implementation
            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            Currently, metadata locks are not sufficiently taken for some operations. MySQL 8.0 finally implemented [WL#6049|https://dev.mysql.com/worklog/task/?id=6049], but MariaDB is missing it.

            Could we have native, cross-engine {{FOREIGN KEY}} processing in the SQL layer ([WL#148|https://dev.mysql.com/worklog/task/?id=148])? This would make {{ON…CASCADE}} and {{ON…SET NULL}}] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV-12483).
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            0001-WIP-Remove-dict_field_t-name-25ad38abe5b3fb6bb2aafff315de256b8f1e7839.patch is a work-in-progress patch for removing dict_field_t::name. Removing that redundant field would be simpler after removing the InnoDB data dictionary tables.

            marko Marko Mäkelä added a comment - 0001-WIP-Remove-dict_field_t-name-25ad38abe5b3fb6bb2aafff315de256b8f1e7839.patch is a work-in-progress patch for removing dict_field_t::name . Removing that redundant field would be simpler after removing the InnoDB data dictionary tables.
            monty Michael Widenius made changes -
            monty Michael Widenius made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            GeoffMontee Geoff Montee (Inactive) made changes -
            midenok Aleksey Midenkov made changes -
            midenok Aleksey Midenkov made changes -
            marko Marko Mäkelä made changes -
            Parent MDEV-11633 [ 59463 ]
            Issue Type Technical task [ 7 ] Task [ 3 ]
            GeoffMontee Geoff Montee (Inactive) made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            In MariaDB Server 10.6, all DDL operations inside InnoDB are expected to be atomic. This was improved in MDEV-18518, MDEV-24589, MDEV-25506 without any file format changes whatsoever!

            As far as InnoDB is concerned, we do not really need any ddl_log table for recovery. The undo log is sufficient. In MDEV-14717, we introduced the record TRX_UNDO_RENAME_TABLE. In MDEV-24589 and MDEV-18518, we refined the ‘trigger’ that exists between operations on SYS_INDEXES records and index trees. When that InnoDB internal table is removed as part of this task, we will have to replace the logic that invokes dict_drop_index_tree() on ROLLBACK of CREATE, or on purge after DROP has been committed. It could be simplest to introduce two new undo log records, like TRX_UNDO_CREATE_INDEX (triggered on rollback) and TRX_UNDO_DROP_INDEX (triggered on purge).

            For DDL recovery outside the InnoDB storage engine, MDEV-17567 will introduce a log mechanism in the SQL layer. That mechanism will ensure that no orphan .frm files will be left behind or out of sync with the storage engine.

            The .frm files provide a serialized description of the data dictionary. If we did not care about the upgrade process, it could be nicer to have a text-based format, something like a tablename.frm that would contain the CREATE TABLE string in plain text and a list of child table names that contain any FOREIGN KEY constraint that REFERENCES this table.

            To speed up server startup (say, when a user executes SHOW TABLES and there are millions of tables), the contents of the .frm files can be cached in a table. Something SQLite-style could be sufficient:

            CREATE TABLE tables(name VARCHAR(64) PRIMARY KEY, definition TEXT NOT NULL, referenced_by_names TEXT);
            

            For this to work, the .ibd and undo log file format in InnoDB would have to be changed as follows:

            • somewhere in the undo pages, maintain a mapping from numeric table identifiers (in InnoDB undo log records) to 32-bit tablespace ID (identifying .ibd files)
            • extend the .ibd files with explicit information on the secondary index root page numbers, and cover updates of it with the TRX_UNDO_CREATE_INDEX and TRX_UNDO_DROP_INDEX records
            marko Marko Mäkelä added a comment - In MariaDB Server 10.6, all DDL operations inside InnoDB are expected to be atomic. This was improved in MDEV-18518 , MDEV-24589 , MDEV-25506 without any file format changes whatsoever! As far as InnoDB is concerned, we do not really need any ddl_log table for recovery. The undo log is sufficient. In MDEV-14717 , we introduced the record TRX_UNDO_RENAME_TABLE . In MDEV-24589 and MDEV-18518 , we refined the ‘trigger’ that exists between operations on SYS_INDEXES records and index trees. When that InnoDB internal table is removed as part of this task, we will have to replace the logic that invokes dict_drop_index_tree() on ROLLBACK of CREATE , or on purge after DROP has been committed. It could be simplest to introduce two new undo log records, like TRX_UNDO_CREATE_INDEX (triggered on rollback) and TRX_UNDO_DROP_INDEX (triggered on purge). For DDL recovery outside the InnoDB storage engine, MDEV-17567 will introduce a log mechanism in the SQL layer. That mechanism will ensure that no orphan .frm files will be left behind or out of sync with the storage engine. The .frm files provide a serialized description of the data dictionary. If we did not care about the upgrade process, it could be nicer to have a text-based format, something like a tablename.frm that would contain the CREATE TABLE string in plain text and a list of child table names that contain any FOREIGN KEY constraint that REFERENCES this table. To speed up server startup (say, when a user executes SHOW TABLES and there are millions of tables), the contents of the .frm files can be cached in a table. Something SQLite-style could be sufficient: CREATE TABLE tables( name VARCHAR (64) PRIMARY KEY , definition TEXT NOT NULL , referenced_by_names TEXT); For this to work, the .ibd and undo log file format in InnoDB would have to be changed as follows: somewhere in the undo pages, maintain a mapping from numeric table identifiers (in InnoDB undo log records) to 32-bit tablespace ID (identifying .ibd files) extend the .ibd files with explicit information on the secondary index root page numbers, and cover updates of it with the TRX_UNDO_CREATE_INDEX and TRX_UNDO_DROP_INDEX records
            marko Marko Mäkelä made changes -
            Description InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries ({{.frm}} files and some data dictionary tables in the InnoDB system tablespace).

            h1. Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

            h1. Proposal for Global Data Dictionary in MariaDB 10.4
            MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

            MariaDB could get rid of some InnoDB binary file subformats related to data dictionary.
            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash-safe and transactional by adding some transactional logging.

            Optionally, we can create a ‘cache’ of the {{.frm}} files in one or two persistent tables: {{mysql.tables}} and {{mysql.tables_children}} (for {{FOREIGN KEY}} relationships). Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (Between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

            h3. Data dictionary table definitions

            {code:sql}
            /* Table definitions. */
            CREATE TABLE mysql.tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{mysql.tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the _ prefix in mysql.tables.definition. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in mysql.tables.definition, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and mysql.tables when opening a table definition.)

            Note 1.2: mysql.tables.description should omit any dynamic
            metadata attributes that are stored in mysql.tables.dynamic_metadata,
            such as AUTO_INCREMENT or UPDATE_TIME.

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in mysql.tables.

            Note 1.5: For additional safety and portability, whenever a change to
            a mysql.tables record is committed or when FLUSH TABLES is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{mysql.tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{mysql.tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. Because {{.frm}} or {{.par}} files contain no
            {{FOREIGN KEY}} metadata, nothing will be inserted into
            {{mysql.tables_children}}. If the {{.frm}} file is found valid but different
            from {{mysql.tables.frm}}, then the old records for the table by this name
            will be deleted from {{mysql.tables}} and {{mysql.tables_children}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            {code:sql}
            /* Tables with REFERENCES clauses pointing to a parent table. */
            CREATE TABLE mysql.tables_children
            (
              parent_schema VARCHAR(64) NOT NULL,
              parent_name VARCHAR(64) NOT NULL,
              child_schema VARCHAR(64) NOT NULL,
              child_name VARCHAR(64) NOT NULL,
              PRIMARY KEY(parent_table_id, child_table_id, child_schema, child_name)
            );
            {code}

            Note 2: mysql.tables_children allows the lookup of child tables whose
            FOREIGN KEY constraints refer to a given parent table. For the child
            tables, table.definition will contain the REFERENCES clauses referring
            to parent tables.

            Note 2.1: When both the parent table and the child table have been
            created, an entry in tables_children must exist. This table allows
            MariaDB to load child table definitions when a parent table is being
            loaded, for proper meta-data locking (MySQL WL#6049).

            Note 2.2: If any parent table or parent table columns are renamed, the
            table.definition of the affected child tables must be updated within
            the same transaction.

            {code:sql}
            /* Crash recovery information. (Which .frm files to remove
            drop on commit or rollback, or which files to rename back on rollback.) */
            CREATE TABLE mysql.ddl_log
            (
              sequence_number INT UNSIGNED PRIMARY KEY,
              type ENUM('DELETE_FILE','ROLLBACK_RENAME_FILE') NOT NULL,
              filename VARCHAR(512) NOT NULL,
              old_name VARCHAR(512)
            );
            {code}

            Some operations, such as DROP INDEX (rollback of ADD INDEX or roll-forward of DROP INDEX) can be logged directly in the private transaction log of the storage engine.
            Likewise, TRUNCATE can be logged directly in the storage engine, with no changes to the data dictionary.
            Note: MySQL 8.0 logs TRUNCATE as a combination of RENAME, CREATE and DROP. The [WL#9536 design|https://dev.mysql.com/worklog/task/?id=9536] looks like an (almost) verbatim copy of my WL#7016, which remains flagged as private.

            h3. Example: Transactional RENAME TABLE
            RENAME TABLE differs from CREATE and DROP operations by modifying persistent metadata in the file system before the transaction is committed.
            CREATE operations would write 'DELETE_FILE' or similar fixup records that would be deleted at the DDL transaction commit. DROP operations would write 'DELETE_FILE' or similar records as part of the DDL transaction commit, and a post-commit fixup step would take care of actually deleting files.

            RENAME TABLE involves renaming persistent files before the transaction is committed. So, the fixup records must be of the form 'ROLLBACK_RENAME_FILE'. If any of the rename operations fail (for example due to insufficient file system permissions), then the transaction will be rolled back, and any renames that succeeded will be compensated by another rename operation. If the rollback-of-rename fails, then we are out of luck and should probably block access to the table.

            If no symbolic links are involved (there is no DATA DIRECTORY or INDEX DIRECTORY attribute), the file system operation can probably be a simple matter of
            {code:c}
            #ifdef _WIN32
            MoveFileEx(from, to, MOVEFILE_WRITE_THROUGH);
            #else
            rename(from, to);
            // TODO: persist the rename() by calling syncfs(fd), fsync(directory), or fsync(file)
            #endif
            {code}
            With symbolic links, there are more failure modes, and it is challenging to guarantee a reasonably reliable crash recovery, especially when multiple file systems are allowed. Perhaps an approach with hard links and a duplicated symbolic link would work:
            {code:sh}
            ln /some/where/a.ibd /some/where/b.ibd
            sync
            ln -s /some/where/b.ibd b.ibd
            sync
            rm /some/where/a.ibd a.ibd
            sync
            {code}
            On recovery, when seeing that we have to rename {{b}} back to {{a}}, we should necessarily find at least one of the symbolic links, and also at least one of the link targets, and could perform a normalization step that would leave us with exactly one symbolic link and one link target. If that normalization step leaves us with {{b.ibd}}, we would perform the above rename step in the opposite direction, to crash-safely rollback the rename back to {{a.ibd}}. Only after the file system is in a stable state where only the symbolic link and data file for {{a.ibd}} exist, we would delete the fixup record.

            We could also write the operations on the symbolic links to the InnoDB redo log. With MDEV-14717, {{RENAME TABLE}} (without {{DATA DIRECTORY}}) is crash-safe in InnoDB.

            A single {{RENAME TABLE}} statement can operate on multiple tables. For simplicity, the above example covers a single-table rename only.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB 10.4, there would be no shared tablespaces, but instead all
            tables in the mysql schema would be isolated files managed by some
            storage engine. But, we do need to hard-code some metadata.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files (mysql.* tables) does not exist.

            On a special startup option --initialize, MariaDB will create a new
            instance. In this mode, the specified --datadir must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # (the mysql.tables, tables_children tables)
            # the mysql.ddl_log table
            # fsync() all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{mysql.tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks
            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self-consistent.

            If we implement {{mysql.tables}} and {{mysql.tables_children}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{mysql.tables}}.
            # If users delete files? Delete orphan records and report “table not found”.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            Update {{mysql.tables.frm}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)*(child|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only.

            What if we included the {{REFERENCES}} clauses in {{.frm}} files?
            Read all {{.frm}} at startup (slow!) and remember any {{FOREIGN KEY}}?

            We could {{CREATE TABLE mysql.tables_children}} and optionally {{name.frk}} files (all such files would be read on startup).

            h2. Cleaning up the {{FOREIGN KEY}} implementation
            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            Currently, metadata locks are not sufficiently taken for some operations. MySQL 8.0 finally implemented [WL#6049|https://dev.mysql.com/worklog/task/?id=6049], but MariaDB is missing it.

            Could we have native, cross-engine {{FOREIGN KEY}} processing in the SQL layer ([WL#148|https://dev.mysql.com/worklog/task/?id=148])? This would make {{ON…CASCADE}} and {{ON…SET NULL}}] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV-12483).
            With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic {{DROP DATABASE}}). They are not transactional yet, e.g., {{CREATE TABLE}} or {{ALTER TABLE}} may unnecessarily use multiple transactions (MDEV-25921).

            h1. Proposal for Global Data Dictionary in MariaDB
            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash-safe and transactional by adding some transactional logging.

            We can create a ‘cache’ of the {{.frm}} files in a persistent {{tables}} table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make {{.ibd}} files self-contained and store a metadata record with the following information:
            * {{DB_TRX_ID,DB_ROLL_PTR}} to facilitate rollback
            * Secondary index IDs and root page numbers
            * Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support {{CREATE TABLESPACE}} for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include {{AUTO_INCREMENT}} (MDEV-6076) and {{UPDATE_TIME}}. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

            h3. Data dictionary table definition

            {code:sql}
            /* Table definitions. */
            CREATE TABLE tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the _ prefix in {{tables.definition}}. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in {{tables.definition}}, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and {{tables}} when opening a table definition.)

            Note 1.2: In accordance with MDEV-16417, tables that are referencing the current table via {{FOREIGN KEY}} constraints would be identified in the {{.frm}} file and the {{tables.frm}} column. (The referencing tables’ {{REFERENCES}} clauses would be present in both {{tables.description}} and {{tables.frm}}.)

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in {{tables}}.

            Note 1.5: For additional safety and portability, whenever a change to
            a {{tables}} record is committed or when {{FLUSH TABLES}} is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. If the {{.frm}} file is found valid but different
            from {{tables.frm}}, then the old records for the table by this name
            will be deleted from {{tables}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            Note 1.8: If any table or columns attached to {{FOREIGN KEY…REFERENCES}} are renamed, the
            {{tables.definition}} and {{tables.frm}} of all affected tables must be updated within
            the same transaction.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB, we would probably have to hard-code the definition of the {{tables}} table, and compare it to {{tables.frm}} if it exists. An entry for {{tables}} itself could be omitted from the {{tables}} table.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files do not exist.

            On a special startup option {{--initialize}}, MariaDB will create a new
            instance. In this mode, the specified {{--datadir}} must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # (the {{tables}} table)
            # {{fsync()}} all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks
            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self-consistent.

            If we implement {{tables}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{tables}}. Alternative: Require an extra {{IMPORT}} command to be executed, to facilitate auditing.
            # If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until {{DROP TABLE}} has been executed.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            {{UPDATE tables SET frm=…}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)*(child|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only. As noted in MDEV-16417, we can extend the {{frm}} information to additionally store the names of referencing tables.

            h2. Cleaning up the {{FOREIGN KEY}} implementation
            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            MDEV-22361 would implement native, cross-engine {{FOREIGN KEY}} processing in the SQL layer. This would make {{ON…CASCADE}} and {{ON…SET NULL}}] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV-12483).

            h1. Comparison to the Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 78890 ] MariaDB v4 [ 130574 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            danblack Daniel Black made changes -
            AirFocus AirFocus made changes -
            Description With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic {{DROP DATABASE}}). They are not transactional yet, e.g., {{CREATE TABLE}} or {{ALTER TABLE}} may unnecessarily use multiple transactions (MDEV-25921).

            h1. Proposal for Global Data Dictionary in MariaDB
            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash-safe and transactional by adding some transactional logging.

            We can create a ‘cache’ of the {{.frm}} files in a persistent {{tables}} table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make {{.ibd}} files self-contained and store a metadata record with the following information:
            * {{DB_TRX_ID,DB_ROLL_PTR}} to facilitate rollback
            * Secondary index IDs and root page numbers
            * Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support {{CREATE TABLESPACE}} for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata
            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:
            Any other metadata. Examples of dynamic metadata attributes include {{AUTO_INCREMENT}} (MDEV-6076) and {{UPDATE_TIME}}. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

            h3. Data dictionary table definition

            {code:sql}
            /* Table definitions. */
            CREATE TABLE tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the _ prefix in {{tables.definition}}. For
            example, if no explicit ROW_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row_format. (If we do not
            store the actual format options in {{tables.definition}}, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and {{tables}} when opening a table definition.)

            Note 1.2: In accordance with MDEV-16417, tables that are referencing the current table via {{FOREIGN KEY}} constraints would be identified in the {{.frm}} file and the {{tables.frm}} column. (The referencing tables’ {{REFERENCES}} clauses would be present in both {{tables.description}} and {{tables.frm}}.)

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in {{tables}}.

            Note 1.5: For additional safety and portability, whenever a change to
            a {{tables}} record is committed or when {{FLUSH TABLES}} is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. If the {{.frm}} file is found valid but different
            from {{tables.frm}}, then the old records for the table by this name
            will be deleted from {{tables}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
            delete-marked records).

            Note 1.8: If any table or columns attached to {{FOREIGN KEY…REFERENCES}} are renamed, the
            {{tables.definition}} and {{tables.frm}} of all affected tables must be updated within
            the same transaction.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard-coded data dictionary table:
            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}
            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version_id is the current
            version. With a special start-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old-version data dictionary.

            In MariaDB, we would probably have to hard-code the definition of the {{tables}} table, and compare it to {{tables.frm}} if it exists. An entry for {{tables}} itself could be omitted from the {{tables}} table.

            h3. Bootstrap (--initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files do not exist.

            On a special startup option {{--initialize}}, MariaDB will create a new
            instance. In this mode, the specified {{--datadir}} must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
            # (the {{tables}} table)
            # {{fsync()}} all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             * the redo log is not clean
             * any tables exist in the InnoDB system tablespace
             * any tables contain secondary indexes not from page number 4 onwards
             * the change buffer is not empty
             * the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:
            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks
            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self-consistent.

            If we implement {{tables}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{tables}}. Alternative: Require an extra {{IMPORT}} command to be executed, to facilitate auditing.
            # If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until {{DROP TABLE}} has been executed.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            {{UPDATE tables SET frm=…}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)*(child|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only. As noted in MDEV-16417, we can extend the {{frm}} information to additionally store the names of referencing tables.

            h2. Cleaning up the {{FOREIGN KEY}} implementation
            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            MDEV-22361 would implement native, cross-engine {{FOREIGN KEY}} processing in the SQL layer. This would make {{ON…CASCADE}} and {{ON…SET NULL}}] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV-12483).

            h1. Comparison to the Global Data Dictionary in MySQL 8.0
            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…
            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key-value pairs
            # JSON-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage
            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification
            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification
            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

            h3. Write amplification
            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.
            With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic {{DROP DATABASE}}). They are not transactional yet, e.g., {{CREATE TABLE}} or {{ALTER TABLE}} may unnecessarily use multiple transactions (MDEV-25921).

            h1. Proposal for Global Data Dictionary in MariaDB

            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash\-safe and transactional by adding some transactional logging.

            We can create a ‘cache’ of the {{.frm}} files in a persistent {{tables}} table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make {{.ibd}} files self-contained and store a metadata record with the following information:

            * {{DB_TRX_ID,DB_ROLL_PTR}} to facilitate rollback
            * Secondary index IDs and root page numbers
            * Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV\-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support {{CREATE TABLESPACE}} for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self\-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata

            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:

            Any other metadata. Examples of dynamic metadata attributes include {{AUTO_INCREMENT}} (MDEV\-6076) and {{UPDATE_TIME}}. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

            h3. Data dictionary table definition

            {code:sql}
            /* Table definitions. */
            CREATE TABLE tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the \_ prefix in {{tables.definition}}. For
            example, if no explicit ROW\_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row\_format. (If we do not
            store the actual format options in {{tables.definition}}, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and {{tables}} when opening a table definition.)

            Note 1.2: In accordance with MDEV\-16417, tables that are referencing the current table via {{FOREIGN KEY}} constraints would be identified in the {{.frm}} file and the {{tables.frm}} column. (The referencing tables’ {{REFERENCES}} clauses would be present in both {{tables.description}} and {{tables.frm}}.)

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW\_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW\_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in {{tables}}.

            Note 1.5: For additional safety and portability, whenever a change to
            a {{tables}} record is committed or when {{FLUSH TABLES}} is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. If the {{.frm}} file is found valid but different
            from {{tables.frm}}, then the old records for the table by this name
            will be deleted from {{tables}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file\-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX\_ID, LSN and removing
            delete\-marked records).

            Note 1.8: If any table or columns attached to {{FOREIGN KEY…REFERENCES}} are renamed, the
            {{tables.definition}} and {{tables.frm}} of all affected tables must be updated within
            the same transaction.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard\-coded data dictionary table:

            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}

            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version\_id is the current
            version. With a special start\-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old\-version data dictionary.

            In MariaDB, we would probably have to hard\-code the definition of the {{tables}} table, and compare it to {{tables.frm}} if it exists. An entry for {{tables}} itself could be omitted from the {{tables}} table.

            h3. Bootstrap (\-\-initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files do not exist.

            On a special startup option {{--initialize}}, MariaDB will create a new
            instance. In this mode, the specified {{--datadir}} must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV\-11633 is completed)
            # (the {{tables}} table)
            # {{fsync()}} all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             \* the redo log is not clean
             \* any tables exist in the InnoDB system tablespace
             \* any tables contain secondary indexes not from page number 4 onwards
             \* the change buffer is not empty
             \* the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:

            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV\-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks

            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self\-consistent.

            If we implement {{tables}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{tables}}. Alternative: Require an extra {{IMPORT}} command to be executed, to facilitate auditing.
            # If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until {{DROP TABLE}} has been executed.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            {{UPDATE tables SET frm=…}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)\*(child\|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only. As noted in MDEV\-16417, we can extend the {{frm}} information to additionally store the names of referencing tables.

            h2. Cleaning up the {{FOREIGN KEY}} implementation

            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            MDEV-22361 would implement native, cross-engine {{FOREIGN KEY}} processing in the SQL layer. This would make {{ON…CASCADE}} and {{ON…SET NULL}}\] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV\-12483).

            h1. Comparison to the Global Data Dictionary in MySQL 8.0

            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…

            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key\-value pairs

            # JSON\-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage

            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION\_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION\_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification

            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification

            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS\_TABLES.ID.)

            h3. Write amplification

            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.
            danblack Daniel Black made changes -
            midenok Aleksey Midenkov made changes -

            marko I would not mix InnoDB and SQL tasks. Moving out .frm into transactional table is big enough task on its own.

            midenok Aleksey Midenkov added a comment - marko I would not mix InnoDB and SQL tasks. Moving out .frm into transactional table is big enough task on its own.
            midenok Aleksey Midenkov made changes -
            Description With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic {{DROP DATABASE}}). They are not transactional yet, e.g., {{CREATE TABLE}} or {{ALTER TABLE}} may unnecessarily use multiple transactions (MDEV-25921).

            h1. Proposal for Global Data Dictionary in MariaDB

            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash\-safe and transactional by adding some transactional logging.

            We can create a ‘cache’ of the {{.frm}} files in a persistent {{tables}} table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make {{.ibd}} files self-contained and store a metadata record with the following information:

            * {{DB_TRX_ID,DB_ROLL_PTR}} to facilitate rollback
            * Secondary index IDs and root page numbers
            * Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV\-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support {{CREATE TABLESPACE}} for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self\-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata

            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:

            Any other metadata. Examples of dynamic metadata attributes include {{AUTO_INCREMENT}} (MDEV\-6076) and {{UPDATE_TIME}}. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

            h3. Data dictionary table definition

            {code:sql}
            /* Table definitions. */
            CREATE TABLE tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the \_ prefix in {{tables.definition}}. For
            example, if no explicit ROW\_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row\_format. (If we do not
            store the actual format options in {{tables.definition}}, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and {{tables}} when opening a table definition.)

            Note 1.2: In accordance with MDEV\-16417, tables that are referencing the current table via {{FOREIGN KEY}} constraints would be identified in the {{.frm}} file and the {{tables.frm}} column. (The referencing tables’ {{REFERENCES}} clauses would be present in both {{tables.description}} and {{tables.frm}}.)

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW\_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW\_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in {{tables}}.

            Note 1.5: For additional safety and portability, whenever a change to
            a {{tables}} record is committed or when {{FLUSH TABLES}} is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. If the {{.frm}} file is found valid but different
            from {{tables.frm}}, then the old records for the table by this name
            will be deleted from {{tables}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file\-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX\_ID, LSN and removing
            delete\-marked records).

            Note 1.8: If any table or columns attached to {{FOREIGN KEY…REFERENCES}} are renamed, the
            {{tables.definition}} and {{tables.frm}} of all affected tables must be updated within
            the same transaction.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard\-coded data dictionary table:

            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}

            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version\_id is the current
            version. With a special start\-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old\-version data dictionary.

            In MariaDB, we would probably have to hard\-code the definition of the {{tables}} table, and compare it to {{tables.frm}} if it exists. An entry for {{tables}} itself could be omitted from the {{tables}} table.

            h3. Bootstrap (\-\-initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files do not exist.

            On a special startup option {{--initialize}}, MariaDB will create a new
            instance. In this mode, the specified {{--datadir}} must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV\-11633 is completed)
            # (the {{tables}} table)
            # {{fsync()}} all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             \* the redo log is not clean
             \* any tables exist in the InnoDB system tablespace
             \* any tables contain secondary indexes not from page number 4 onwards
             \* the change buffer is not empty
             \* the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:

            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV\-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks

            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self\-consistent.

            If we implement {{tables}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{tables}}. Alternative: Require an extra {{IMPORT}} command to be executed, to facilitate auditing.
            # If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until {{DROP TABLE}} has been executed.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            {{UPDATE tables SET frm=…}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)\*(child\|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only. As noted in MDEV\-16417, we can extend the {{frm}} information to additionally store the names of referencing tables.

            h2. Cleaning up the {{FOREIGN KEY}} implementation

            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            MDEV-22361 would implement native, cross-engine {{FOREIGN KEY}} processing in the SQL layer. This would make {{ON…CASCADE}} and {{ON…SET NULL}}\] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV\-12483).

            h1. Comparison to the Global Data Dictionary in MySQL 8.0

            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…

            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key\-value pairs

            # JSON\-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage

            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION\_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION\_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification

            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification

            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS\_TABLES.ID.)

            h3. Write amplification

            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.
            With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic {{DROP DATABASE}}, MDEV-25292). They are not transactional yet, e.g., {{CREATE TABLE}} or {{ALTER TABLE}} may unnecessarily use multiple transactions (MDEV-25921).

            h1. Proposal for Global Data Dictionary in MariaDB

            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash\-safe and transactional by adding some transactional logging.

            We can create a ‘cache’ of the {{.frm}} files in a persistent {{tables}} table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make {{.ibd}} files self-contained and store a metadata record with the following information:

            * {{DB_TRX_ID,DB_ROLL_PTR}} to facilitate rollback
            * Secondary index IDs and root page numbers
            * Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV\-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support {{CREATE TABLESPACE}} for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self\-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata

            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:

            Any other metadata. Examples of dynamic metadata attributes include {{AUTO_INCREMENT}} (MDEV\-6076) and {{UPDATE_TIME}}. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

            h3. Data dictionary table definition

            {code:sql}
            /* Table definitions. */
            CREATE TABLE tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the \_ prefix in {{tables.definition}}. For
            example, if no explicit ROW\_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row\_format. (If we do not
            store the actual format options in {{tables.definition}}, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and {{tables}} when opening a table definition.)

            Note 1.2: In accordance with MDEV\-16417, tables that are referencing the current table via {{FOREIGN KEY}} constraints would be identified in the {{.frm}} file and the {{tables.frm}} column. (The referencing tables’ {{REFERENCES}} clauses would be present in both {{tables.description}} and {{tables.frm}}.)

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW\_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW\_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in {{tables}}.

            Note 1.5: For additional safety and portability, whenever a change to
            a {{tables}} record is committed or when {{FLUSH TABLES}} is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. If the {{.frm}} file is found valid but different
            from {{tables.frm}}, then the old records for the table by this name
            will be deleted from {{tables}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file\-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX\_ID, LSN and removing
            delete\-marked records).

            Note 1.8: If any table or columns attached to {{FOREIGN KEY…REFERENCES}} are renamed, the
            {{tables.definition}} and {{tables.frm}} of all affected tables must be updated within
            the same transaction.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard\-coded data dictionary table:

            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}

            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version\_id is the current
            version. With a special start\-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old\-version data dictionary.

            In MariaDB, we would probably have to hard\-code the definition of the {{tables}} table, and compare it to {{tables.frm}} if it exists. An entry for {{tables}} itself could be omitted from the {{tables}} table.

            h3. Bootstrap (\-\-initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files do not exist.

            On a special startup option {{--initialize}}, MariaDB will create a new
            instance. In this mode, the specified {{--datadir}} must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV\-11633 is completed)
            # (the {{tables}} table)
            # {{fsync()}} all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             \* the redo log is not clean
             \* any tables exist in the InnoDB system tablespace
             \* any tables contain secondary indexes not from page number 4 onwards
             \* the change buffer is not empty
             \* the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:

            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV\-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks

            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self\-consistent.

            If we implement {{tables}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{tables}}. Alternative: Require an extra {{IMPORT}} command to be executed, to facilitate auditing.
            # If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until {{DROP TABLE}} has been executed.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            {{UPDATE tables SET frm=…}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)\*(child\|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only. As noted in MDEV\-16417, we can extend the {{frm}} information to additionally store the names of referencing tables.

            h2. Cleaning up the {{FOREIGN KEY}} implementation

            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            MDEV-22361 would implement native, cross-engine {{FOREIGN KEY}} processing in the SQL layer. This would make {{ON…CASCADE}} and {{ON…SET NULL}}\] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV\-12483).

            h1. Comparison to the Global Data Dictionary in MySQL 8.0

            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…

            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key\-value pairs

            # JSON\-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage

            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION\_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION\_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification

            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification

            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS\_TABLES.ID.)

            h3. Write amplification

            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.
            midenok Aleksey Midenkov made changes -
            Description With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic {{DROP DATABASE}}, MDEV-25292). They are not transactional yet, e.g., {{CREATE TABLE}} or {{ALTER TABLE}} may unnecessarily use multiple transactions (MDEV-25921).

            h1. Proposal for Global Data Dictionary in MariaDB

            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash\-safe and transactional by adding some transactional logging.

            We can create a ‘cache’ of the {{.frm}} files in a persistent {{tables}} table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make {{.ibd}} files self-contained and store a metadata record with the following information:

            * {{DB_TRX_ID,DB_ROLL_PTR}} to facilitate rollback
            * Secondary index IDs and root page numbers
            * Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV\-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support {{CREATE TABLESPACE}} for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self\-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata

            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:

            Any other metadata. Examples of dynamic metadata attributes include {{AUTO_INCREMENT}} (MDEV\-6076) and {{UPDATE_TIME}}. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

            h3. Data dictionary table definition

            {code:sql}
            /* Table definitions. */
            CREATE TABLE tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the \_ prefix in {{tables.definition}}. For
            example, if no explicit ROW\_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row\_format. (If we do not
            store the actual format options in {{tables.definition}}, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and {{tables}} when opening a table definition.)

            Note 1.2: In accordance with MDEV\-16417, tables that are referencing the current table via {{FOREIGN KEY}} constraints would be identified in the {{.frm}} file and the {{tables.frm}} column. (The referencing tables’ {{REFERENCES}} clauses would be present in both {{tables.description}} and {{tables.frm}}.)

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW\_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW\_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in {{tables}}.

            Note 1.5: For additional safety and portability, whenever a change to
            a {{tables}} record is committed or when {{FLUSH TABLES}} is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. If the {{.frm}} file is found valid but different
            from {{tables.frm}}, then the old records for the table by this name
            will be deleted from {{tables}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file\-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX\_ID, LSN and removing
            delete\-marked records).

            Note 1.8: If any table or columns attached to {{FOREIGN KEY…REFERENCES}} are renamed, the
            {{tables.definition}} and {{tables.frm}} of all affected tables must be updated within
            the same transaction.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard\-coded data dictionary table:

            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}

            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version\_id is the current
            version. With a special start\-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old\-version data dictionary.

            In MariaDB, we would probably have to hard\-code the definition of the {{tables}} table, and compare it to {{tables.frm}} if it exists. An entry for {{tables}} itself could be omitted from the {{tables}} table.

            h3. Bootstrap (\-\-initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files do not exist.

            On a special startup option {{--initialize}}, MariaDB will create a new
            instance. In this mode, the specified {{--datadir}} must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV\-11633 is completed)
            # (the {{tables}} table)
            # {{fsync()}} all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             \* the redo log is not clean
             \* any tables exist in the InnoDB system tablespace
             \* any tables contain secondary indexes not from page number 4 onwards
             \* the change buffer is not empty
             \* the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:

            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV\-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks

            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self\-consistent.

            If we implement {{tables}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{tables}}. Alternative: Require an extra {{IMPORT}} command to be executed, to facilitate auditing.
            # If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until {{DROP TABLE}} has been executed.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            {{UPDATE tables SET frm=…}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)\*(child\|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only. As noted in MDEV\-16417, we can extend the {{frm}} information to additionally store the names of referencing tables.

            h2. Cleaning up the {{FOREIGN KEY}} implementation

            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            MDEV-22361 would implement native, cross-engine {{FOREIGN KEY}} processing in the SQL layer. This would make {{ON…CASCADE}} and {{ON…SET NULL}}\] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV\-12483).

            h1. Comparison to the Global Data Dictionary in MySQL 8.0

            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…

            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key\-value pairs

            # JSON\-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage

            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION\_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION\_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification

            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification

            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS\_TABLES.ID.)

            h3. Write amplification

            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.
            With MDEV-17567 Atomic DDL in MariaDB Server 10.6, DDL operations are crash-safe and mostly atomic (except MDEV-25920 Atomic {{DROP DATABASE}}, MDEV-25292 Atomic CREATE OR REPLACE, MDEV-27180 Atomic partitioning). They are not transactional yet, e.g., {{CREATE TABLE}} or {{ALTER TABLE}} may unnecessarily use multiple transactions (MDEV-25921).

            h1. Proposal for Global Data Dictionary in MariaDB

            The {{.frm}} files can be preserved for backward compatibility (tables could be copied by copying data and {{.frm}} files).
            The data dictionary can be kept in the {{.frm}} files, which can be made crash\-safe and transactional by adding some transactional logging.

            We can create a ‘cache’ of the {{.frm}} files in a persistent {{tables}} table. Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing {{INFORMATION_SCHEMA}} queries.

            To allow the InnoDB internal data dictionary tables to be removed (MDEV-11633), we will have to make {{.ibd}} files self-contained and store a metadata record with the following information:

            * {{DB_TRX_ID,DB_ROLL_PTR}} to facilitate rollback
            * Secondary index IDs and root page numbers
            * Table identifier to map undo log records to tablespaces; alternative: make undo logs use tablespace ID instead of table ID

            h2. Prerequisites and Features

            This proposal is based on the following assumptions:

            # The data dictionary is stored in a crash-safe storage engine (such as {{ENGINE=Aria}}), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV\-11657.
            (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
            # MariaDB will not support {{CREATE TABLESPACE}} for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
            # InnoDB data files will be made self\-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
            (In MariaDB, or between MySQL 4.1 and MySQL 5.7, we can only know that in {{.ibd}} files, the clustered index (primary key) starts at page 3.)

            h2. Definitions

            h3. Static metadata

            Any metadata that can be changed by DDL operations only.

            h3. Dynamic metadata:

            Any other metadata. Examples of dynamic metadata attributes include {{AUTO_INCREMENT}} (MDEV\-6076) and {{UPDATE_TIME}}. Any dynamic metadata will be persisted within the storage engine; not written back to the data dictionary core. The latest dynamic metadata can only be accessed by invoking the storage engine.

            h3. Data dictionary table definition

            {code:sql}
            /* Table definitions. */
            CREATE TABLE tables
            (
              table_schema VARCHAR(64) NOT NULL,
              table_name VARCHAR(64) NOT NULL,
              create_time DATETIME NOT NULL DEFAULT now,
              definition TEXT NOT NULL, /* without table or sequence name */
              frm BLOB NOT NULL,
              PRIMARY KEY(table_schema, table_name)
            );

            /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
            INSERT INTO tables (table_schema, table_name, definition, frm)
            VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
                    <frm file contents>);
            {code}

            Note 1: {{tables}} only stores persistent metadata about tables and sequences.
            {{CREATE TEMPORARY TABLE}} would not write anything there, but {{CREATE TABLE…ENGINE=MEMORY}} would, if the table definition (unlike the contents) is supposed to survive server restart.

            Note 1.1: If we wish {{SHOW CREATE TABLE}} to omit implied attributes, we
            can identify them with the \_ prefix in {{tables.definition}}. For
            example, if no explicit ROW\_FORMAT is specified, InnoDB will store
            _ROW_FORMAT with the current innodb_default_row\_format. (If we do not
            store the actual format options in {{tables.definition}}, we will
            lose an opportunity to perform a consistency check between {{.ibd}} files
            and {{tables}} when opening a table definition.)

            Note 1.2: In accordance with MDEV\-16417, tables that are referencing the current table via {{FOREIGN KEY}} constraints would be identified in the {{.frm}} file and the {{tables.frm}} column. (The referencing tables’ {{REFERENCES}} clauses would be present in both {{tables.description}} and {{tables.frm}}.)

            Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in {{mysql.tables.definition}} and the {{.frm}} file.
            For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
            Likewise for InnoDB, INDEX DIRECTORY will be removed.
            For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
            unless ROW\_FORMAT=COMPRESSED.
            For any storage engine, an invalid ROW\_FORMAT would be replaced with
            the actually used value.

            Note 1.4: For partitioned tables, there will be a single entry in {{tables}}.

            Note 1.5: For additional safety and portability, whenever a change to
            a {{tables}} record is committed or when {{FLUSH TABLES}} is issued, we
            will update or create the affected or related
            {{datadir/schemaname/tablename.frm}} files. The files can continue
            to be used to move tables to other server instances.

            Note 1.6: On a table definition cache miss in {{handler::ha_open()}}, both
            {{tables}} and the file system will be consulted for {{.frm}} or {{.par}}
            files. If no record is found, a record will be inserted into
            {{tables}} based on the {{.frm}} file contents. This allows the user to
            replace tables by copying files. If the {{.frm}} file is found valid but different
            from {{tables.frm}}, then the old records for the table by this name
            will be deleted from {{tables}} and a new
            record will be inserted into {{mysql.tables}}.

            Note 1.7: It should be possible to import InnoDB tables or partitions
            across server instances using the same file\-copying method, with no
            special IMPORT syntax. To speed up the operation, a server GUID can be
            embedded in the {{.ibd}} files to distinguish ‘foreign’ files that may
            need adjustment or cleanup (such as resetting TRX\_ID, LSN and removing
            delete\-marked records).

            Note 1.8: If any table or columns attached to {{FOREIGN KEY…REFERENCES}} are renamed, the
            {{tables.definition}} and {{tables.frm}} of all affected tables must be updated within
            the same transaction.

            h2. Bootstrap, Startup and Upgrading

            In the MySQL 8.0 Global Data Dictionary, there will likely be a
            special Data Dictionary Tablespace that will contain all data
            dictionary tables. There is one hard\-coded data dictionary table:

            {code:SQL}
            CREATE TABLE mysql.version SELECT 1 version_id;
            {code}

            On startup, the server will read the contents of this table and refuse
            to start up unless the stored version\_id is the current
            version. With a special start\-up option, a future version of the
            server can upgrade from a supported earlier version. In the process,
            it would likely create a new data dictionary tablespace and copy and
            convert all metadata from the old\-version data dictionary.

            In MariaDB, we would probably have to hard\-code the definition of the {{tables}} table, and compare it to {{tables.frm}} if it exists. An entry for {{tables}} itself could be omitted from the {{tables}} table.

            h3. Bootstrap (\-\-initialize)

            MariaDB or InnoDB should ideally not apply any ‘create if not exists’
            logic on any files. A normal startup will be refused if some of the
            predefined files do not exist.

            On a special startup option {{--initialize}}, MariaDB will create a new
            instance. In this mode, the specified {{--datadir}} must be empty.

            We can initialize a new server instance in the following order:

            # the redo log(s) (possibly including dummy files for preventing startup on older versions)
            # (the InnoDB system and undo tablespaces, until MDEV\-11633 is completed)
            # (the {{tables}} table)
            # {{fsync()}} all files

            A startup attempt on an older MariaDB would fail because the (possibly
            dummy) redo log or binlog files are in incompatible format.

            h3. Upgrade To Transactional Data Dictionary (from earlier MariaDB version)

            We must have enough code to support {{SHOW CREATE TABLE}} of all old tables.
            InnoDB may refuse startup for upgrade in the following cases:

             \* the redo log is not clean
             \* any tables exist in the InnoDB system tablespace
             \* any tables contain secondary indexes not from page number 4 onwards
             \* the change buffer is not empty
             \* the undo logs are not empty

            In these cases, the user would have to address the problem in the older
            server version:

            {code:sql}
            SET GLOBAL innodb_file_per_table=1;
            OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
            SET GLOBAL innodb_fast_shutdown=0;
            /* shut down the server */
            {code}

            On upgrade, once MDEV\-11633 is completed, MariaDB would abandon
            (optionally destroy) the InnoDB system tablespace files and any undo
            tablespace files. We would no longer create such files.

            Upgrade would create and populate the {{tables}} tables based on the old data dictionary contents.

            If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

            * the upgrade will be refused due to a failed consistency or compatibility check
            * the upgrade completes
            * the server is killed and must be restarted with upgrade

            h1. Notes on Consistency Checks

            Do we need to check if InnoDB data files use the same {{ROW_FORMAT}} as is stored in the {{.frm}} file or in the {{mysql.tables}} entry? Probably not; we should only require that the {{.ibd}} file is self\-consistent.

            If we implement {{tables}}, how to ensure consistency with {{.frm}} files?

            # If users add files? Silently copy from {{.frm}} to {{tables}}. Alternative: Require an extra {{IMPORT}} command to be executed, to facilitate auditing.
            # If users delete files? Delete orphan records and report “table not found”. Alternative: Complain about missing files until {{DROP TABLE}} has been executed.
            # If users replace files? Detect when loading a table definition to the cache:
            Optionally, compare {{stat("name.frm")}} to {{create_time, LENGTH(frm)}}.
            {{UPDATE tables SET frm=…}} from the file {{name.frm}} if the file looks valid.
            {{FLUSH TABLES}} will force the next {{handler::open()}} to check {{.frm}}

            h1. Notes on {{FOREIGN KEY}}

            h2. Discovering the {{FOREIGN KEY}} Relationships

            Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)\*(child\|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

            How to find the {{FOREIGN KEY}} children? {{REFERENCES}} clauses in {{SHOW CREATE TABLE}} point to parents only. As noted in MDEV\-16417, we can extend the {{frm}} information to additionally store the names of referencing tables.

            h2. Cleaning up the {{FOREIGN KEY}} implementation

            Certain operations must know the {{FOREIGN KEY}} child/parent graph.
            {{RENAME TABLE}}, {{DROP TABLE}}, {{ALTER TABLE}}, {{TRUNCATE TABLE}} must either be prohibited, or the renaming of columns or tables must be cascaded.

            {{INSERT}}, {{UPDATE}}, {{DELETE}}, {{REPLACE}} must be prohibited or cascaded based on the parent and child tables.

            {{SELECT}}, {{ROLLBACK}}, {{COMMIT}} can continue to ignore {{FOREIGN KEY}} constraints.

            MDEV-22361 would implement native, cross-engine {{FOREIGN KEY}} processing in the SQL layer. This would make {{ON…CASCADE}} and {{ON…SET NULL}}\] fire {{TRIGGER}}, at last. This would also make {{FOREIGN KEY}} work on partitioned tables (MDEV\-12483).

            h1. Comparison to the Global Data Dictionary in MySQL 8.0

            MySQL 8.0 aims to replace {{.frm}} files and InnoDB {{SYS_*}} tables with a collection of tables that implement the Global Data Dictionary.

            The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

            # SQL statement (SHOW CREATE TABLE)
            # Metadata decomposed into tables,columns,indexes,index_column_usage,…

            * Not fully decomposed: some SQL expressions are stored as text
            * Not fully decomposed: {{dd::Properties}} is a string of key\-value pairs

            # JSON\-based Serialized Dictionary Information (SDI)
            # SDI in {{.ibd}} files for import/export & disaster recovery ([WL#7053|http://dev.mysql.com/worklog/task/?id=7053], [WL#7066|http://dev.mysql.com/worklog/task/?id=7066])
            # Additional InnoDB files for import/export; may be replaced by the above two

            h2. Normalized and serialized storage

            The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION\_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking {{handlerton}} or {{handler}} functions. See [Faster INFORMATION\_SCHEMA|http://mysqlserverteam.com/mysql-8-0-data-dictionary-status-in-the-8-0-0-dmr/].

            The decomposition has the drawback of overhead. Using the [terms popularized by Mark Callaghan|http://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html], it is causing read amplification, storage amplification, and write amplification.

            h3. Read amplification

            Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a {{.frm}} file.

            h3. Storage amplification

            The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS\_TABLES.ID.)

            h3. Write amplification

            For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the {{.SDI}} file and the WL#7053 embedded metadata inside the {{.ibd}} files.

            In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.
            julien.fritsch Julien Fritsch made changes -
            Issue Type Task [ 3 ] New Feature [ 2 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 106007

            People

              Unassigned Unassigned
              marko Marko Mäkelä
              Votes:
              15 Vote for this issue
              Watchers:
              31 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.