Details

    • Type: Technical task
    • Status: Open (View Workflow)
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      InnoDB is one of the the best-known MySQL and MariaDB storage engines that aims to be crash-safe. But, DDL operations cannot be crash-safe as long as there are two independently updated data dictionaries (.frm files and some data dictionary tables in the InnoDB system tablespace).

      Global Data Dictionary in MySQL 8.0

      MySQL 8.0 aims to replace .frm files and InnoDB SYS_* tables with a collection of tables that implement the Global Data Dictionary.

      The MySQL 8.0 approach is rather complex, using multiple fundamentally different representation formats for the same data:

      1. SQL statement (SHOW CREATE TABLE)
      2. Metadata decomposed into tables,columns,indexes,index_column_usage,…
      • Not fully decomposed: some SQL expressions are stored as text
      • Not fully decomposed: dd::Properties is a string of key-value pairs
      1. JSON-based Serialized Dictionary Information (SDI)
      2. SDI in .ibd files for import/export & disaster recovery (WL#7053, WL#7066)
      3. Additional InnoDB files for import/export; may be replaced by the above two

      Normalized and serialized storage

      The decomposition of the metadata into relational tables has the benefit that it is possible to implement some INFORMATION_SCHEMA views based on the data dictionary tables. Not all of them; there are some ‘computed columns’ that the storage engines will be asked to fill in by invoking handlerton or handler functions. See Faster INFORMATION_SCHEMA.

      The decomposition has the drawback of overhead. Using the terms popularized by Mark Callaghan, it is causing read amplification, storage amplification, and write amplification.

      Read amplification

      Metadata will normally be loaded from a bunch of small records scattered in multiple tables, instead of loading serialized data, such as a .frm file.

      Storage amplification

      The data dictionary tables in MySQL 8.0 include a large number of secondary indexes. (In the core InnoDB data dictionary tables, there is only one secondary index on the unique column SYS_TABLES.ID.)

      Write amplification

      For example, RENAME TABLE on a partitioned table will cause a rewrite of the decomposed metadata records and a rewrite of one or both of the .SDI file and the WL#7053 embedded metadata inside the .ibd files.

      In MySQL 8.0.0, the WL#7053 interface is not used yet, and there is no optimization when it comes to updates. All decomposed records will be deleted and inserted with different IDs. This also leads to fragmentation in the data dictionary tables when the system is tortured with a constant stream of trivial DDL operations.

      Proposal for Global Data Dictionary in MariaDB 10.4

      MariaDB can use the Global Data Dictionary as a code cleanup opportunity, removing duplicated logic and relying more on code reuse.

      MariaDB could get rid of some InnoDB binary file subformats related to data dictionary.
      The .frm files can be preserved for backward compatibility (tables could be copied by copying data and .frm files).
      The data dictionary can be kept in the .frm files, which can be made crash-safe and transactional by adding some transactional logging.

      Optionally, we can create a ‘cache’ of the .frm files in one or two persistent tables: mysql.tables and mysql.tables_children (for FOREIGN KEY relationships). Reading from a single table should be much faster than reading table definitions from many small files, especially on startup and when executing INFORMATION_SCHEMA queries.

      Prerequisites and Features

      This proposal is based on the following assumptions:

      1. The data dictionary is stored in a crash-safe storage engine (such as ENGINE=Aria), which allows atomic transaction commit with other engines. For cross-engine DDL transactions, we can continue to use the binlog, or use a new mechanism proposed in MDEV-11657.
        (MySQL 8.0 will probably only support a single Data Dictionary Storage Engine (DDSE), InnoDB.)
      2. MariaDB will not support CREATE TABLESPACE for InnoDB tables (a feature that was introduced in MySQL 5.7). MariaDB follows the Unix philosophy of ‘everything is a file’, with the refinement that a table can comprise multiple files.
      3. InnoDB data files will be made self-contained with regard to index metadata. That is, InnoDB data files in MariaDB would know the storage location of each index tree.
        (Between MySQL 4.1 and MySQL 5.7, we can only know that in .ibd files, the clustered index (primary key) starts at page 3.)

      Definitions

      Static metadata

      Any metadata that can be changed by DDL operations only.

      Dynamic metadata:

      Any other metadata. Examples of dynamic metadata attributes include AUTO_INCREMENT (MDEV-6076) and UPDATE_TIME. Any dynamic metadata will be persisted within the storage engine; not written back to the MariaDB Global Data Dictionary. MariaDB will have to invoke handler member functions to retrieve the latest dynamic metadata.

      Data dictionary table definitions

      /* Table definitions. */
      CREATE TABLE mysql.tables
      (
        table_schema VARCHAR(64) NOT NULL,
        table_name VARCHAR(64) NOT NULL,
        create_time DATETIME NOT NULL DEFAULT now,
        definition TEXT NOT NULL, /* without table or sequence name */
        frm BLOB NOT NULL,
        PRIMARY KEY(table_schema, table_name)
      );
       
      /* Example for CREATE TABLE test.t1(a INT) when InnoDB is the default. */
      INSERT INTO tables (table_schema, table_name, definition, frm)
      VALUES ('test', 't1', 'TABLE(a INT) _ROW_FORMAT=DYNAMIC _ENGINE=InnoDB',
              <frm file contents>);
      

      Note 1: mysql.tables only stores persistent metadata about tables and sequences.
      CREATE TEMPORARY TABLE would not write anything there, but CREATE TABLE…ENGINE=MEMORY would, if the table definition (unlike the contents) is supposed to survive server restart.

      Note 1.1: If we wish SHOW CREATE TABLE to omit implied attributes, we
      can identify them with the _ prefix in mysql.tables.definition. For
      example, if no explicit ROW_FORMAT is specified, InnoDB will store
      _ROW_FORMAT with the current innodb_default_row_format. (If we do not
      store the actual format options in mysql.tables.definition, we will
      lose an opportunity to perform a consistency check between .ibd files
      and mysql.tables when opening a table definition.)

      Note 1.2: mysql.tables.description should omit any dynamic
      metadata attributes that are stored in mysql.tables.dynamic_metadata,
      such as AUTO_INCREMENT or UPDATE_TIME.

      Note 1.3: Explicitly specified invalid or unsupported attributes will be removed or replaced both in mysql.tables.definition and the .frm file.
      For example for InnoDB, KEY_BLOCK_SIZE specified on indexes will be removed.
      Likewise for InnoDB, INDEX DIRECTORY will be removed.
      For InnoDB, KEY_BLOCK_SIZE specified on table will be removed
      unless ROW_FORMAT=COMPRESSED.
      For any storage engine, an invalid ROW_FORMAT would be replaced with
      the actually used value.

      Note 1.4: For partitioned tables, there will be a single entry in mysql.tables.

      Note 1.5: For additional safety and portability, whenever a change to
      a mysql.tables record is committed or when FLUSH TABLES is issued, we
      will update or create the affected or related
      datadir/schemaname/tablename.frm files. The files can continue
      to be used to move tables to other server instances.

      Note 1.6: On a table definition cache miss in handler::ha_open(), both
      mysql.tables and the file system will be consulted for .frm or .par
      files. If no record is found, a record will be inserted into
      mysql.tables based on the .frm file contents. This allows the user to
      replace tables by copying files. Because .frm or .par files contain no
      FOREIGN KEY metadata, nothing will be inserted into
      mysql.tables_children. If the .frm file is found valid but different
      from mysql.tables.frm, then the old records for the table by this name
      will be deleted from mysql.tables and mysql.tables_children and a new
      record will be inserted into mysql.tables.

      Note 1.7: It should be possible to import InnoDB tables or partitions
      across server instances using the same file-copying method, with no
      special IMPORT syntax. To speed up the operation, a server GUID can be
      embedded in the .ibd files to distinguish ‘foreign’ files that may
      need adjustment or cleanup (such as resetting TRX_ID, LSN and removing
      delete-marked records).

      /* Tables with REFERENCES clauses pointing to a parent table. */
      CREATE TABLE mysql.tables_children
      (
        parent_schema VARCHAR(64) NOT NULL,
        parent_name VARCHAR(64) NOT NULL,
        child_schema VARCHAR(64) NOT NULL,
        child_name VARCHAR(64) NOT NULL,
        PRIMARY KEY(parent_table_id, child_table_id, child_schema, child_name)
      );
      

      Note 2: mysql.tables_children allows the lookup of child tables whose
      FOREIGN KEY constraints refer to a given parent table. For the child
      tables, table.definition will contain the REFERENCES clauses referring
      to parent tables.

      Note 2.1: When both the parent table and the child table have been
      created, an entry in tables_children must exist. This table allows
      MariaDB to load child table definitions when a parent table is being
      loaded, for proper meta-data locking (MySQL WL#6049).

      Note 2.2: If any parent table or parent table columns are renamed, the
      table.definition of the affected child tables must be updated within
      the same transaction.

      /* Crash recovery information. (Which .frm files to remove
      drop on commit or rollback, or which files to rename back on rollback.) */
      CREATE TABLE mysql.ddl_log
      (
        sequence_number INT UNSIGNED PRIMARY KEY,
        type ENUM('DELETE_FILE','ROLLBACK_RENAME_FILE') NOT NULL,
        filename VARCHAR(512) NOT NULL,
        old_name VARCHAR(512)
      );
      

      Some operations, such as DROP INDEX (rollback of ADD INDEX or roll-forward of DROP INDEX) can be logged directly in the private transaction log of the storage engine.
      Likewise, TRUNCATE can be logged directly in the storage engine, with no changes to the data dictionary.
      Note: MySQL 8.0 logs TRUNCATE as a combination of RENAME, CREATE and DROP. The WL#9536 design looks like an (almost) verbatim copy of my WL#7016, which remains flagged as private.

      Example: Transactional RENAME TABLE

      RENAME TABLE differs from CREATE and DROP operations by modifying persistent metadata in the file system before the transaction is committed.
      CREATE operations would write 'DELETE_FILE' or similar fixup records that would be deleted at the DDL transaction commit. DROP operations would write 'DELETE_FILE' or similar records as part of the DDL transaction commit, and a post-commit fixup step would take care of actually deleting files.

      RENAME TABLE involves renaming persistent files before the transaction is committed. So, the fixup records must be of the form 'ROLLBACK_RENAME_FILE'. If any of the rename operations fail (for example due to insufficient file system permissions), then the transaction will be rolled back, and any renames that succeeded will be compensated by another rename operation. If the rollback-of-rename fails, then we are out of luck and should probably block access to the table.

      If no symbolic links are involved (there is no DATA DIRECTORY or INDEX DIRECTORY attribute), the file system operation can probably be a simple matter of

      #ifdef _WIN32
      MoveFileEx(from, to, MOVEFILE_WRITE_THROUGH);
      #else
      rename(from, to);
      // TODO: persist the rename() by calling syncfs(fd), fsync(directory), or fsync(file)
      #endif
      

      With symbolic links, there are more failure modes, and it is challenging to guarantee a reasonably reliable crash recovery, especially when multiple file systems are allowed. Perhaps an approach with hard links and a duplicated symbolic link would work:

      ln /some/where/a.ibd /some/where/b.ibd
      sync
      ln -s /some/where/b.ibd b.ibd
      sync
      rm /some/where/a.ibd a.ibd
      sync
      

      On recovery, when seeing that we have to rename b back to a, we should necessarily find at least one of the symbolic links, and also at least one of the link targets, and could perform a normalization step that would leave us with exactly one symbolic link and one link target. If that normalization step leaves us with b.ibd, we would perform the above rename step in the opposite direction, to crash-safely rollback the rename back to a.ibd. Only after the file system is in a stable state where only the symbolic link and data file for a.ibd exist, we would delete the fixup record.

      We could also write the operations on the symbolic links to the InnoDB redo log. With MDEV-14717, RENAME TABLE (without DATA DIRECTORY) is crash-safe in InnoDB.

      A single RENAME TABLE statement can operate on multiple tables. For simplicity, the above example covers a single-table rename only.

      Bootstrap, Startup and Upgrading

      In the MySQL 8.0 Global Data Dictionary, there will likely be a
      special Data Dictionary Tablespace that will contain all data
      dictionary tables. There is one hard-coded data dictionary table:

      CREATE TABLE mysql.version SELECT 1 version_id;
      

      On startup, the server will read the contents of this table and refuse
      to start up unless the stored version_id is the current
      version. With a special start-up option, a future version of the
      server can upgrade from a supported earlier version. In the process,
      it would likely create a new data dictionary tablespace and copy and
      convert all metadata from the old-version data dictionary.

      In MariaDB 10.4, there would be no shared tablespaces, but instead all
      tables in the mysql schema would be isolated files managed by some
      storage engine. But, we do need to hard-code some metadata.

      Bootstrap (--initialize)

      MariaDB or InnoDB should ideally not apply any ‘create if not exists’
      logic on any files. A normal startup will be refused if some of the
      predefined files (mysql.* tables) does not exist.

      On a special startup option --initialize, MariaDB will create a new
      instance. In this mode, the specified --datadir must be empty.

      We can initialize a new server instance in the following order:

      1. the redo log(s) (possibly including dummy files for preventing startup on older versions)
      2. (the InnoDB system and undo tablespaces, until MDEV-11633 is completed)
      3. (the mysql.tables, tables_children tables)
      4. the mysql.ddl_log table
      5. fsync() all files

      A startup attempt on an older MariaDB would fail because the (possibly
      dummy) redo log or binlog files are in incompatible format.

      Upgrade To Global Data Dictionary (from MariaDB 10.3 or earlier)

      We must have enough code to support SHOW CREATE TABLE of all old tables.
      InnoDB may refuse startup for upgrade in the following cases:

      • the redo log is not clean
      • any tables exist in the InnoDB system tablespace
      • any tables contain secondary indexes not from page number 4 onwards
      • the change buffer is not empty
      • the undo logs are not empty

      In these cases, the user would have to address the problem in the older
      server version:

      SET GLOBAL innodb_file_per_table=1;
      OPTIMIZE TABLE table_that_was_in_the_system_tablespace;
      SET GLOBAL innodb_fast_shutdown=0;
      /* shut down the server */
      

      On upgrade, once MDEV-11633 is completed, MariaDB would abandon
      (optionally destroy) the InnoDB system tablespace files and any undo
      tablespace files. We would no longer create such files.

      Upgrade would create and populate the mysql.tables tables based on the old data dictionary contents.

      If upgrade does not destroy the old files, it should be possible to downgrade to an older MariaDB, provided that there were no InnoDB DDL operations and there is a slow shutdown (the undo and redo logs are empty). After invoking upgrade, the only supported outcomes are the following:

      • the upgrade will be refused due to a failed consistency or compatibility check
      • the upgrade completes
      • the server is killed and must be restarted with upgrade

      Notes on Consistency Checks

      Do we need to check if InnoDB data files use the same ROW_FORMAT as is stored in the .frm file or in the mysql.tables entry? Probably not; we should only require that the .ibd file is self-consistent.

      If we implement mysql.tables and mysql.tables_children, how to ensure consistency with .frm files?

      1. If users add files? Silently copy from .frm to mysql.tables.
      2. If users delete files? Delete orphan records and report “table not found”.
      3. If users replace files? Detect when loading a table definition to the cache:
        Optionally, compare stat("name.frm") to create_time, LENGTH(frm).
        Update mysql.tables.frm from the file name.frm if the file looks valid.
        FLUSH TABLES will force the next handler::open() to check .frm

      Notes on FOREIGN KEY

      Discovering the FOREIGN KEY Relationships

      Foreign key metadata is required so that the changes can be prohibited or cascaded based on the existence or contents of (grand)*(child|parent) tables. The SQL layer must know the tables for acquiring appropriate MDL.

      How to find the FOREIGN KEY children? REFERENCES clauses in SHOW CREATE TABLE point to parents only.

      What if we included the REFERENCES clauses in .frm files?
      Read all .frm at startup (slow!) and remember any FOREIGN KEY?

      We could CREATE TABLE mysql.tables_children and optionally name.frk files (all such files would be read on startup).

      Cleaning up the FOREIGN KEY implementation

      Certain operations must know the FOREIGN KEY child/parent graph.
      RENAME TABLE, DROP TABLE, ALTER TABLE, TRUNCATE TABLE must either be prohibited, or the renaming of columns or tables must be cascaded.

      INSERT, UPDATE, DELETE, REPLACE must be prohibited or cascaded based on the parent and child tables.

      SELECT, ROLLBACK, COMMIT can continue to ignore FOREIGN KEY constraints.

      Currently, metadata locks are not sufficiently taken for some operations. MySQL 8.0 finally implemented WL#6049, but MariaDB is missing it.

      Could we have native, cross-engine FOREIGN KEY processing in the SQL layer (WL#148)? This would make ON…CASCADE and ON…SET NULL] fire TRIGGER, at last. This would also make FOREIGN KEY work on partitioned tables (MDEV-12483).

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                marko Marko Mäkelä
              • Votes:
                11 Vote for this issue
                Watchers:
                24 Start watching this issue

                Dates

                • Created:
                  Updated: