Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-11657

Cross-engine transaction metadata

    XMLWordPrintable

Details

    Description

      The InnoDB transaction system state resides in the InnoDB system tablespace and in optional undo tablespaces. There currently are few management tools related to this. Even for InnoDB storage engine developers, it is difficult to tell how much space is being used by the undo logs and due to which transactions.

      MDEV-10139 introduced crash-safe, ROLLBACK=NO tables for CREATE SEQUENCE. Let us use those for managing the transaction system metadata. This design allows the tables to be managed by any crash-safe storage engine and to be used by any transactional storage engine.

      For each created persistent transaction, a record would be inserted into a table. Unless and until we have a common crash recovery log for all engines, this table must be partitioned by storage engine. Even then, to allow multiple transactions to change state concurrently, it could make sense to have a separate partition or instance of the table for each transactional engine, and within each engine, multiple partitions.

      Between transactional storage engines that use this table, we can have cross-engine transactions without binlog coordination. The table would also eliminate the need for a separate table for GTID and other replication metadata.

      CREATE TABLE mysql.transactions_$engine
      (
        start_id SERIAL
        COMMENT 'logical start time, like InnoDB trx_t::id or DB_TRX_ID a.k.a. TRX_UNDO_TRX_ID',
        xid CHAR(128) /*UNIQUE*/ NULL COMMENT 'XA transaction ID',
        gtid VARCHAR /*UNIQUE*/ NULL COMMENT 'global transaction ID',
        state ENUM ('active','prepared','pre_commit','committed') NOT NULL
        COMMENT 'pre_commit is only for cross-engine transactions without binlog'
      )
      ROLLBACK=NO;
      

      Metadata on rolled-back transactions (state='active') would be deleted when the rollback completes.

      The uniqueness of start_id, xid, gtid is enforced by caching all active transactions in RAM.

      The bulk of the transaction information will be stored separately in each storage engine. For each transaction that performed changes, InnoDB stores a transaction log (undo log records) for supporting ROLLBACK, multi-versioning concurrency control (MVCC) and implicit record-level locking.

      Note: The start_id is allocated from a global sequence. The sequence is only persisted in mysql.transactions_$engine.start_id; on startup, the maximum value of the sequence will be determined by reading all the transaction metadata of uncommitted transactions. This is similar to what always happened on InnoDB startup.

      Note: No end_id needs to be stored. InnoDB does assign a trx_t::no on transaction COMMIT, but it is only used for main-memory data structures that maintain read views for multi-versioning and purging old history. For state!='committed' transactions that are recovered on restart, we can assign the current maximum value of the sequence to trx_t::no. (In other words, the InnoDB undo log header field TRX_UNDO_TRX_NO could have been replaced with a Boolean flag indicating whether the transaction was committed.)

      Metadata on committed transactions (state='committed') can be preserved as long as it is desirable, for example in multi-master replication. InnoDB should be fine if the record is deleted already on commit. (InnoDB MVCC can keep track of recently committed transactions in memory; it will not need the records in the table. Likewise, if InnoDB purge cannot find a record for a transaction in the table, it can conclude that the transaction has already been committed.)

      Temporal (system-versioned) tables use a separate table mysql.transaction_registry with start_id,end_id and start and end timestamps.

      As of MDEV-15132 (MariaDB 10.3.5), the persistent transaction directory is split into rollback segments (by default innodb_undo_logs=128). The rollback segments are analogous to partitioning the mysql.transactions_$engine table for allowing more concurrency between transaction commits. A rollback segment comprises:

      1. TRX_RSEG_HISTORY list of committed transactions waiting for purge
      2. A directory of TRX_RSEG_N_SLOTS entries (innodb_page_size/16, that is, 1024 for innodb_page_size=16k) pointing to the logs of uncommitted transactions. (There can be at most innodb_undo_logs*innodb_page_size/16 concurrently registered write transactions.)
      3. TRX_RSEG_MAX_TRX_ID, the maximum transaction ID at the latest commit
      4. the binlog position of the latest transaction commit in this rollback segment
      5. the Galera WSREP XID (analogous to GTID) of the latest commit

      Each undo log starts with a header that defines the state of the transaction. The header is followed by a doubly-linked list of undo log records. Starting with MDEV-12288, only one persistent undo log will be created for each transaction that modifies data.

      For simpler maintenance, the InnoDB undo logs should be moved into separate filename.ibu files, one for each filename.ibd. The file structure can be described with a table:

      CREATE TABLE innodb_undo_log_$file
      (
        start_id BIGINT UNSIGNED NOT NULL /* REFERENCES mysql.transactions_innodb */,
        committed BIT NOT NULL,
        sequence_number INT UNSIGNED NOT NULL
        COMMENT 'an increasing number within the transaction, starting from 1.
        May be pointed to by DB_ROLL_PTR in filename.ibd',
        primary_key BLOB NOT NULL COMMENT 'the PRIMARY KEY in the table',
        data BLOB NOT NULL COMMENT 'record type and parameters',
        PRIMARY KEY(start_id,sequence_number)
      )
      ROLLBACK=NO;
      

      Because there is a separate filename.ibu for each filename.ibd, it is trivial to prune old undo log records on DROP TABLE, TRUNCATE TABLE, OPTIMIZE TABLE or ALTER TABLE. Also the dict_table_t::id and SYS_TABLES.ID become unnecessary.

      InnoDB will persist its maximum used start_id in the filename.ibu files. Upon emptying or deleting filename.ibu files, it should be copied to the filename.ibd files, to facilitate instant import of data files (MDEV-11658).

      Further data record types can be added to cover new needs, such as the ability to preserve full locking information to fix MySQL Bug#57978 PREPARED transactions forget locks on server restart.

      The undo logs for InnoDB temporary tables will continue to be stored inside the InnoDB temporary tablespace.

      Recovery considerations

      Storage engines maintain private state needed for MVCC, locking, and COMMIT, ROLLBACK, XA ROLLBACK, XA COMMIT.

      Writes to mysql.transactions_$engine will be persisted to the write-ahead recovery log of the underlying storage engine(s). A transaction state change (COMMIT, XA PREPARE, XA COMMIT, XA ROLLBACK) is durable when the write becomes durable.

      The mysql.transactions_$engine records only track the transaction state.

      Engines must know all their pending transactions and execute

      SELECT * FROM mysql.transactions_$engine WHERE start_id=…;
      

      • MyRocks would need this at least to determine XA transaction state.
      • InnoDB will do this for all start_id found in its undo logs.
      • If no record is found, initiate ROLLBACK.

      A full ROLLBACK can DELETE FROM mysql.transactions_$engine at any point (start or end of rollback).

      For GTID, we must preserve the latest sequence from each source domain. If gtid IS NULL, COMMIT can DELETE FROM mysql.transactions_$engine.

      XA PREPARE transactions must remain until XA COMMIT or XA ROLLBACK.

      Cross-Engine Commit Without Binlog

      COMMIT must be Atomic and Durable across storage engines. With a cross-engine write-ahead log, this would be trivial. A special kind of 2-phase commit is needed for engines that use a private log:

      1. For each participating engine:

        UPDATE mysql.transactions_$engine SET state='pre_commit'; FLUSH LOGS;
        

      2. For each participating engine:

        UPDATE mysql.transactions_$engine SET state='committed';
        

      3. Flush at least one recovery log.

      Recovery: Read and merge the mysql.transactions_$engine records in RAM.
      If state='committed' in any record for the same start_id:

      • Other records must have state IN ('prepared','pre_commit').
      • UPDATE…SET state='committed' in all records.
      • Perform COMMIT

      If state='prepared' in all records: recover as XA PREPARE.

      • The transaction state will remain until XA ROLLBACK or XA COMMIT.

      Otherwise: treat the transaction as incomplete, and initiate ROLLBACK:

      • DELETE FROM mysql.transactions_$engine…;

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              marko Marko Mäkelä
              Votes:
              3 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.