Details

    Description

      The InnoDB transaction system state resides in the InnoDB system tablespace and in optional undo tablespaces. There currently are few management tools related to this. Even for InnoDB storage engine developers, it is difficult to tell how much space is being used by the undo logs and due to which transactions.

      MDEV-10139 introduced crash-safe, ROLLBACK=NO tables for CREATE SEQUENCE. Let us use those for managing the transaction system metadata. This design allows the tables to be managed by any crash-safe storage engine and to be used by any transactional storage engine.

      For each created persistent transaction, a record would be inserted into a table. Unless and until we have a common crash recovery log for all engines, this table must be partitioned by storage engine. Even then, to allow multiple transactions to change state concurrently, it could make sense to have a separate partition or instance of the table for each transactional engine, and within each engine, multiple partitions.

      Between transactional storage engines that use this table, we can have cross-engine transactions without binlog coordination. The table would also eliminate the need for a separate table for GTID and other replication metadata.

      CREATE TABLE mysql.transactions_$engine
      (
        start_id SERIAL
        COMMENT 'logical start time, like InnoDB trx_t::id or DB_TRX_ID a.k.a. TRX_UNDO_TRX_ID',
        xid CHAR(128) /*UNIQUE*/ NULL COMMENT 'XA transaction ID',
        gtid VARCHAR /*UNIQUE*/ NULL COMMENT 'global transaction ID',
        state ENUM ('active','prepared','pre_commit','committed') NOT NULL
        COMMENT 'pre_commit is only for cross-engine transactions without binlog'
      )
      ROLLBACK=NO;
      

      Metadata on rolled-back transactions (state='active') would be deleted when the rollback completes.

      The uniqueness of start_id, xid, gtid is enforced by caching all active transactions in RAM.

      The bulk of the transaction information will be stored separately in each storage engine. For each transaction that performed changes, InnoDB stores a transaction log (undo log records) for supporting ROLLBACK, multi-versioning concurrency control (MVCC) and implicit record-level locking.

      Note: The start_id is allocated from a global sequence. The sequence is only persisted in mysql.transactions_$engine.start_id; on startup, the maximum value of the sequence will be determined by reading all the transaction metadata of uncommitted transactions. This is similar to what always happened on InnoDB startup.

      Note: No end_id needs to be stored. InnoDB does assign a trx_t::no on transaction COMMIT, but it is only used for main-memory data structures that maintain read views for multi-versioning and purging old history. For state!='committed' transactions that are recovered on restart, we can assign the current maximum value of the sequence to trx_t::no. (In other words, the InnoDB undo log header field TRX_UNDO_TRX_NO could have been replaced with a Boolean flag indicating whether the transaction was committed.)

      Metadata on committed transactions (state='committed') can be preserved as long as it is desirable, for example in multi-master replication. InnoDB should be fine if the record is deleted already on commit. (InnoDB MVCC can keep track of recently committed transactions in memory; it will not need the records in the table. Likewise, if InnoDB purge cannot find a record for a transaction in the table, it can conclude that the transaction has already been committed.)

      Temporal (system-versioned) tables use a separate table mysql.transaction_registry with start_id,end_id and start and end timestamps.

      As of MDEV-15132 (MariaDB 10.3.5), the persistent transaction directory is split into rollback segments (by default innodb_undo_logs=128). The rollback segments are analogous to partitioning the mysql.transactions_$engine table for allowing more concurrency between transaction commits. A rollback segment comprises:

      1. TRX_RSEG_HISTORY list of committed transactions waiting for purge
      2. A directory of TRX_RSEG_N_SLOTS entries (innodb_page_size/16, that is, 1024 for innodb_page_size=16k) pointing to the logs of uncommitted transactions. (There can be at most innodb_undo_logs*innodb_page_size/16 concurrently registered write transactions.)
      3. TRX_RSEG_MAX_TRX_ID, the maximum transaction ID at the latest commit
      4. the binlog position of the latest transaction commit in this rollback segment
      5. the Galera WSREP XID (analogous to GTID) of the latest commit

      Each undo log starts with a header that defines the state of the transaction. The header is followed by a doubly-linked list of undo log records. Starting with MDEV-12288, only one persistent undo log will be created for each transaction that modifies data.

      For simpler maintenance, the InnoDB undo logs should be moved into separate filename.ibu files, one for each filename.ibd. The file structure can be described with a table:

      CREATE TABLE innodb_undo_log_$file
      (
        start_id BIGINT UNSIGNED NOT NULL /* REFERENCES mysql.transactions_innodb */,
        committed BIT NOT NULL,
        sequence_number INT UNSIGNED NOT NULL
        COMMENT 'an increasing number within the transaction, starting from 1.
        May be pointed to by DB_ROLL_PTR in filename.ibd',
        primary_key BLOB NOT NULL COMMENT 'the PRIMARY KEY in the table',
        data BLOB NOT NULL COMMENT 'record type and parameters',
        PRIMARY KEY(start_id,sequence_number)
      )
      ROLLBACK=NO;
      

      Because there is a separate filename.ibu for each filename.ibd, it is trivial to prune old undo log records on DROP TABLE, TRUNCATE TABLE, OPTIMIZE TABLE or ALTER TABLE. Also the dict_table_t::id and SYS_TABLES.ID become unnecessary.

      InnoDB will persist its maximum used start_id in the filename.ibu files. Upon emptying or deleting filename.ibu files, it should be copied to the filename.ibd files, to facilitate instant import of data files (MDEV-11658).

      Further data record types can be added to cover new needs, such as the ability to preserve full locking information to fix MySQL Bug#57978 PREPARED transactions forget locks on server restart.

      The undo logs for InnoDB temporary tables will continue to be stored inside the InnoDB temporary tablespace.

      Recovery considerations

      Storage engines maintain private state needed for MVCC, locking, and COMMIT, ROLLBACK, XA ROLLBACK, XA COMMIT.

      Writes to mysql.transactions_$engine will be persisted to the write-ahead recovery log of the underlying storage engine(s). A transaction state change (COMMIT, XA PREPARE, XA COMMIT, XA ROLLBACK) is durable when the write becomes durable.

      The mysql.transactions_$engine records only track the transaction state.

      Engines must know all their pending transactions and execute

      SELECT * FROM mysql.transactions_$engine WHERE start_id=…;
      

      • MyRocks would need this at least to determine XA transaction state.
      • InnoDB will do this for all start_id found in its undo logs.
      • If no record is found, initiate ROLLBACK.

      A full ROLLBACK can DELETE FROM mysql.transactions_$engine at any point (start or end of rollback).

      For GTID, we must preserve the latest sequence from each source domain. If gtid IS NULL, COMMIT can DELETE FROM mysql.transactions_$engine.

      XA PREPARE transactions must remain until XA COMMIT or XA ROLLBACK.

      Cross-Engine Commit Without Binlog

      COMMIT must be Atomic and Durable across storage engines. With a cross-engine write-ahead log, this would be trivial. A special kind of 2-phase commit is needed for engines that use a private log:

      1. For each participating engine:

        UPDATE mysql.transactions_$engine SET state='pre_commit'; FLUSH LOGS;
        

      2. For each participating engine:

        UPDATE mysql.transactions_$engine SET state='committed';
        

      3. Flush at least one recovery log.

      Recovery: Read and merge the mysql.transactions_$engine records in RAM.
      If state='committed' in any record for the same start_id:

      • Other records must have state IN ('prepared','pre_commit').
      • UPDATE…SET state='committed' in all records.
      • Perform COMMIT

      If state='prepared' in all records: recover as XA PREPARE.

      • The transaction state will remain until XA ROLLBACK or XA COMMIT.

      Otherwise: treat the transaction as incomplete, and initiate ROLLBACK:

      • DELETE FROM mysql.transactions_$engine…;

      Attachments

        Issue Links

          Activity

            We can do without the table mysql.transaction_logs and keep the transaction logs private to each transactional storage engine. In this way, the interfaces between the server core and the storage engines will be cleaner.

            The engine-private transaction logs would refer to the global crash-safe mysql.transactions table that records the transaction metadata (GTID, XID, short id, state).

            If we go that route, for InnoDB an interesting option would be to write one undo log write per table, say, tablename.ibu for each .ibd file for which changes exist. In this way, it would be easy to see which tables are 'active'. (For exporting, some command would still be needed, so that the change buffer will be merged, and any purge will be quiesced. But the importing server could run the purge!)

            Internally, the tablename.ibu could be in the same format as InnoDB ROW_FORMAT=REDUNDANT tables. The access patterns would be:

            1. table scan in purge: ORDER BY start_id ASC, sequence_number ASC
            2. lookup and reverse scan in rollback: ORDER BY sequence_number DESC WHERE start_id=?
            3. lookup in MVCC: WHERE start_id=DB_TRX_ID AND sequence_number=DB_ROLL_PTR

            Because the InnoDB transaction logs would be partitioned by table or partition, we can eliminate table_id, and all these access patterns can be satisfied by a single index:
            PRIMARY KEY(start_id, sequence_number).

            The rest of the undo log record would be (type, PRIMARY KEY, optionally secondary KEY columns) referring to the user table or partition.

            marko Marko Mäkelä added a comment - We can do without the table mysql.transaction_logs and keep the transaction logs private to each transactional storage engine. In this way, the interfaces between the server core and the storage engines will be cleaner. The engine-private transaction logs would refer to the global crash-safe mysql.transactions table that records the transaction metadata (GTID, XID, short id, state). If we go that route, for InnoDB an interesting option would be to write one undo log write per table, say, tablename.ibu for each .ibd file for which changes exist. In this way, it would be easy to see which tables are 'active'. (For exporting, some command would still be needed, so that the change buffer will be merged, and any purge will be quiesced. But the importing server could run the purge!) Internally, the tablename.ibu could be in the same format as InnoDB ROW_FORMAT=REDUNDANT tables. The access patterns would be: table scan in purge: ORDER BY start_id ASC, sequence_number ASC lookup and reverse scan in rollback: ORDER BY sequence_number DESC WHERE start_id=? lookup in MVCC: WHERE start_id=DB_TRX_ID AND sequence_number=DB_ROLL_PTR Because the InnoDB transaction logs would be partitioned by table or partition, we can eliminate table_id, and all these access patterns can be satisfied by a single index: PRIMARY KEY(start_id, sequence_number). The rest of the undo log record would be (type, PRIMARY KEY, optionally secondary KEY columns) referring to the user table or partition.

            If we partition the undo logs by InnoDB table or partition (say, store them in tablename.ibu files), then we will have to figure out where and how to store undo log for rolling back RENAME TABLE operations, and how to make RENAME TABLE crash-safe. It could be doable by something like the following:

            1. Write (via the redo log) the RENAME TABLE undo log to the old_tablename.ibu file
            2. Write MLOG_FILE_RENAME for renaming the files to the redo log, in a single atomic mini-transaction
            3. Flush the redo log
            4. Rename the files

            If the server is killed after flushing the redo log, then we will necessarily end up with all the files having been renamed from old_tablename to new_tablename. If the transaction was not committed, then we would roll back the rename operation based on the undo log record that we would now find in new_tablename.ibu.

            If the server is killed before flushing the redo log for the MLOG_FILE_NAME records, then we might find the undo log record in old_tablename.ibu, but rollback would ignore it.

            marko Marko Mäkelä added a comment - If we partition the undo logs by InnoDB table or partition (say, store them in tablename.ibu files), then we will have to figure out where and how to store undo log for rolling back RENAME TABLE operations, and how to make RENAME TABLE crash-safe. It could be doable by something like the following: Write (via the redo log) the RENAME TABLE undo log to the old_tablename.ibu file Write MLOG_FILE_RENAME for renaming the files to the redo log, in a single atomic mini-transaction Flush the redo log Rename the files If the server is killed after flushing the redo log, then we will necessarily end up with all the files having been renamed from old_tablename to new_tablename. If the transaction was not committed, then we would roll back the rename operation based on the undo log record that we would now find in new_tablename.ibu. If the server is killed before flushing the redo log for the MLOG_FILE_NAME records, then we might find the undo log record in old_tablename.ibu, but rollback would ignore it.

            The .ibu files could impose a significant performance penalty to transactions that modify very many tables. It might be more feasible to restructure the InnoDB undo logs in a different way: introducing 128 persistent undo tablespaces, each containing a number of rollback segment header pages, with the entries corresponding to the records of mysql.transactions_$engine mentioned above.

            In DB_ROLL_PTR, the rollback segment ID could identify the undo tablespace. Theoretically, given that each DB_TRX_ID has only one persistent rollback segment, we would not even need that; MVCC could look up the undo tablespace based on the DB_TRX_ID. This would require extending main memory data structures so that some data of committed transactions would be stored until the transactions are purged.

            marko Marko Mäkelä added a comment - The .ibu files could impose a significant performance penalty to transactions that modify very many tables. It might be more feasible to restructure the InnoDB undo logs in a different way: introducing 128 persistent undo tablespaces, each containing a number of rollback segment header pages, with the entries corresponding to the records of mysql.transactions_$engine mentioned above. In DB_ROLL_PTR , the rollback segment ID could identify the undo tablespace. Theoretically, given that each DB_TRX_ID  has only one persistent rollback segment, we would not even need that; MVCC could look up the undo tablespace based on the DB_TRX_ID . This would require extending main memory data structures so that some data of committed transactions would be stored until the transactions are purged.

            For SPATIAL INDEX, I think that the undo log records should store the minimum bounding rectangles (MBR), so that these will not have to be recalculated during rollback or purge.

            marko Marko Mäkelä added a comment - For SPATIAL INDEX , I think that the undo log records should store the minimum bounding rectangles (MBR), so that these will not have to be recalculated during rollback or purge.

            For indexes on virtual columns, as noted in MDEV-17466, undo log records should store the keys, so that purge and rollback can avoid evaluating any virtual column values.

            marko Marko Mäkelä added a comment - For indexes on virtual columns, as noted in MDEV-17466 , undo log records should store the keys, so that purge and rollback can avoid evaluating any virtual column values.

            People

              Unassigned Unassigned
              marko Marko Mäkelä
              Votes:
              3 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.