  1. MariaDB Server
  2. MDEV-14425

Change the InnoDB redo log format to reduce write amplification




      The InnoDB redo log format is not optimal in many respects:

      • At the start of ib_logfile0, there are only two log checkpoint blocks. The rest of the log file is written in a circular fashion.
      • On log checkpoint, some file name information needs to be appended to the log.
      • File names that were first changed since the latest checkpoint must be appended to the log. The bookkeeping causes some contention on log_sys.mutex and fil_system.mutex.
      • The log file was unnecessarily split into multiple files, logically treated as one big circular file. (MDEV-20907 in MariaDB Server 10.5.0 changed the default to 1 file, and later the parameter was deprecated and ignored.)
      • Log records are divided into tiny blocks of 512 bytes, with 12+4 bytes of header and footer (12+8 bytes with MDEV-12041 innodb_encrypt_log (10.4.0)).
      • We are holding a mutex while zero-filling unused parts of log blocks, encrypting log blocks, or computing checksums.
      • Mariabackup cannot copy the log without having access to the encryption keys. (It can copy data file pages without decrypting them.)

      We had some ideas to move to an append-only file and to partition the log into multiple files, but it turned out that a single fixed-size circular log file would perform best in typical scenarios.

      We should keep the first 512 bytes of the file ib_logfile0 for compatibility purposes, but everything else could be improved.

      • ib_logfile0 (after the 512-byte header) will be append-only, unencrypted, for records containing file names and checkpoint information. A checkpoint record will comprise an LSN and a byte offset in a separate, optionally encrypted, circular log file ib_logdata. The length of each record is explicitly tagged and the payload will be followed by CRC-32C.
      • The ib_logdata file can be append-only or circular. If it is circular, its fixed size must be an integer multiple of 512 bytes.
      • We remove log block headers and footers. All we really need is to detect the logical end of the circular log in ib_logdata. That can be achieved by making sure that mini-transactions are terminated by a sequence number (at least one bit) and a checksum. When the circular file wraps around, the sequence number will be incremented (or the sequence bit toggled).
      • For page-aligned I/O, allow dummy records to be written, to indicate that the next bytes (until the end of the physical page, no matter what the I/O block size is) must be ignored. Thus, there is no need to initialize any padding bytes or encrypt or compute checksums on them.
      • Encrypt and compute checksums on mtr_t::m_log before initiating a write to the circular log file. The log can be copied and checksums validated without access to encryption keys.
      • If ib_logdata is on a memory-mapped persistent memory device, then we could bypass both log_sys.buf and the file system: copy the data directly to the memory-mapped area and flush the CPU cache.
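
      The wrap-around rule for the sequence bit can be illustrated with a minimal sketch. It assumes that the LSN counts bytes, that LSN 0 maps to offset 0 of a circular file of `capacity` bytes, and that a single sequence bit is used. The function name `locate` is hypothetical and not part of the proposal.

```python
from typing import Tuple


def locate(lsn: int, capacity: int) -> Tuple[int, int]:
    """Map a byte-counted LSN to (offset in the circular file, sequence bit).

    Assumption: the bit starts at 0 and is toggled each time the writer
    wraps around to the start of the file.
    """
    if capacity <= 0 or capacity % 512:
        raise ValueError("capacity must be a positive multiple of 512 bytes")
    wraps, offset = divmod(lsn, capacity)
    return offset, wraps & 1
```

      A reader that expects sequence bit b at some offset and finds a mini-transaction carrying the other value knows that those bytes are left over from the previous lap, that is, the logical end of the log has been reached.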

      Some old InnoDB redo log parameters will be deprecated and ignored:

      • innodb_log_files_in_group (deprecated & ignored in MariaDB 10.5.1)
      • innodb_log_checksums (already set by default, and forced when innodb_encrypt_log is set)

      The parameter innodb_log_write_ahead_size may be repurposed to indicate the desired physical log block size. For PMEM, it might be 64 bytes (the width of a CPU cache line).

      The checkpoint and file operations log file ib_logfile0

      The file name ib_logfile0 and the existing format of the first 512 bytes will be retained for the purpose of upgrading and preventing downgrading. In the first 512 bytes of the file, the following information will be present:

      • InnoDB redo log format version identifier (in the format introduced by MySQL 5.7.9/MariaDB 10.2.2)
      • The size of the ib_logdata file (the maximum possible size of the file if it is append-only).
      • All encryption parameters for ib_logdata
      • CRC-32C checksum

      This file will be append-only. After the first 512 bytes, the format will be:

      • Optional: FILE_ID records (renamed from FILE_MODIFY) of all .ibd files that existed when the redo log file was created. Each record terminated by CRC-32C.
      • FILE_CHECKPOINT record with LSN and byte offset in ib_logdata, and sequence bit (in circular ib_logdata file), and CRC-32C. The byte offset will be at least 47 bits (128 TiB). Total size of a checkpoint: at least 1+8+6+4=19 bytes.
      • Optional: FILE_CREATE, FILE_DELETE, FILE_RENAME records and further FILE_CHECKPOINT records for any file operations or subsequent checkpoints since the redo log was created. Each record will be terminated by CRC-32C.
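
      The 1+8+6+4=19 byte layout of a checkpoint record can be sketched as follows. This is only an illustration: the type byte value 0xFE, the big-endian byte order, and the placement of the sequence bit in the top bit of the 6-byte field are assumptions, since the text fixes only the field sizes. The CRC-32C routine is a plain bitwise implementation of the Castagnoli polynomial.

```python
import struct

FILE_CHECKPOINT = 0xFE  # hypothetical type byte; the real value is not specified
OFFSET_BITS = 47        # 2**47 bytes = 128 TiB


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def encode_checkpoint(lsn: int, offset: int, sequence_bit: int) -> bytes:
    """Type byte, 8-byte LSN, 6 bytes of sequence bit and offset, 4-byte CRC."""
    if not 0 <= offset < 1 << OFFSET_BITS:
        raise ValueError("offset does not fit in 47 bits")
    packed = (sequence_bit & 1) << OFFSET_BITS | offset
    body = bytes([FILE_CHECKPOINT]) + struct.pack(">Q", lsn) + packed.to_bytes(6, "big")
    return body + struct.pack(">I", crc32c(body))


def decode_checkpoint(record: bytes):
    """Return (lsn, offset, sequence_bit), or None if the record is invalid."""
    if len(record) != 19 or record[0] != FILE_CHECKPOINT:
        return None
    body = record[:15]
    (stored_crc,) = struct.unpack(">I", record[15:])
    if stored_crc != crc32c(body):
        return None
    (lsn,) = struct.unpack(">Q", body[1:9])
    packed = int.from_bytes(body[9:15], "big")
    return lsn, packed & ((1 << OFFSET_BITS) - 1), packed >> OFFSET_BITS
```

      For example, `decode_checkpoint(encode_checkpoint(0x1234, 4096, 1))` returns `(0x1234, 4096, 1)`, and flipping any single bit of the record makes the decoder return `None`.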

      All writes to ib_logfile0 will be synchronous and durable (O_DSYNC, fdatasync() or O_SYNC, fsync()).
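
      A minimal sketch of such a durable append, using the write-then-fdatasync() variant, is shown here. The helper name `append_durably` is hypothetical. Opening the file with `os.O_DSYNC` instead would be the other option mentioned above. The sketch falls back to fsync() on platforms where Python does not expose fdatasync().

```python
import os


def append_durably(path: str, record: bytes) -> None:
    """Append record to path and return only after it has been synced."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        view = memoryview(record)
        while view:  # os.write() may write fewer bytes than requested
            view = view[os.write(fd, view):]
        # Prefer fdatasync(); fall back to fsync() where it is unavailable.
        getattr(os, "fdatasync", os.fsync)(fd)
    finally:
        os.close(fd)
```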

      The circular or append-only page-level log file ib_logdata

      The ib_logdata file will contain MDEV-12353 records, except for the FILE_ records, which will be written into ib_logfile0. Each mini-transaction will be followed by a CRC-32C of all the bytes (using the value 0 for the sequence bit or number), so that Mariabackup can avoid recomputing the checksum while copying the log to a new file.

      If ib_logdata is append-only, then we could enable log archiving by actually writing to files like ib_logdata.000000, ib_logdata.000001 and so on.

      Payload encoding

      The exact encoding of ib_logdata is not decided yet. We want to avoid overwriting the last log block, so we cannot have an explicit 'end of log' marker. We must associate each mini-transaction (atomic sequence of log records) with a sequence number (at the minimum, a sequence bit) and a checksum. The 4-byte CRC-32C is a good candidate, because it is already being used in data page checksums.

      We know that a CRC of nonzero bytes must be nonzero, and the mini-transaction payload cannot be zero, hence the CRC bytes can never be zero.
      Because of this, it would be beneficial to zero-initialize all the skipped bytes to increase the probability of quick detection of the end of the circular log. Compared to the existing format, we would avoid CRC computation or encryption of the skipped bytes, but the zero-filling would still have to be protected with a mutex.

      Alternative: Prepending a CRC to each MDEV-12353 mini-transaction

      In the MDEV-12353 encoding, a record cannot start with the bytes 0x00 or 0x01. Mini-transactions are currently being terminated by the byte 0x00. We could store the sequence bit in the terminating byte of the mini-transaction. The checksum would exclude the terminating byte.

      Only the payload bytes would be encrypted (not record types or lengths, and not page identifiers either). In that way, records can be parsed and validated efficiently. Decryption would only have to be invoked when the log really needs to be applied on the page. The initialization vector for encryption and decryption can include the unencrypted record header bytes.

      It is best to store the CRC before the mini-transaction payload, because the CRC cannot be 0. Hence, we can detect the end of the log without even parsing the mini-transaction bytes.

      For circular log files, we can introduce a special mini-transaction 'Skip the next N bytes', encoded in sizeof(CRC)+2+log(N) bytes: CRC, record type and length, subtype and the value of the sequence bit, and variable-length encoded N. If we need to pad a block with fewer bytes than the minimum size, we would write a record to skip the minimum size.

      Pros: Minimal overhead: sizeof(CRC) bytes per mini-transaction.
      Cons: Recovery may have to parse a lot of log before determining that the end of the log was reached.

      Alternative: Prepending a mini-transaction header with length and CRC

      We could encapsulate MDEV-12353 records (without the mini-transaction terminating NUL byte) in the following structure:

      • variable-length encoded integer of total_length << 2 | sequence_bit
      • CRC of the data payload and the variable-length encoded integer
      • the data payload (MDEV-12353 records); could be encrypted in their entirety

      Skipped bytes (at least 5) would be indicated by the following:

      • variable-length encoded integer of skipped_length << 2 | 1 << 1 | sequence_bit
      • CRC of the variable-length encoded integer (not including the skipped bytes)

      Pros: Recovery can determine more quickly that the end of the circular log was reached, thanks to the length, sequence bit and (nonzero) CRC being stored at the start.
      Pros: More of the log could be encrypted (at the cost of recovery and backup restoration speed)
      Cons: Slightly more overhead: sizeof(CRC)+log(length * 4) bytes. For length<32 bytes, no change of overhead.
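
      The header-based alternative can be sketched as follows. Several details are assumptions made for illustration only: a LEB128-style variable-length integer (the MDEV-12353 encoding differs), `zlib.crc32` as a stand-in for CRC-32C, `total_length` and `skipped_length` counting the bytes that follow the header, and a linear buffer without wrap-around. All function names are hypothetical.

```python
import zlib


def varint(n: int) -> bytes:
    """7 bits per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        n, low = n >> 7, n & 0x7F
        out.append(low | (0x80 if n else 0))
        if not n:
            return bytes(out)


def read_varint(buf: bytes, pos: int):
    """Return (value, next_pos), or (None, pos) if the buffer ends early."""
    value = shift = 0
    while pos < len(buf):
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7
    return None, pos


def frame(payload: bytes, seq: int) -> bytes:
    """Length and sequence bit, CRC over header and payload, then payload."""
    head = varint(len(payload) << 2 | seq)
    return head + zlib.crc32(head + payload).to_bytes(4, "big") + payload


def skip(n: int, seq: int) -> bytes:
    """Skip header (CRC over the header only) followed by n ignored bytes."""
    head = varint(n << 2 | 1 << 1 | seq)
    return head + zlib.crc32(head).to_bytes(4, "big") + bytes(n)


def scan(log: bytes, seq: int):
    """Return (payloads, end_offset) for the valid prefix of the log."""
    payloads, pos = [], 0
    while True:
        tag, body = read_varint(log, pos)
        if tag is None or (tag & 1) != seq or body + 4 > len(log):
            return payloads, pos
        head, stored = log[pos:body], log[body:body + 4]
        length, is_skip = tag >> 2, tag & 2
        end = body + 4 + length
        covered = head if is_skip else head + log[body + 4:end]
        if end > len(log) or zlib.crc32(covered).to_bytes(4, "big") != stored:
            return payloads, pos
        if not is_skip:
            payloads.append(log[body + 4:end])
        pos = end
```

      With this encoding, the scan stops at the first frame whose sequence bit or checksum does not match, without interpreting any payload bytes. A payload shorter than 32 bytes gets a one-byte header in front of the CRC, which matches the overhead remark above.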

      Log writing and synchronous flushing

      For the bulk of the changes done by mini-transactions, we do not care about flushing. The file system can write log file blocks as it pleases.

      Some state changes of the database must be made durable at a specific time. Examples include user transaction COMMIT, XA PREPARE, XA ROLLBACK, and (in case the binlog is not enabled) XA COMMIT.

      Whenever we want to make a certain change durable, we must flush all log files up to the LSN of the mini-transaction commit that made the change. While doing this, we can pad each log file to the file system block size (the 'skipped bytes'), so that the file system can always write full blocks. The padding could also be helpful when trying to resurrect a corrupted redo log file.
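
      The padding arithmetic can be sketched as follows, assuming that the LSN counts bytes and is aligned with the file offset modulo the block size. The handling of a gap that is smaller than the smallest encodable skip record (here padding through the following block as well) is only one possible policy and is an assumption of this sketch, as are the function name and the default `min_record` of 5 bytes.

```python
def pad_to_block(lsn: int, block_size: int, min_record: int = 5):
    """Return (skipped_bytes, next_lsn) such that next_lsn is block-aligned."""
    gap = -lsn % block_size  # bytes remaining in the current block
    if 0 < gap < min_record:
        # Too small to hold a skip record: extend the padding by one block.
        gap += block_size
    return gap, lsn + gap
```

      For example, `pad_to_block(4000, 4096)` yields 96 skipped bytes and a next LSN of 4096, while an already aligned LSN needs no padding.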

      If redo log is physically replicated to the buffer pools of slaves (like in PolarDB), then we should first write to the redo log files and only then to the slaves, and we should assume that the writes to the files will always eventually be durable. If that assumption is broken, then all servers would have to be restarted and perform crash recovery. Because the log sequence number will be counted in bytes, we will have to replicate the number of skipped bytes. (But we could omit the skipped bytes themselves when sending over the network.)

      Crash recovery

      1. Validate the ib_logfile0 header, including checking the size of the ib_logdata file.
      2. Read the last bytes of ib_logfile0 to see if it ends in a valid FILE_CHECKPOINT record. If not, read the entire ib_logfile0 to find the latest FILE_CHECKPOINT record. (The last write to ib_logfile0 could have been incomplete, and we may have to trim that file.)
      3. If ib_logdata ends at the identified byte position (we got a sequence bit mismatch or a checksum mismatch), then the database was clean.
      4. Else, recover the file name information based on the ib_logfile0 contents and start recovery.
      5. Parse the ib_logdata until the end. If any tablespace identifiers refer to unknown or inaccessible data files, abort startup unless innodb_force_recovery≥1. Ignore records for those tablespaces for which FILE_DELETE had been recorded.
      6. Now that the log has been validated, start the modifications in recovery. First, trim ib_logfile0 if needed, and replay FILE_DELETE and FILE_RENAME operations.
      7. Shrink any files for which we parsed a TRIM_PAGES record, and extend any files according to changes to FSP_SIZE.
      8. Apply the parsed log to all data files.
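
      Step 2 above (finding the latest FILE_CHECKPOINT record) can be sketched as follows. The record framing is hypothetical: one byte holding a 4-bit type and a 4-bit payload length (0 meaning that an explicit length byte follows), then the payload, then a 4-byte checksum. `zlib.crc32` stands in for CRC-32C, and the type code 0xF is an assumption.

```python
import zlib

HEADER_SIZE = 512
FILE_CHECKPOINT = 0xF          # hypothetical type code
CHECKPOINT_PAYLOAD = 8 + 6     # LSN plus offset and sequence bit
CHECKPOINT_SIZE = 1 + CHECKPOINT_PAYLOAD + 4


def make_record(rtype: int, payload: bytes) -> bytes:
    """Build a record in the hypothetical framing (payload of at most 255 bytes)."""
    if 1 <= len(payload) <= 15:
        head = bytes([rtype << 4 | len(payload)])
    else:
        head = bytes([rtype << 4, len(payload)])
    body = head + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def parse_record(buf: bytes, pos: int):
    """Return (type, payload, next_pos), or None if no valid record starts at pos."""
    if pos >= len(buf):
        return None
    rtype, length = buf[pos] >> 4, buf[pos] & 0xF
    start = pos + 1
    if length == 0:  # long form: explicit length byte
        if start >= len(buf):
            return None
        length, start = buf[start], start + 1
    end = start + length
    if end + 4 > len(buf):
        return None
    if zlib.crc32(buf[pos:end]).to_bytes(4, "big") != buf[end:end + 4]:
        return None
    return rtype, buf[start:end], end + 4


def latest_checkpoint(log: bytes):
    """Return (checkpoint payload, valid file length), or None if there is none."""
    # Fast path: the file ends in a complete, valid checkpoint record.
    tail = len(log) - CHECKPOINT_SIZE
    if tail >= HEADER_SIZE:
        rec = parse_record(log, tail)
        if (rec and rec[0] == FILE_CHECKPOINT and rec[2] == len(log)
                and len(rec[1]) == CHECKPOINT_PAYLOAD):
            return rec[1], len(log)
    # Slow path: scan all records. The first invalid record marks where an
    # incomplete last write would have to be trimmed.
    pos, found = HEADER_SIZE, None
    while True:
        rec = parse_record(log, pos)
        if rec is None:
            break
        pos = rec[2]
        if rec[0] == FILE_CHECKPOINT:
            found = rec[1]
    return (found, pos) if found is not None else None
```

      The second element of the result is the length to which ib_logfile0 would be trimmed in step 6 if the last write was incomplete.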

      Considerations on checkpoints and writing ib_logdata

      After completing a checkpoint (the write of the FILE_CHECKPOINT record to ib_logfile0), we could punch a hole in ib_logdata to discard no-longer-needed log records.

      As an option, the redo log could be rebuilt on a checkpoint, by creating a logically empty set of log files, at the minimum consisting of an empty ib_logdata file and an ib_logfile0 that contains FILE_ID records and a FILE_CHECKPOINT record.

      If redo log archiving is enabled, when the maximum configured size of a single redo log (which is stored in the ib_logfile0 header) is reached, a new ib_logdata.%06u file will be created. The old file length must be exactly at the maximum length.
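
      Because every archived file is exactly the maximum length, a byte offset in the logical append-only log maps to a file with simple arithmetic. The sketch below assumes that offset 0 is the first byte of ib_logdata.000000. The function name is hypothetical.

```python
def archive_location(offset: int, max_size: int):
    """Map a logical log offset to (archive file name, offset within that file)."""
    index, within = divmod(offset, max_size)
    return "ib_logdata.%06d" % index, within
```

      For example, with a maximum size of 1 MiB, the logical offset 3 MiB + 7 falls at byte 7 of ib_logdata.000003.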

      Writes to ib_logfile0 will not increment the LSN at all! This means that the redo log could be easily rebuilt at any LSN, and Mariabackup could write additional information to that file, removing the need for separate .delta files in incremental backups.


        1. append.c
          0.6 kB
        2. preallocate.c
          0.6 kB

              marko Marko Mäkelä