  1. MariaDB Server
  2. MDEV-14425

InnoDB redo log format for better performance




      The InnoDB redo log format is not optimal in many respects:

      • The log files are not append-only. Modern journaled file systems handle append-only files more efficiently.
      • There is no possibility to archive the redo log.
      • The LSN is based on bytes, and the log is not partitioned, making it hard to parallelize writes of concurrent mini-transaction commits.

      Let us introduce the following parameters:

      • innodb_log_max_size (in multiples of 4096 bytes; 0 (default) disables log rotation/archiving)
      • innodb_log_partitions (default=1) for allowing concurrent writes to different partitions

      Some old InnoDB redo log parameters will be removed (or deprecated and ignored):

      • innodb_log_files_in_group
      • innodb_log_write_ahead_size
      • innodb_log_checksums (if it performs well enough with hardware-assisted CRC-32C)

      The idea: Partition the log into append-only, truncate-the-start files

      Append-only files are more efficient in modern crash-safe file systems. For example, file systems can avoid writing the data twice (first to the journal, then to the data blocks).

      The ib_logfile0 file will be repurposed to support upgrades and to prevent downgrades. The size of this file will be one file system block (padded with NUL bytes). The first 512 bytes of the file will contain the following information:

      • InnoDB redo log format version identifier (in the format introduced by MySQL 5.7.9/MariaDB 10.2.2)
      • The value of innodb_log_partitions at the latest checkpoint
      • The LSN of the latest checkpoint
      • CRC-32C checksum
      • Optional: If innodb_encrypt_log, write the encryption key ID and encrypt the checkpoint LSN with that key.

      This file will be overwritten (O_DIRECT) at every redo log checkpoint.
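
      The header described above could be serialized roughly as follows. This is a minimal sketch, not a format definition: the field offsets, the names checkpoint_header, make_header and header_is_valid, and the placement of the checksum in the last 4 bytes are assumptions made for illustration, and the optional encryption step is omitted. The bitwise CRC-32C is a portable stand-in for the hardware-assisted variant.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
// A production implementation would use hardware-assisted instructions.
inline uint32_t crc32c(const uint8_t* data, size_t len) {
  uint32_t crc = 0xFFFFFFFFu;
  for (size_t i = 0; i < len; i++) {
    crc ^= data[i];
    for (int k = 0; k < 8; k++)
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
  }
  return ~crc;
}

// Store an integer in big-endian order, like InnoDB's mach_write_to_N().
inline void write_be(uint8_t* p, uint64_t v, int bytes) {
  assert(bytes >= 1 && bytes <= 8);
  for (int i = bytes - 1; i >= 0; i--) { p[i] = uint8_t(v); v >>= 8; }
}

inline uint64_t read_be(const uint8_t* p, int bytes) {
  uint64_t v = 0;
  for (int i = 0; i < bytes; i++) v = (v << 8) | p[i];
  return v;
}

// Hypothetical layout (an assumption, not taken from the ticket):
//   bytes 0..3    format version identifier
//   bytes 4..7    innodb_log_partitions at the latest checkpoint
//   bytes 8..15   LSN of the latest checkpoint
//   bytes 508..511 CRC-32C of bytes 0..507
using checkpoint_header = std::array<uint8_t, 512>;

inline checkpoint_header make_header(uint32_t format, uint32_t partitions,
                                     uint64_t checkpoint_lsn) {
  checkpoint_header h{};  // zero-filled, so unused bytes are NUL padding
  write_be(&h[0], format, 4);
  write_be(&h[4], partitions, 4);
  write_be(&h[8], checkpoint_lsn, 8);
  write_be(&h[508], crc32c(h.data(), 508), 4);
  return h;
}

inline bool header_is_valid(const checkpoint_header& h) {
  return read_be(&h[508], 4) == crc32c(h.data(), 508);
}
```

      At startup, a header that fails header_is_valid() would be treated as a corrupted or foreign ib_logfile0.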

      The redo log will be partitioned into files named innodb_%u.log, numbered 0..innodb_log_partitions-1. The log file for each write will be chosen by pthread_self() % innodb_log_partitions.

      If innodb_log_max_size is not at the default (0), then as soon as one log file would exceed the maximum size, all log files will be rotated by renaming to innodb_%u.%06u and by creating empty innodb_%u.log files.
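
      The naming scheme and the choice of partition can be sketched as below. The function names and the meaning of the second %u as a rotation counter ("generation") are assumptions. Because pthread_t is an opaque type, this sketch hashes std::thread::id instead of computing pthread_self() % innodb_log_partitions directly.

```cpp
#include <cassert>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

// Name of the active log file of a partition: innodb_%u.log
inline std::string log_file_name(unsigned partition) {
  char buf[32];
  std::snprintf(buf, sizeof buf, "innodb_%u.log", partition);
  return buf;
}

// Name of a rotated log file: innodb_%u.%06u
// "generation" is a hypothetical name for the rotation counter.
inline std::string rotated_file_name(unsigned partition, unsigned generation) {
  char buf[40];
  std::snprintf(buf, sizeof buf, "innodb_%u.%06u", partition, generation);
  return buf;
}

// Portable stand-in for pthread_self() % innodb_log_partitions.
inline unsigned choose_partition(unsigned partitions) {
  assert(partitions > 0);
  size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id());
  return unsigned(h % partitions);
}
```

      A thread keeps writing to the same partition for its whole lifetime, which keeps the mini-transactions of one connection ordered within a single file.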

      If innodb_log_max_size=0 (the default), then at checkpoint, the start of each innodb_%u.log file will be discarded by punching a hole from 0 to the block that contains the first record at or after the checkpoint LSN. If the file system does not support hole-punching, then at the start of the file a header will be written that points to the first block.

      The log block format will be redesigned. The log block header may contain the following:

      • Log block size (the physical block size of the system that wrote the block)
      • Checksum
      • Pointer to the first MLOG_START record (or 0 if there is no such record in the block)
      • The smallest log block size is 512 bytes.
      • All-zero log blocks are silently ignored (treated as 512 bytes)
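
      A reader that walks variable-sized blocks and skips all-zero padding could look like the sketch below. The header layout is not defined yet, so the assumption here is that the first 4 bytes of a block hold its size in big-endian order; inspect_block and non_empty_blocks are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t MIN_BLOCK = 512;  // smallest log block size

struct block_info {
  size_t size;  // length of the block in bytes
  bool empty;   // true for all-zero padding
};

// Assumed header: bytes 0..3 hold the block size, big-endian.
inline block_info inspect_block(const uint8_t* p) {
  assert(p != nullptr);
  if (std::all_of(p, p + MIN_BLOCK, [](uint8_t b) { return b == 0; }))
    return {MIN_BLOCK, true};  // all-zero block: treated as 512 bytes
  size_t size = (size_t(p[0]) << 24) | (size_t(p[1]) << 16) |
                (size_t(p[2]) << 8) | size_t(p[3]);
  return {size, false};
}

// Return the offsets of all non-empty blocks in an in-memory log image.
inline std::vector<size_t> non_empty_blocks(const std::vector<uint8_t>& log) {
  std::vector<size_t> offsets;
  for (size_t pos = 0; pos + MIN_BLOCK <= log.size();) {
    block_info b = inspect_block(&log[pos]);
    // Stop at a corrupted size field or at a truncated block.
    if (b.size < MIN_BLOCK || b.size > log.size() - pos) break;
    if (!b.empty) offsets.push_back(pos);
    pos += b.size;
  }
  return offsets;
}
```

      Storing the block size in every header lets a log that was written on a system with 4096-byte blocks be read on a system with 512-byte blocks, and vice versa.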

      Mini-transactions will always start with an MLOG_START(lsn) entry. The lsn is a global sequence number that is atomically incremented whenever mtr_t::commit() is about to write redo log.

      For operations on clustered indexes, the MLOG_START entry could be followed by the user transaction start ID (DB_TRX_ID) so that the changes could be filtered by transaction.

      Log writing and synchronous flushing

      For the bulk of the changes done by mini-transactions, we do not care about flushing. The file system can write log file blocks as it pleases.

      Some state changes of the database must be made durable at a specific time. Examples include user transaction COMMIT, XA PREPARE, XA ROLLBACK, and (in case the binlog is not enabled) XA COMMIT.

      Whenever we want to make a certain change durable, we must flush all log files up to the LSN of the mini-transaction commit that made the change. While doing this, we can pad each log file to the file system block size, so that the file system can always write full blocks. The padding could also be helpful when trying to resurrect a corrupted redo log file.
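
      The padding is plain round-up arithmetic. A minimal sketch, with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Round a file position up to the next multiple of the file system block size.
inline uint64_t pad_to_block(uint64_t pos, uint64_t block_size) {
  assert(block_size > 0);
  return (pos + block_size - 1) / block_size * block_size;
}

// Number of padding bytes to append before a synchronous flush.
inline uint64_t padding_needed(uint64_t pos, uint64_t block_size) {
  return pad_to_block(pos, block_size) - pos;
}
```

      For example, a log file that ends at offset 5000 on a file system with 4096-byte blocks would be padded with 3192 bytes, so that the flush ends exactly at offset 8192.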

      If redo log is additionally replicated to the buffer pools of slaves (like in PolarDB), then we should first write to the redo log files and only then to the slaves, and we should assume that the writes to the files will always eventually be durable. If that assumption is broken, then all servers would have to be restarted and perform crash recovery.

      Crash recovery

      1. Read checkpoint_lsn from ib_logfile0.
      2. In each redo log file: Find the first MLOG_START record with lsn >= checkpoint_lsn.
      3. Process each redo log up to the end.

      The log_sys->lsn will be initialized to the maximum LSN that was found.

      The MLOG_START records will be found by scanning the redo log blocks from a start offset onwards until a qualifying record is found.
      On checkpoint, the start of each log file may be truncated until a block that is near the MLOG_START(checkpoint_lsn), so that we will not have to scan from the start of each file.
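
      The recovery steps can be modelled in memory as below. This is a simplified sketch under two assumptions that are not in the ticket: each log file is represented only by the ascending list of its MLOG_START LSNs, and the block scan is replaced by a binary search (std::lower_bound) over that list. The names log_file, recovery_result and recover are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = uint64_t;

// Simplified model: one log file is the ascending list of its MLOG_START LSNs.
using log_file = std::vector<lsn_t>;

struct recovery_result {
  std::vector<lsn_t> to_apply;  // mini-transactions to replay, in LSN order
  lsn_t max_lsn;                // initial value for log_sys->lsn
};

inline recovery_result recover(const std::vector<log_file>& logs,
                               lsn_t checkpoint_lsn) {
  recovery_result r{{}, checkpoint_lsn};
  for (const log_file& log : logs) {
    assert(std::is_sorted(log.begin(), log.end()));
    // Step 2: first MLOG_START record with lsn >= checkpoint_lsn.
    auto first = std::lower_bound(log.begin(), log.end(), checkpoint_lsn);
    // Step 3: process the file from there up to the end.
    r.to_apply.insert(r.to_apply.end(), first, log.end());
  }
  // Merge the partitions into one global order.
  std::sort(r.to_apply.begin(), r.to_apply.end());
  if (!r.to_apply.empty()) r.max_lsn = r.to_apply.back();
  return r;
}
```

      If no record at or after the checkpoint is found, the sketch falls back to checkpoint_lsn as the starting LSN.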

      Ordering of mini-transaction commits

      Should we require the LSN to be contiguous from checkpoint_lsn to the end? This matters in a scenario where the server was killed while multiple log files were being written concurrently, or multiple mtr_t::commit() were executing at the same time.

      If we allowed gaps in the LSNs, we would essentially be implying a partial ordering of mini-transaction commits. This would simplify the recovery algorithm and allow us to recover more mini-transactions from the redo log. But it would not allow us to recover any more user transaction commits, because synchronous log flushing would guarantee the files to be in sync with each other. It would also break correctness in the following scenario:

      1. Mini-transaction m1 updates page A via log file 1.
      2. Mini-transaction m2 updates page A via log file 2.
      3. The write to log file 2 is completed, but log file 1 was not written yet.
      4. The system is killed.
      5. Recovery sees the commit LSN of m2 but does not see m1, whose LSN would have been smaller.
      6. If recovery ignores the gap of m1.LSN and applies the change of m2, page A may be inconsistent, because it missed the earlier change by m1.
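
      If gaps are not allowed, recovery has to stop at the first missing LSN. A minimal sketch of that check, assuming that every mini-transaction commit increments the LSN by exactly 1 (first_missing_lsn is a hypothetical name):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using lsn_t = uint64_t;

// Return the first LSN at or after checkpoint_lsn that is missing from the
// records recovered from all log files. Records with an LSN greater than or
// equal to the returned value must not be applied.
inline lsn_t first_missing_lsn(std::vector<lsn_t> recovered,
                               lsn_t checkpoint_lsn) {
  std::sort(recovered.begin(), recovered.end());
  lsn_t expected = checkpoint_lsn;
  for (lsn_t lsn : recovered) {
    if (lsn < expected) continue;  // older than the checkpoint, or a duplicate
    if (lsn != expected) break;    // gap: a commit never reached its log file
    expected++;
  }
  return expected;
}
```

      In the scenario above, with m1 at LSN 10 lost and m2 at LSN 11 present, first_missing_lsn({11}, 10) returns 10, so m2 is not applied and page A stays consistent.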

      This scenario could be prevented with additional logic in the mini-transactions, adding fsync() of the ‘related log files’. But it would necessarily slow down the log writing, defeating the purpose of partitioning the log.

      It could make sense to introduce a disaster recovery option for ignoring LSN gaps between files. A possibility (after fixing MDEV-12700) would be a special value innodb_read_only=2 which would recover the database while ignoring the LSN gaps.

      Page flushing

      Before writing any dirty page to a data file, ensure that all log files have been flushed up to the page LSN.
      Thanks to this, recovery will not require contiguous LSN in all log files.
      Thanks to this, a persistent mtr_t::commit() (which changes user transaction state) will only have to flush the current redo log file, not all log files.
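
      The page flushing rule could be checked as sketched below. The per-file bookkeeping (log_state with written_lsn and durable_lsn) is an assumption for illustration. The check is conservative: any file with an unflushed tail whose durable LSN is below the page LSN is considered pending.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using lsn_t = uint64_t;

// Hypothetical per-file bookkeeping (not from the ticket).
struct log_state {
  lsn_t written_lsn;  // LSN of the last record appended to the file buffer
  lsn_t durable_lsn;  // LSN of the last record known to be on stable storage
};

// Indexes of the log files that must be flushed before a page with the
// given page LSN may be written to its data file.
inline std::vector<size_t> logs_to_flush(const std::vector<log_state>& logs,
                                         lsn_t page_lsn) {
  std::vector<size_t> pending;
  for (size_t i = 0; i < logs.size(); i++) {
    assert(logs[i].durable_lsn <= logs[i].written_lsn);
    bool has_unflushed = logs[i].durable_lsn < logs[i].written_lsn;
    // The unflushed tail may hold a record with lsn <= page_lsn.
    if (has_unflushed && logs[i].durable_lsn < page_lsn) pending.push_back(i);
  }
  return pending;
}

inline bool page_write_allowed(const std::vector<log_state>& logs,
                               lsn_t page_lsn) {
  return logs_to_flush(logs, page_lsn).empty();
}
```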

      Log checkpoint

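        // Overwrite the checkpoint block in ib_logfile0.
        pwrite(ib_logfile0, page_with_checkpoint_lsn, 4096, 0);
        if (innodb_log_max_size) {
          // Log rotation is enabled: record the start block in each file header.
          for (auto log : logs) {
            mach_write_to_8(log_header, log->start_block(checkpoint_lsn));
            pwrite(log->fd, log_header, 4096, 0);
          }
        } else {
          // Discard the start of each file by punching a hole.
          for (auto log : logs) {
            uint64_t offs = log->start_block(checkpoint_lsn);
            // FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE.
            fallocate(log->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, offs);
          }
        }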

      The new function

      uint64_t log_file_t::start_block(lsn_t) const

      returns the file offset of the block that contains the MLOG_START(lsn) record (a record type introduced in this work), or an offset that is not much smaller than that. It will likely require an additional main-memory data structure, because we might reuse log_file_t::buffer entries after they have been written to the redo log but before the log checkpoint is initiated.

      Log buffering

      For multi-page mini-transactions (such as B-tree page split or merge), we can have a local redo log record buffer in mtr_t, similar to the current mtr_t::log. This buffer would be copied to the log file buffer at mtr_t::commit().

      Short or single-page mini-transactions can directly write the log records to the log file buffer (avoid copying and heap memory allocation).


      void mtr_t::commit(bool sync = false)
      {
        auto [log_file, lsn] = log_sys->acquire_file(); // Atomic or using log_sys->mutex
        write_mlog_start(mtr_log, lsn); // First record of each mini-transaction
        auto pos = log_file->append_and_release(mtr_log, len);
        if (sync) log_file->flush(pos);
      }

      The function log_file_t::flush(uint64_t file_pos) will (encrypt and) write the buffered log up to file_pos to the file.
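
      A runnable in-memory model of this commit path is sketched below. It is an illustration under stated assumptions, not the proposed implementation: the partition is passed in explicitly, the buffer is a vector of (lsn, payload) pairs, flush() only advances a counter instead of writing and syncing, and the LSN is taken while the partition mutex is held so that LSNs are ascending within each partition.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using lsn_t = uint64_t;

// Simplified model of one log partition (hypothetical, in-memory only).
struct log_file_t {
  std::mutex mutex;  // serializes appends to this partition
  std::vector<std::pair<lsn_t, std::string>> buffer;  // (MLOG_START lsn, log)
  size_t flushed = 0;  // number of buffer entries already "written"

  void flush(size_t pos) {
    std::lock_guard<std::mutex> g(mutex);
    if (pos > flushed) flushed = pos;  // a real flush would write and sync here
  }
};

struct log_sys_t {
  std::atomic<lsn_t> lsn{0};
  std::vector<log_file_t> files;
  explicit log_sys_t(size_t partitions) : files(partitions) {}
};

// Model of mtr_t::commit(): acquire the partition, take the next LSN,
// append the mini-transaction log, release, and optionally flush.
inline lsn_t commit(log_sys_t& sys, size_t partition, std::string mtr_log,
                    bool sync = false) {
  assert(partition < sys.files.size());
  log_file_t& f = sys.files[partition];
  size_t pos;
  lsn_t lsn;
  {
    std::lock_guard<std::mutex> g(f.mutex);
    lsn = sys.lsn.fetch_add(1) + 1;  // global sequence number
    f.buffer.emplace_back(lsn, std::move(mtr_log));
    pos = f.buffer.size();
  }
  if (sync) f.flush(pos);
  return lsn;
}
```

      Only the partition mutex is held during the append, so commits on different partitions contend only on the atomic LSN counter.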

      To be defined

      Which data structure to use for

      uint64_t log_file_t::start_block(lsn_t) const

      and which accuracy to aim for? Maybe 1MiB granularity would be enough? Then, just remember the start LSN of each 1MiB segment since the latest checkpoint was written?
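
      One possible answer, sketched under the 1 MiB assumption: remember, for each 1 MiB segment in which a mini-transaction starts, the LSN of the first MLOG_START record. The class name start_block_index and its methods are hypothetical, and the sketch assumes that LSNs are ascending within one log file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

using lsn_t = uint64_t;
constexpr uint64_t SEGMENT = 1 << 20;  // 1 MiB granularity

class start_block_index {
  struct entry { lsn_t first_lsn; uint64_t offset; };
  std::vector<entry> entries_;  // ascending in both fields

  // First entry whose first_lsn is greater than lsn.
  std::vector<entry>::const_iterator after(lsn_t lsn) const {
    return std::upper_bound(entries_.begin(), entries_.end(), lsn,
        [](lsn_t l, const entry& e) { return l < e.first_lsn; });
  }

public:
  // Called when MLOG_START(lsn) is appended at the given file offset.
  void note_append(lsn_t lsn, uint64_t offset) {
    uint64_t seg = offset - offset % SEGMENT;
    assert(entries_.empty() || entries_.back().first_lsn < lsn);
    if (entries_.empty() || entries_.back().offset < seg)
      entries_.push_back({lsn, seg});
  }

  // Offset at or before the first record with an LSN >= lsn.
  uint64_t start_block(lsn_t lsn) const {
    auto it = after(lsn);
    if (it == entries_.begin())
      return entries_.empty() ? 0 : entries_.front().offset;
    return std::prev(it)->offset;
  }

  // Forget segments that lie entirely before the checkpoint.
  void checkpoint(lsn_t lsn) {
    auto it = after(lsn);
    if (it != entries_.begin())
      entries_.erase(entries_.begin(), std::prev(it));
  }
};
```

      With this granularity, recovery scans at most about 1 MiB of each file before reaching the first qualifying record, and the index holds one entry per MiB written since the latest checkpoint.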


      Attachments: append.c (0.6 kB), preallocate.c (0.6 kB)




              kevg Eugene Kosov
              marko Marko Mäkelä