[MDEV-14425] Change the InnoDB redo log format to reduce write amplification - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Blocker
Resolution: Fixed
Fix Version/s: 10.8.1
Component/s: Encryption, mariabackup, Storage Engine - InnoDB
Labels:
- Preview_10.8
- performance

Description

The InnoDB redo log format is not optimal in many respects:

At the start of ib_logfile0, there are two log checkpoint blocks, only 1024 bytes apart, while there exist devices with 4096-byte block size. The rest of the log file is written in a circular fashion.
On log checkpoint, some file name information needs to be appended to the log.
File names that were first changed since the latest checkpoint must be appended to the log. The bookkeeping causes some contention on log_sys.mutex and fil_system.mutex. Edit: The contention on fil_system.mutex was practically removed in ~~MDEV-23855~~, and the contention on log_sys.mutex due to this is minimal.
The log file was unnecessarily split into multiple files, logically treated as one big circular file. (~~MDEV-20907~~ in MariaDB Server 10.5.0 changed the default to 1 file, and later the parameter was deprecated and ignored.)
Log records are divided into tiny blocks of 512 bytes, with 12+4 bytes of header and footer (12+8 bytes with ~~MDEV-12041~~ innodb_encrypt_log (10.4.0)).
We are holding a mutex while zero-filling unused parts of log blocks, encrypting log blocks, or computing checksums.
We were holding an exclusive latch while copying log blocks; this was fixed in ~~MDEV-27774~~.
Mariabackup cannot copy the log without having access to the encryption keys. (It can copy data file pages without encrypting them.)
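
To put the block framing overhead mentioned above into numbers, here is a small Python sketch. The function name is hypothetical; the sizes (512-byte blocks, 12+4 or 12+8 bytes of header and footer) are taken from the list above.

```python
BLOCK_SIZE = 512  # size of an old-format redo log block, in bytes


def framing_overhead(header: int, footer: int, block_size: int = BLOCK_SIZE) -> float:
    """Return the fraction of each log block consumed by header and footer."""
    return (header + footer) / block_size


plain = framing_overhead(12, 4)      # unencrypted log: 12+4 bytes per block
encrypted = framing_overhead(12, 8)  # innodb_encrypt_log: 12+8 bytes per block

print(f"plain: {plain:.4%} of each block, encrypted: {encrypted:.4%}")
```

This only counts the framing bytes; it does not model the cost of zero-filling or rewriting partially filled blocks.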

We had some ideas to move to an append-only file and to partition the log into multiple files, but it turned out that a single fixed-size circular log file would perform best in typical scenarios.

To address the fil_system.mutex contention whose root cause was later fixed in ~~MDEV-23855~~, we considered splitting the log as follows:

ib_logfile0 (after the 512-byte header) will be append-only, unencrypted, for records containing file names and checkpoint information. A checkpoint record will comprise an LSN and a byte offset in a separate, optionally encrypted, circular log file ib_logdata. The length of each record is explicitly tagged and the payload will be followed by CRC-32C.
The ib_logdata file can be append-only or circular. If it is circular, its fixed size must be an integer multiple of 512 bytes.

One problem would have had to be solved: When would the ib_logfile0 be shrunk? No storage is unlimited.

We will retain the ib_logfile0 and the basic format of its first 512 bytes for compatibility purposes, but other features could be improved.

We remove log block headers and footers. All we really need is to detect the logical end of the circular log. That can be achieved by making sure that mini-transactions are terminated by a sequence number (at least one bit) and a checksum. When the circular file wraps around, the sequence number will be incremented (or the sequence bit toggled).
For page-aligned I/O, we allow dummy records to be written, to indicate that the next bytes (until the end of the physical block, no matter what the I/O block size is) must be ignored. (The log parser will ignore these padding records, but we do not currently write them; we will keep overwriting the last physical block until it has been completely filled like we used to do until now.)
Encrypt and compute checksum on mtr_t::m_log before initiating a write to the circular log file. The log can be copied and checksum validated without access to encryption keys.
If the log is on a memory-mapped persistent memory device, then we will make log_sys.buf point directly to the persistent memory.

Some old InnoDB redo log parameters were removed in ~~MDEV-23397~~ (MariaDB 10.6.0). Some more parameters will be removed or changed here:

innodb_log_write_ahead_size: Removed. On Linux and Microsoft Windows, we will detect and use the physical block size of the underlying storage. We will also remove the log_padded counter from INFORMATION_SCHEMA.INNODB_METRICS.
innodb_log_file_buffering: Added (~~MDEV-28766~~). This controls the use of O_DIRECT on the ib_logfile0 when the physical block size can be determined.
innodb_log_buffer_size: The minimum value is raised to 2MiB and the granularity increased from 1024 to 4096 bytes. This buffer will also be used during recovery. Ignored when the log is memory-mapped (on PMEM or /dev/shm).
innodb_log_file_size: The allocation granularity is reduced from 1MiB to 4KiB.

Some global variables will be adjusted as well:

Innodb_os_log_fsyncs: Removed. This will be included in Innodb_data_fsyncs.
Innodb_os_log_pending_fsyncs: Removed. This was limited to at most 1 by design.
Innodb_log_pending_writes: Removed. This was limited to at most 1 by design.

The circular log file `ib_logfile0`

The file name ib_logfile0 and the existing format of the first 512 bytes will be retained for the purpose of upgrading and preventing downgrading. In the first 512 bytes of the file, the following information will be present:

InnoDB redo log format version identifier (in the format introduced by MySQL 5.7.9/MariaDB 10.2.2)
CRC-32C checksum

After the first 512 bytes, there will be two 64-byte checkpoint blocks at the byte offsets 4096 and 8192, containing:

The checkpoint LSN
The LSN at the time the checkpoint was created, pointing to an optional sequence of FILE_MODIFY records and a FILE_CHECKPOINT record

The circular redo log record area starts at offset 12288 and extends to the end of the file. Unless the file was created by mariadb-backup, the file size will be a multiple of 4096 bytes.
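
The layout described above can be illustrated with a short Python sketch that maps an LSN to a byte offset in the circular area and derives a sequence bit that toggles on every wrap-around. The names `first_lsn`, `lsn_to_offset` and `sequence_bit` are hypothetical, and the choice that the first pass through the file uses the bit value 1 is an assumption made for this illustration, not something stated in this description.

```python
HEADER_SIZE = 512                  # format version identifier and checksum
CHECKPOINT_OFFSETS = (4096, 8192)  # two 64-byte checkpoint blocks
START_OFFSET = 12288               # start of the circular record area


def capacity(file_size: int) -> int:
    """Number of bytes available for log records in the circular area."""
    return file_size - START_OFFSET


def lsn_to_offset(lsn: int, first_lsn: int, file_size: int) -> int:
    """Map an LSN to a file offset; first_lsn (assumed) maps to START_OFFSET."""
    return START_OFFSET + (lsn - first_lsn) % capacity(file_size)


def sequence_bit(lsn: int, first_lsn: int, file_size: int) -> int:
    """Toggle on every wrap-around; assumption: the first pass uses 1."""
    wraps = (lsn - first_lsn) // capacity(file_size)
    return 1 - wraps % 2


file_size = START_OFFSET + 2 * 4096  # a deliberately tiny log file
first_lsn = 100000                   # arbitrary value for the example

for lsn in (first_lsn, first_lsn + 8191, first_lsn + 8192):
    print(lsn, lsn_to_offset(lsn, first_lsn, file_size),
          sequence_bit(lsn, first_lsn, file_size))
```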

All writes to ib_logfile0 will be synchronous and durable (O_DSYNC, fdatasync() or O_SYNC, fsync() or pmem_persist()).

Payload encoding

The payload area will contain records in the ~~MDEV-12353~~ format. Each mini-transaction will be followed by a sequence byte 0x00 or 0x01 (the value of the sequence bit), optionally (if the log is encrypted) an 8-byte nonce, and a CRC-32C of all the bytes (except the sequence byte), so that backup can avoid recomputing the checksum while copying the log to a new file.
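
As a rough illustration of this framing, the following Python sketch appends a sequence byte, an optional nonce and a CRC-32C to a run of record bytes, and then validates such a frame. It is a sketch under assumptions, not the server's implementation: the helper names are hypothetical, the checksum is assumed to cover the record bytes plus the nonce, the CRC is assumed to be stored big-endian, and the caller is assumed to already know the length of the records (the real parser derives it from the record headers).

```python
import struct


def _make_table() -> list:
    """Build the lookup table for CRC-32C (reflected polynomial 0x82F63B78)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Bytewise table-driven CRC-32C (Castagnoli)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def frame_mtr(records: bytes, seq_bit: int, nonce: bytes = b"") -> bytes:
    """Lay out: records | sequence byte | optional 8-byte nonce | CRC-32C."""
    if seq_bit not in (0, 1) or len(nonce) not in (0, 8):
        raise ValueError("bad sequence bit or nonce length")
    checksum = crc32c(records + nonce)  # the sequence byte is excluded
    return records + bytes([seq_bit]) + nonce + struct.pack(">I", checksum)


def check_mtr(frame: bytes, records_len: int, expected_seq: int,
              encrypted: bool = False) -> bool:
    """Return True if the frame has the expected sequence bit and checksum."""
    nonce_len = 8 if encrypted else 0
    end = records_len + 1 + nonce_len + 4
    if len(frame) < end:
        return False
    records = frame[:records_len]
    seq = frame[records_len]
    nonce = frame[records_len + 1:records_len + 1 + nonce_len]
    (stored,) = struct.unpack(">I", frame[end - 4:end])
    return seq == expected_seq and stored == crc32c(records + nonce)


frame = frame_mtr(b"\x35\x01\x02", seq_bit=1)
print(frame.hex(), check_mtr(frame, 3, expected_seq=1))
```

A reader that encounters a frame with the wrong sequence bit or a mismatching checksum would treat that position as the logical end of the log.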

We want to be able to avoid overwriting the last log block, so we cannot have an explicit 'end of log' marker. We must associate each mini-transaction (atomic sequence of log records) with a sequence number (at the minimum, a sequence bit) and a checksum. The 4-byte CRC-32C is a good candidate, because it is already being used in data page checksums.

Padding

We might want to introduce a special mini-transaction 'Skip the next N bytes', encoded in sizeof(CRC)+2+log(N) bytes: CRC, record type and length, subtype and the value of the sequence bit, and variable-length encoded N. However, for a compressed storage device, it would be helpful to not have any garbage bytes in the log file. It would be better to initialize all those N bytes.

If we need to pad a block with fewer bytes than the minimum size, we would write a record to skip the minimum size.

This has been implemented with arbitrary-length FILE_CHECKPOINT mini-transactions whose payload consists of NUL bytes. The parser will ignore such records. We are not currently writing such records, but instead overwriting the last incomplete log block when more log is being appended, just like InnoDB always did.

Mini-transaction encoding: Prepending or appending a CRC to each MDEV-12353 mini-transaction

In the ~~MDEV-12353~~ encoding, a record cannot start with the bytes 0x00 or 0x01. Mini-transactions are currently being terminated by the byte 0x00. We could store the sequence bit in the terminating byte of the mini-transaction. The checksum would exclude the terminating byte.

Only the payload bytes would be encrypted (not record types or lengths, and not page identifiers either). In that way, records can be parsed and validated efficiently. Decryption would only have to be invoked when the log really needs to be applied on the page. The initialization vector for encryption and decryption can include the unencrypted record header bytes.

It could be best to store the CRC before the mini-transaction payload, because the CRC of non-zero bytes cannot be 0. Hence, we can detect the end of the log without even parsing the mini-transaction bytes.

Pros: Minimal overhead: sizeof(CRC) bytes per mini-transaction.
Cons: Recovery may have to parse a lot of log before determining that the end of the log was reached.

In the end, the CRC was written after the mini-transaction. The log parser can flag an inconsistency if the maximum mini-transaction size would be exceeded.

Alternative encoding (scrapped idea): Prepending a mini-transaction header with length and CRC

We could encapsulate ~~MDEV-12353~~ records (without the mini-transaction terminating NUL byte) in the following structure:

variable-length encoded integer of total_length << 2 | sequence_bit
CRC of the data payload and the variable-length encoded integer
the data payload (~~MDEV-12353~~ records); could be encrypted in their entirety

Skipped bytes (at least 5) would be indicated by the following:

variable-length encoded integer of skipped_length << 2 | 1 << 1 | sequence_bit
CRC of the variable-length encoded integer (not including the skipped bytes)
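
For illustration, the header word of this scrapped encoding could be packed and unpacked as in the Python sketch below. The variable-length integer encoding used here is a generic 7-bits-per-byte scheme chosen only for the example; it is an assumption and not the encoding of the ~~MDEV-12353~~ format. The function names are hypothetical, and the CRC that would follow the header is omitted.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> Tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    value = shift = 0
    for i, byte in enumerate(data):
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, i + 1
        shift += 7
    raise ValueError("truncated varint")


def pack_header(length: int, seq_bit: int, skip: bool = False) -> bytes:
    """Pack length << 2 | skip flag << 1 | sequence bit."""
    return encode_varint(length << 2 | int(skip) << 1 | seq_bit)


def unpack_header(data: bytes) -> Tuple[int, bool, int, int]:
    """Return (length, skip flag, sequence bit, header bytes consumed)."""
    word, used = decode_varint(data)
    return word >> 2, bool(word & 2), word & 1, used


print(pack_header(31, 1).hex(), pack_header(32, 0).hex())
print(unpack_header(pack_header(300, 0, skip=True)))
```

With this particular encoding, any length below 32 still fits in a single header byte, because the two flag bits leave five bits for the length.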

Pros: Recovery can determine more quickly that the end of the circular log was reached, thanks to the length, sequence bit and (nonzero) CRC being stored at the start.
Pros: More of the log could be encrypted (at the cost of recovery and backup restoration speed)
Cons: Increased storage overhead: sizeof(CRC)+log(length * 4) bytes. For length<32 bytes, no change of overhead.
Cons: If the encryption is based on the current LSN, then both encryption and the checksum would have to be computed while holding log_sys.mutex.

Log writing and synchronous flushing

For the bulk of the changes done by mini-transactions, we do not care about flushing. The file system can write log file blocks as it pleases.

Some state changes of the database must be made durable at a specific time. Examples include user transaction COMMIT, XA PREPARE, XA ROLLBACK, and (in case the binlog is not enabled) XA COMMIT.

Whenever we want to make a certain change durable, we must flush all log files up to the LSN of the mini-transaction commit that made the change.
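
The rule above can be sketched in Python as follows. This is a simplified, single-threaded illustration with hypothetical names (`LogFlusher`, `flush_up_to`); it treats the LSN as a plain byte count and uses `os.fsync()` on a temporary file, whereas the server chooses among several durability mechanisms.

```python
import os
import tempfile


class LogFlusher:
    """Track how far the log has been written and made durable."""

    def __init__(self, fd: int) -> None:
        self.fd = fd
        self.written_lsn = 0   # end of the log handed to the file system
        self.flushed_lsn = 0   # end of the log known to be durable
        self.sync_calls = 0

    def append(self, data: bytes) -> int:
        """Write log bytes without forcing them to storage; return the end LSN."""
        self.written_lsn += os.write(self.fd, data)
        return self.written_lsn

    def flush_up_to(self, lsn: int) -> None:
        """Make everything up to lsn durable; do nothing if it already is."""
        if lsn <= self.flushed_lsn:
            return
        target = self.written_lsn
        os.fsync(self.fd)
        self.sync_calls += 1
        self.flushed_lsn = target


fd, path = tempfile.mkstemp()
try:
    log = LogFlusher(fd)
    lsn_a = log.append(b"change A")  # bulk change, no durability needed yet
    lsn_b = log.append(b"commit B")  # a COMMIT that must become durable
    log.flush_up_to(lsn_b)           # one sync covers both writes
    log.flush_up_to(lsn_a)           # already durable: no extra sync
finally:
    os.close(fd)
    os.remove(path)

print(log.flushed_lsn, log.sync_calls)
```

The design point is that a single synchronous flush for a later LSN also covers all earlier mini-transactions, so requests for smaller LSNs become no-ops.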

If redo log is physically replicated to the buffer pools of physical replicas (like in Amazon Aurora or Alibaba PolarDB), then we should first write to the local log and only then to the replicas, and we should assume that the writes to the files will always eventually be durable. If that assumption is broken, then all servers would have to be restarted and perform crash recovery.

Crash recovery and backup

The previous two-stage parsing (log block validation and log record parsing) was replaced with a single stage. The separate 2-megabyte buffer recv_sys.buf is no longer needed, because the bytes of the log records will be stored contiguously, except when the log file wraps around from its end to the offset 12,288.

When the log file is memory-mapped, we will parse records directly from log_sys.buf that contains a view of the entire log file. For parsing the mini-transaction that wraps from the end of the file to the start, the record parser will use a special pointer wrapper. When not using memory-mapping, we will read from the log file to log_sys.buf in such a way that the records of each mini-transaction will be contiguous.
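
The wrap-around case can be illustrated with a small Python sketch. Note that it copies the bytes into a contiguous buffer, which is simpler than the pointer wrapper mentioned above for the memory-mapped case; the function name `read_wrapped` is hypothetical.

```python
START_OFFSET = 12288  # start of the circular record area in ib_logfile0


def read_wrapped(log: bytes, offset: int, length: int) -> bytes:
    """Copy length bytes starting at offset, wrapping from the end of the
    file back to START_OFFSET so that the result is contiguous."""
    if not START_OFFSET <= offset < len(log):
        raise ValueError("offset outside the circular area")
    if length > len(log) - START_OFFSET:
        raise ValueError("length exceeds the capacity of the circular area")
    tail = log[offset:offset + length]    # bytes up to the end of the file
    remaining = length - len(tail)        # bytes that wrapped around
    return tail + log[START_OFFSET:START_OFFSET + remaining]


# A toy "file": a zero-filled header area followed by 8 bytes of log.
log_file = bytes(START_OFFSET) + b"ABCDEFGH"
print(read_wrapped(log_file, START_OFFSET + 6, 4))
```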

Crash-upgrade from earlier versions will not be supported. Before upgrading, the old server must have been shut down, or mariadb-backup --prepare must have been executed using an appropriate older version of the backup tool.

Starting up without ib_logfile0 will no longer be supported; see also ~~MDEV-27199~~.

Attachments

- 81cf92e9471.pdf (29 kB, 2022-01-12 08:37)
- append.c (0.6 kB, 2018-02-02 09:41)
- MDEV-14425.pdf (29 kB, 2021-12-16 12:14)
- NUMA_1.pdf (37 kB, 2022-01-17 10:04)
- NUMA_1vs2.pdf (29 kB, 2022-01-17 10:04)
- NUMA_2.pdf (38 kB, 2022-01-17 10:04)
- preallocate.c (0.6 kB, 2018-02-02 09:41)

Issue Links

blocks

MDEV-14462 Confusing error message: ib_logfiles are too small for innodb_thread_concurrency=0

Closed

causes

MDEV-27621 Backup fails with FATAL ERROR: Was only able to copy log from .. to .., not ..

Closed

MDEV-27787 mariadb-backup --backup is allocating extra memory for log records

Closed

MDEV-27790 Fix mis-matched braces for non-Linux targets (fails to build)

Closed

MDEV-27916 InnoDB ignores log write errors

Closed

MDEV-27939 Log buffer wrap-around errors on PMEM

Closed

MDEV-28879 Assertion `l->lsn <= log_sys.get_lsn()' failed around recv_recover_page

Closed

MDEV-28994 Backup produces garbage when using memory-mapped log (PMEM)

Closed

MDEV-29555 ASAN heap-buffer-overflow in mariabackup.huge_lsn,strict_full_crc32

Closed

MDEV-31791 Crash recovery in the test innodb.recovery_memory occasionally fails

Closed

MDEV-32746 SIGSEGV on recovery when using innodb_encrypt_log and PMEM

Closed

MDEV-36024 performance regression with encrypted InnoDB log

In Progress

includes

MDEV-16045 Allocate log_sys statically

Closed

is blocked by

MDEV-14545 Backup fails due to MLOG_INDEX_LOAD record

Closed

MDEV-18115 Remove dummy tablespace for the redo log

Closed

MDEV-20907 Set innodb_log_files_in_group=1 by default

Closed

MDEV-21870 Deprecate and ignore innodb_scrub_log and innodb_scrub_log_speed

Closed

is part of

MDEV-27373 Q1 2022 release merge

Closed

relates to

MDEV-12699 Improve crash recovery of corrupted data pages

Closed

MDEV-13830 Assertion failed: recv_sys->mlog_checkpoint_lsn <= recv_sys->recovered_lsn

Closed

MDEV-14481 Execute InnoDB crash recovery in the background

Closed

MDEV-14992 BACKUP: in-server backup

Open

MDEV-16232 Use fewer mini-transactions

Stalled

MDEV-16526 Overhaul the InnoDB page flushing

Closed

MDEV-17138 Reduce redo log volume for undo tablespace initialization

Closed

MDEV-18370 InnoDB: Failing assertion: lsn % OS_FILE_LOG_BLOCK_SIZE == LOG_BLOCK_HDR_SIZE in log0log.cc with innodb_scrub_log=ON and high values of innodb_scrub_log_speed

Closed

MDEV-19176 Do not run out of InnoDB buffer pool during recovery

Closed

MDEV-20474 Assertion `!recv_no_log_write' failed in log_pad_current_log_block upon server startup on a clean datadir

Closed

MDEV-20475 Assertion `flushed_lsn == log_get_lsn()' failed in srv_prepare_to_delete_redo_log_files upon server startup

Closed

MDEV-21990 Issue a message on changing deprecated innodb_log_files_in_group

Closed

MDEV-23382 Change DB_ROLL_PTR format to allow more than 128 concurrent START TRANSACTION

Open

MDEV-24023 mariabackup.innodb_redo_overwrite failed in buidbot with result length mismatch

Open

MDEV-27199 Require ib_logfile0 to exist unless innodb_force_recovery=6

Closed

MDEV-27268 Failed InnoDB initialization leaves garbage files behind

Closed

MDEV-27437 Galera snapshot transfer fails to upgrade between some major versions

Closed

MDEV-27486 Refuse Galera SST if major version of donor and joiner are different

Stalled

MDEV-27716 mtr_t::commit() unnecessarily acquires log_sys.mutex when writing no log

Closed

MDEV-27774 Reduce scalability bottlenecks in mtr_t::commit()

Closed

MDEV-27812 Allow innodb_log_file_size to change without server restart

Closed

MDEV-27848 Remove unused wait/io/file/innodb/innodb_log_file

Closed

MDEV-27917 Some redo log diagnostics is always reported as 0

Closed

MDEV-28111 Redo log writes are being buffered on Linux for no good reason

Closed

MDEV-28977 Race condition in the recovery of CREATE TABLE or table-rebuilding DDL

Closed

MDEV-31642 Upgrade from 10.7 or earlier may crash if innodb_log_file_buffering=OFF

Closed

MDEV-32445 InnoDB may corrupt its log before upgrading it on startup

Closed

MDEV-32971 Assertion !recv_sys.is_corrupt_fs() failed on recovery

Closed

MDEV-33363 CI failure: innodb.import_corrupted: Assertion failed: oldest_lsn > log_sys.last_checkpoint_lsn

Closed

MDEV-34062 mariadb-backup --backup is extremely slow at copying ib_logfile0

Closed

MDEV-34422 InnoDB writes corrupted log on macOS and AIX due to uninitialized log_sys.lsn_lock

Closed

MDEV-34483 Backup may copy unnecessarily much log

Closed

MDEV-35796 OPT_PAGE_CHECKSUM is ignored if innodb_encrypt_log=ON

Stalled

MDEV-8139 Fix scrubbing

Closed

MDEV-9905 Options for NVDIMM usage in MariaDB

Open

MDEV-11380 AliSQL: [Perf] Issue#24 SPLIT LOG BUFFER TO ROTATE LOG WRITE

Closed

MDEV-12041 Implement key rotation for innodb_encrypt_log

Closed

MDEV-12353 Efficient InnoDB redo log record format

Closed

MDEV-15914 performance regression for mass update

Closed

MDEV-16168 Performance regression on sysbench write benchmarks from 10.2 to 10.3

Closed

MDEV-18606 innodb crashes on large update and it gets corrupted

Closed

MDEV-21382 use fdatasync() for redo log where appropriate

Closed

MDEV-21923 LSN allocation is a bottleneck

Closed

MDEV-25124 benchmark 10.6 performance for PMEM enabled builds

Closed

MDEV-33894 MariaDB does unexpected storage read IO for the redo log

Closed

PERF-117

PERF-118

links to

MySQL Bug #94448 Rewrite LOG_BLOCK_FIRST_REC_GROUP during recovery may be dangerous.

podman feature request to make /sys/dev/block available for O_DIRECT size determination

(7 causes, 1 includes, 4 is blocked by, 1 is part of, 48 relates to, 2 links to, 1 mentioned in)

Activity

Matthias Leich added a comment - 2022-01-17 19:13

The tree

origin/bb-10.8-MDEV-14425 7f75466f539b61d3dc8696e72a2d715c59aa04d6 2022-01-14T19:52:40+02:00

behaved well in RQG testing. The bad effects that were observed also occur on other trees.

Marko Mäkelä added a comment - 2022-01-18 13:35

I rebased the bb-10.8-~~MDEV-14425~~ branch once more, so that krunalbauskar can test it. Previously, his tests were contaminated by ~~MDEV-27499~~.

Marko Mäkelä added a comment - 2022-01-21 07:01

An observation was made during testing: rr record mariadb-backup --backup cannot work reliably if the redo log file was opened via mmap(). The reason is that rr assumes that the mmap file contents may only be changed by the traced process(es). To avoid bogus failures while running backup under rr, there are a few solutions:

Build without libpmem: rm CMakeCache.txt; cmake -DCMAKE_DISABLE_FIND_PACKAGE_PMEM=1 /path/to/source
Place the server’s redo log somewhere else than /dev/shm or a PMEM device mounted with -o dax.
Patch log_t::attach() so that mmap() will not be attempted if srv_operation == SRV_OPERATION_BACKUP.
Implement server-side backup (MDEV-14992).

Matthias Leich added a comment - 2022-01-21 11:41

The tree

    origin/bb-10.8-MDEV-14425 14eaeb68e60626a9b1e4f4b611f0bc23a79f7202 2022-01-20T13:35:18+02:00

performed sufficiently well in RQG testing focused on Mariabackup. There was a surprising number of bad effects (unknown to me, but possibly already in JIRA). However, the same test battery applied to

    origin/10.8 baef53a70c675da6d19ac3c7f23c7b8b4ed8458c 2022-01-20T16:01:10+01:00

showed nearly the same effects, and in sum no fewer of them. Hence I am stopping testing now and vote for integrating MDEV-14425 into 10.8 if the corresponding MTR tests pass.

Marko Mäkelä added a comment - 2022-01-21 15:06

Thank you to everyone who tested this and provided feedback.

As recommended by axel, spinning on buf_pool.mutex was disabled, except on ARMv8.

Based on performance tests on 512-byte block devices by myself and krunalbauskar, we will not enable O_DIRECT on the redo log on Linux by default. With the setting innodb_flush_method=O_SYNC we will enable O_DIRECT on the log as well as data files. On Microsoft Windows, buffering had already been disabled for the redo log.

On operating systems other than Windows or Linux, writes to the log will keep using a block size of 512 bytes and will not bypass any file system cache. Changing that would require implementing a way to detect the physical block size.

People

Assignee: Marko Mäkelä

Reporter: Marko Mäkelä

Votes: 8

Watchers: 33

Dates

Created: 2017-11-17 05:54

Updated: 2025-02-04 12:16

Resolved: 2022-01-21 15:06
