MariaDB Server
MDEV-14425

Change the InnoDB redo log format to reduce write amplification

Details

    Description

      The InnoDB redo log format is not optimal in many respects:

      • At the start of ib_logfile0, there are two log checkpoint blocks, only 1024 bytes apart, while there exist devices with 4096-byte block size. The rest of the log file is written in a circular fashion.
      • On log checkpoint, some file name information needs to be appended to the log.
      • The names of files that were changed for the first time since the latest checkpoint must be appended to the log. The bookkeeping causes some contention on log_sys.mutex and fil_system.mutex. Edit: The contention on fil_system.mutex was practically removed in MDEV-23855, and the contention on log_sys.mutex due to this is minimal.
      • The log file was unnecessarily split into multiple files, logically treated as one big circular file. (MDEV-20907 in MariaDB Server 10.5.0 changed the default to 1 file, and later the parameter was deprecated and ignored.)
      • Log records are divided into tiny blocks of 512 bytes, with 12+4 bytes of header and footer (12+8 bytes with MDEV-12041 innodb_encrypt_log (10.4.0)).
      • We are holding a mutex while zero-filling unused parts of log blocks, encrypting log blocks, or computing checksums.
      • We were holding an exclusive latch while copying log blocks; this was fixed in MDEV-27774.
      • Mariabackup cannot copy the log without having access to the encryption keys. (It can copy data file pages without encrypting them.)

      We had some ideas to move to an append-only file and to partition the log into multiple files, but it turned out that a single fixed-size circular log file would perform best in typical scenarios.

      To address the fil_system.mutex contention whose root cause was later fixed in MDEV-23855, we were considering splitting the log as follows:

      • ib_logfile0 (after the 512-byte header) will be append-only, unencrypted, for records containing file names and checkpoint information. A checkpoint record will comprise an LSN and a byte offset in a separate, optionally encrypted, circular log file ib_logdata. The length of each record is explicitly tagged and the payload will be followed by CRC-32C.
      • The ib_logdata file can be append-only or circular. If it is circular, its fixed size must be an integer multiple of 512 bytes.

      One problem would have had to be solved: When would the ib_logfile0 be shrunk? No storage is unlimited.

      We will retain the ib_logfile0 and the basic format of its first 512 bytes for compatibility purposes, but other features could be improved.

      • We remove log block headers and footers. All we really need is to detect the logical end of the circular log. That can be achieved by making sure that mini-transactions are terminated by a sequence number (at least one bit) and a checksum. When the circular file wraps around, the sequence number will be incremented (or the sequence bit toggled).
      • For page-aligned I/O, we allow dummy records to be written, to indicate that the next bytes (until the end of the physical block, no matter what the I/O block size is) must be ignored. (The log parser will ignore these padding records, but we do not currently write them; we will keep overwriting the last physical block until it has been completely filled like we used to do until now.)
      • Encrypt and compute checksum on mtr_t::m_log before initiating a write to the circular log file. The log can be copied and checksum validated without access to encryption keys.
      • If the log is on a memory-mapped persistent memory device, then we will make log_sys.buf point directly to the persistent memory.

      Some old InnoDB redo log parameters were removed in MDEV-23397 (MariaDB 10.6.0). Some more parameters will be removed or changed here:

      • innodb_log_write_ahead_size: Removed. On Linux and Microsoft Windows, we will detect and use the physical block size of the underlying storage. We will also remove the log_padded counter from INFORMATION_SCHEMA.INNODB_METRICS.
      • innodb_log_file_buffering: Added (MDEV-28766). This controls the use of O_DIRECT on the ib_logfile0 when the physical block size can be determined.
      • innodb_log_buffer_size: The minimum value is raised to 2MiB and the granularity increased from 1024 to 4096 bytes. This buffer will also be used during recovery. Ignored when the log is memory-mapped (on PMEM or /dev/shm).
      • innodb_log_file_size: The allocation granularity is reduced from 1MiB to 4KiB.

      Some status variables will be adjusted as well:

      • Innodb_os_log_fsyncs: Removed. This will be included in Innodb_data_fsyncs.
      • Innodb_os_log_pending_fsyncs: Removed. This was limited to at most 1 by design.
      • Innodb_log_pending_writes: Removed. This was limited to at most 1 by design.

      The circular log file ib_logfile0

      The file name ib_logfile0 and the existing format of the first 512 bytes will be retained for the purpose of upgrading and preventing downgrading. In the first 512 bytes of the file, the following information will be present:

      • InnoDB redo log format version identifier (in the format introduced by MySQL 5.7.9/MariaDB 10.2.2)
      • CRC-32C checksum

      After the first 512 bytes, there will be two 64-byte checkpoint blocks at the byte offsets 4096 and 8192, containing:

      • The checkpoint LSN
      • The LSN at the time the checkpoint was created, pointing to an optional sequence of FILE_MODIFY records and a FILE_CHECKPOINT record

      The circular redo log record area starts at offset 12288 and extends to the end of the file. Unless the file was created by mariadb-backup, the file size will be a multiple of 4096 bytes.
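
      To make the layout concrete, here is a minimal sketch of the offsets described above, together with a hypothetical mapping from an LSN to a position in the circular area (the name base_lsn and the exact mapping formula are illustrative assumptions, not the server's actual code):

      #include <cstdint>

      // Layout of ib_logfile0, in bytes, as described above.
      constexpr uint64_t HEADER_SIZE  = 512;    // format version + CRC-32C
      constexpr uint64_t CHECKPOINT_1 = 4096;   // first 64-byte checkpoint block
      constexpr uint64_t CHECKPOINT_2 = 8192;   // second 64-byte checkpoint block
      constexpr uint64_t START_OFFSET = 12288;  // circular record area starts here

      // Map an LSN to a byte offset in the circular record area; base_lsn is
      // assumed to be the LSN that corresponds to START_OFFSET.
      constexpr uint64_t lsn_to_offset(uint64_t lsn, uint64_t base_lsn,
                                       uint64_t file_size)
      {
        const uint64_t capacity = file_size - START_OFFSET;
        return START_OFFSET + (lsn - base_lsn) % capacity;
      }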

      All writes to ib_logfile0 will be synchronous and durable (O_DSYNC, fdatasync() or O_SYNC, fsync() or pmem_persist()).

      Payload encoding

      The payload area will contain records in the MDEV-12353 format. Each mini-transaction will be followed by a sequence byte 0x00 or 0x01 (the value of the sequence bit), optionally (if the log is encrypted) an 8-byte nonce, and a CRC-32C of all the bytes (except the sequence byte), so that backup can avoid recomputing the checksum while copying the log to a new file.

      We want to be able to avoid overwriting the last log block, so we cannot have an explicit 'end of log' marker. We must associate each mini-transaction (atomic sequence of log records) with a sequence number (at the minimum, a sequence bit) and a checksum. The 4-byte CRC-32C is a good candidate, because it is already being used in data page checksums.
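
      As an illustration, here is a minimal sketch of validating one such frame and thereby detecting the logical end of the log. It assumes the layout described above (payload, sequence byte, optional 8-byte nonce, CRC-32C of everything except the sequence byte); the bitwise CRC-32C, the little-endian placement of the stored checksum, and the way payload_len is obtained are assumptions for illustration only:

      #include <cstddef>
      #include <cstdint>

      // Bitwise reference CRC-32C (Castagnoli), chained zlib-style: pass the
      // previous return value as crc to continue a checksum over more bytes.
      static uint32_t crc32c(uint32_t crc, const uint8_t *p, size_t n)
      {
        crc = ~crc;
        while (n--) {
          crc ^= *p++;
          for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0x82F63B78 & -(crc & 1));
        }
        return ~crc;
      }

      // Validate one frame: [payload][sequence byte][8-byte nonce if encrypted][CRC].
      // A sequence-bit or checksum mismatch means we reached the logical end of
      // the log. In a real parser, payload_len is derived from the records.
      static bool mtr_frame_valid(const uint8_t *frame, size_t payload_len,
                                  bool encrypted, uint8_t expected_seq)
      {
        if (frame[payload_len] != expected_seq)
          return false;                        // wrapped into stale log
        const size_t nonce_len = encrypted ? 8 : 0;
        // The CRC covers all bytes except the sequence byte.
        uint32_t crc = crc32c(0, frame, payload_len);
        if (encrypted)
          crc = crc32c(crc, frame + payload_len + 1, nonce_len);
        const uint8_t *c = frame + payload_len + 1 + nonce_len;
        const uint32_t stored = uint32_t(c[0]) | uint32_t(c[1]) << 8 |
                                uint32_t(c[2]) << 16 | uint32_t(c[3]) << 24;
        return crc == stored;
      }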

      Padding

      We might want to introduce a special mini-transaction 'Skip the next N bytes', encoded in sizeof(CRC)+2+log(N) bytes: CRC, record type and length, subtype and the value of the sequence bit, and variable-length encoded N. However, for a compressed storage device, it would be helpful to not have any garbage bytes in the log file. It would be better to initialize all those N bytes.

      If we need to pad a block with fewer bytes than the minimum size, we would write a record to skip the minimum size.

      This has been implemented with arbitrary-length FILE_CHECKPOINT mini-transactions whose payload consists of NUL bytes. The parser will ignore such records. We are not currently writing such records, but instead overwriting the last incomplete log block when more log is being appended, just like InnoDB always did.
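
      A sketch of the padding-length arithmetic under this scheme follows; min_pad, the smallest encodable padding record, is a hypothetical parameter whose exact value would follow from the record encoding:

      #include <cstdint>

      // Bytes of padding needed so that the next mini-transaction starts at a
      // physical block boundary. If the gap is smaller than the smallest
      // encodable padding record, we skip the minimum size instead, as
      // described above.
      constexpr uint64_t pad_to_block(uint64_t end_offset, uint64_t block_size,
                                      uint64_t min_pad)
      {
        const uint64_t gap = (block_size - end_offset % block_size) % block_size;
        return (gap && gap < min_pad) ? min_pad : gap;
      }

      For example, with a 4096-byte block and a mini-transaction ending at offset 12300, pad_to_block(12300, 4096, 7) returns 4084.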

      Mini-transaction encoding: Prepending or appending a CRC to each MDEV-12353 mini-transaction

      In the MDEV-12353 encoding, a record cannot start with the bytes 0x00 or 0x01. Mini-transactions are currently being terminated by the byte 0x00. We could store the sequence bit in the terminating byte of the mini-transaction. The checksum would exclude the terminating byte.

      Only the payload bytes would be encrypted (not record types or lengths, and not page identifiers either). In that way, records can be parsed and validated efficiently. Decryption would only have to be invoked when the log really needs to be applied on the page. The initialization vector for encryption and decryption can include the unencrypted record header bytes.

      It could be best to store the CRC before the mini-transaction payload, because the CRC of non-zero bytes cannot be 0. Hence, we can detect the end of the log without even parsing the mini-transaction bytes.

      Pros: Minimal overhead: sizeof(CRC) bytes per mini-transaction.
      Cons: Recovery may have to parse a lot of log before determining that the end of the log was reached.

      In the end, the CRC was written after the mini-transaction. The log parser can flag an inconsistency if the maximum mini-transaction size would be exceeded.

      Alternative encoding (scrapped idea): Prepending a mini-transaction header with length and CRC

      We could encapsulate MDEV-12353 records (without the mini-transaction terminating NUL byte) in the following structure:

      • variable-length encoded integer of total_length << 2 | sequence_bit
      • CRC of the data payload and the variable-length encoded integer
      • the data payload (MDEV-12353 records); could be encrypted in their entirety

      Skipped bytes (at least 5) would be indicated by the following:

      • variable-length encoded integer of skipped_length << 2 | 1 << 1 | sequence_bit
      • CRC of the variable-length encoded integer (not including the skipped bytes)

      Pros: Recovery can determine more quickly that the end of the circular log was reached, thanks to the length, sequence bit and (nonzero) CRC being stored at the start.
      Pros: More of the log could be encrypted (at the cost of recovery and backup restoration speed)
      Cons: Increased storage overhead: sizeof(CRC)+log(length * 4) bytes. For length<32 bytes, no change of overhead.
      Cons: If the encryption is based on the current LSN, then both encryption and the checksum would have to be computed while holding log_sys.mutex.
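
      For illustration, a minimal sketch of the header encoding of this scrapped scheme, assuming a little-endian base-128 varint (the actual MDEV-12353 variable-length integer encoding differs; the point here is only the length << 2 | flags packing):

      #include <cstddef>
      #include <cstdint>

      // Encode (length << 2 | skip_flag << 1 | sequence_bit) as a varint.
      // Returns the number of bytes written to out.
      static size_t encode_header(uint8_t *out, uint64_t length,
                                  bool skip, bool sequence_bit)
      {
        uint64_t v = length << 2 | uint64_t(skip) << 1 | uint64_t(sequence_bit);
        size_t n = 0;
        do {
          uint8_t b = v & 0x7F;
          v >>= 7;
          out[n++] = b | (v ? 0x80 : 0);  // high bit: more bytes follow
        } while (v);
        return n;
      }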

      Log writing and synchronous flushing

      For the bulk of the changes done by mini-transactions, we do not care about flushing. The file system can write log file blocks as it pleases.

      Some state changes of the database must be made durable at a specific time. Examples include user transaction COMMIT, XA PREPARE, XA ROLLBACK, and (in case the binlog is not enabled) XA COMMIT.

      Whenever we want to make a certain change durable, we must flush all log files up to the LSN of the mini-transaction commit that made the change.
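
      A minimal sketch of such a 'flush up to LSN' operation, assuming a single log file descriptor and a hypothetical flushed_lsn watermark; a real implementation would also have to write out the log buffer before syncing:

      #include <atomic>
      #include <cstdint>
      #include <mutex>
      #include <unistd.h>

      std::atomic<uint64_t> flushed_lsn{0};
      std::mutex flush_mutex;

      void flush_up_to(int log_fd, uint64_t target_lsn)
      {
        if (flushed_lsn.load(std::memory_order_acquire) >= target_lsn)
          return;                      // already durable; nothing to do
        std::lock_guard<std::mutex> g(flush_mutex);
        if (flushed_lsn.load(std::memory_order_relaxed) >= target_lsn)
          return;                      // another thread flushed while we waited
        fdatasync(log_fd);             // or O_DSYNC write, or pmem_persist()
        // Everything written before the fdatasync() is now durable; a real
        // implementation would record the actual end-of-write LSN here.
        flushed_lsn.store(target_lsn, std::memory_order_release);
      }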

      If redo log is physically replicated to the buffer pools of physical replicas (like in Amazon Aurora or Alibaba PolarDB), then we should first write to the local log and only then to the replicas, and we should assume that the writes to the files will always eventually be durable. If that assumption is broken, then all servers would have to be restarted and perform crash recovery.

      Crash recovery and backup

      The previous two-stage parsing (log block validation and log record parsing) was replaced with a single stage. The separate 2-megabyte buffer recv_sys.buf is no longer needed, because the bytes of the log records will be stored contiguously, except when the log file wraps around from its end to the offset 12288.

      When the log file is memory-mapped, we will parse records directly from log_sys.buf that contains a view of the entire log file. For parsing the mini-transaction that wraps from the end of the file to the start, the record parser will use a special pointer wrapper. When not using memory-mapping, we will read from the log file to log_sys.buf in such a way that the records of each mini-transaction will be contiguous.
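
      A minimal sketch of such a pointer wrapper over the memory-mapped circular record area (the names are illustrative):

      #include <cstddef>
      #include <cstdint>

      // Present the circular record area as a logically contiguous byte
      // sequence, so the parser can read a mini-transaction that wraps from
      // the end of the file back to offset 12288 without copying.
      class wrap_ptr
      {
        const uint8_t *buf_;   // start of the circular record area
        size_t capacity_;      // size of the circular record area
        size_t pos_;           // logical position; may exceed capacity_

      public:
        wrap_ptr(const uint8_t *buf, size_t capacity, size_t pos)
          : buf_(buf), capacity_(capacity), pos_(pos) {}

        uint8_t operator*() const { return buf_[pos_ % capacity_]; }
        wrap_ptr &operator++() { ++pos_; return *this; }
      };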

      Crash-upgrade from earlier versions will not be supported. Before upgrading, the old server must have been shut down, or mariadb-backup --prepare must have been executed using an appropriate older version of the backup tool.

      Starting up without ib_logfile0 will no longer be supported; see also MDEV-27199.

      Attachments

        1. 81cf92e9471.pdf
          29 kB
        2. append.c
          0.6 kB
        3. MDEV-14425.pdf
          29 kB
        4. NUMA_1.pdf
          37 kB
        5. NUMA_1vs2.pdf
          29 kB
        6. NUMA_2.pdf
          38 kB
        7. preallocate.c
          0.6 kB


          Activity

            As part of this work, the function log_buffer_extend() will be removed.

            marko Marko Mäkelä added a comment

            As part of MDEV-14425 the recovery logic should be improved so that when a redo log block is corrupted, only the mini-transaction(s) that are (partly or fully) contained in the block will be skipped. This would augment MDEV-12699, which is about improving the recovery of corrupted data pages.

            marko Marko Mäkelä added a comment

            The function log_free_check() will have to be replaced.
            With the original circularly-written InnoDB redo log, the function sought to prevent a situation where the tail of the log overwrites the head before the head is logically truncated by a redo log checkpoint. If such overwriting happens, InnoDB will be unable to recover from a crash. The situation would be normalized by a redo log checkpoint.

            With these append-only log files, the overwriting issue is replaced with another one: running out of space in the file system. So, we will continue to need a function similar to log_free_check(). If the file system of the current thread’s log file is about to run out of space, the replacement of log_free_check() would return an ‘out of space’ error, which would be returned all the way up the call stack. If the ultimate caller is a client connection, the error would be reported as ER_DISK_FULL.

            Running out of space should only be an issue when log archiving is enabled (innodb_log_max_size is overridden from its default value).
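
            A minimal sketch of the free-space probe that such a log_free_check() replacement could use on POSIX systems (the function name and the threshold handling are assumptions):

            #include <cstdint>
            #include <sys/statvfs.h>

            // Return true if the file system holding the log directory still
            // has at least reserve_bytes available; the caller would map
            // 'false' to an ER_DISK_FULL-style error as described above.
            static bool log_space_available(const char *log_dir,
                                            uint64_t reserve_bytes)
            {
              struct statvfs vfs;
              if (statvfs(log_dir, &vfs) != 0)
                return false;   // treat an unreadable file system as full
              return uint64_t(vfs.f_bavail) * vfs.f_frsize >= reserve_bytes;
            }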

            marko Marko Mäkelä added a comment

            For the record: monty asked me to compare the relative performance of writing to a preallocated file and appending to a file.
            On my HP ZBook 15u G3 laptop, I ran the test programs append.c and preallocate.c to write a 2GiB file on ext4fs, and the results are clear.
            I used 2 SSD devices: 220GiB NVMe that is encrypted by dm-crypt, and a 450GiB SATA SSD that is not encrypted. With 4 programs running in parallel like this:

            time ./append foo1&time ./append foo2&time ./append foo3&time ./append foo4
            

            the reported real time was like this:

                          append    preallocate
            NVMe          18.7s     24.2s
            SATA          21.0s     66.7s

            It would be interesting to test other relevant file systems as well as HDD.

            marko Marko Mäkelä added a comment
            danblack Daniel Black added a comment -

            Some server hardware stats:

            Power Systems S822LC "Firestone", Power8 - 10 cores
            Ubuntu-16.04 userspace, Kernel 4.14.0-27552-gadb89ad
            nvme - IBM PCIe3 1.6TB NVMe Flash Adapter
            disk - ST1000NX0313, 7200rpm - SATA 3.1, 2 disks in software raid1 (changed to 4.15.0-10668-g3527799 kernel - managed to hit kernel thread that hung)

            medium                                 append       preallocate
            nvme, xfs bsize=4096 (default)         6.827s       6.706s
            nvme, xfs bsize=4096 (using ix10)      57.257s      57.650s
            nvme, xfs bsize=16384 (using ix10)     56.621s      55.880s
            nvme, ext3                             9.191s       9.220s
            nvme, ext3 (using ix10)                54.680s      1m45.147s
            nvme, ext4 (using ix10)                55.242s      53.415s
            disk, ext4 on LVM                      1m39.221s    1m42.376s

            ix10 means the i variable in the source was multiplied by 10; at ~6 seconds, the original file size felt too small.

            The error margin seems to be about 5% based on repetition (most results within a batch of 4 were identical; between batches they varied). This is assumed to be an effect of journalling.

            So ext3 is worse at preallocation; otherwise the results are comparable. Multiple runs were performed.

            junsu Jun Su added a comment -

            Your preallocate code doesn't reflect reality well. Since a log file can be written many times, it contains random data rather than just a freshly fallocate'd extent. Your code simulates the situation when the log file is written for the first time (a brand-new server). Please consider adding a command-line parameter that opens the existing file from a previous run and skips the fallocate, to simulate the log file being overwritten.


            danblack, thanks for your benchmarks!

            junsu, thank you for your comment. Another problem with the test program is that there is only one fdatasync() call at the end, while in reality there would be an fdatasync() whenever we need to make something durable.

            I am not actively working on this task yet. I would welcome improved versions of the test programs, as well as more benchmarks, especially ones that would seem to indicate that preallocating is faster than appending.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            In addition to the proposed parameter innodb_log_max_size, there should perhaps also be some time limit after which excessive old logs will be discarded. Perhaps innodb_log_min_age?
            To ease maintenance and backups, maybe we should have an option to have strictly append-only log files, and should revise the design so that log files are never renamed. Most notably, there should be no header in the partitioned log files; the checkpoint block offsets would be written into the single-block control file ib_logfile0 only.

            On a related note, last Tuesday I gave a high-level view of the InnoDB internals in a M18 talk Deep Dive: InnoDB Transactions and Write Paths (video). The slides with the diagram on mini-transactions and describing the log checkpoints could be useful background information.

            inaamrana Inaam Rana added a comment -

            Marko,

            This is a very interesting idea. I believe we should think of the redo log as nothing but a bunch of ordered changes to the pages. This implies that if for a given page we have all the changes available, we should be able to deal with gaps in the log (obviously, as long as we do flush all log files at checkpoint).

            How about we map pages to log partitions based on space_id:page_no, i.e. all changes to a page must always go to the same log file. If this invariant is maintained, then we don't need to flush anything on mtr_commit(). As a trx touches different pages, it will keep track of which log files it needs to flush. For small trxs like a single-row DML it will be a single file. This calculation can be done at mtr_commit(). At trx_commit() we only flush the relevant files.

            I think with above scheme we can allow gaps in LSN. Conceptually this design feels intuitive. Redo log records changes to pages. Redo logs are partitioned based on page number. Log buffer can also be partitioned extending the same logic.

            As an aside, if we redesign the format, maybe we should add information during mtr_commit() about a back pointer to the last redo record that changed the page. If we are able to follow the chain of changes to a page, that might actually be quite helpful. There might not be an immediate use case for MariaDB, but log format changes are cumbersome to orchestrate.


            inaamrana, thank you for the valuable feedback.

            Your suggestion to flush a subset of the log files at transaction commit and to ignore gaps in log recovery seems to imply a partial ordering of events. With the added invariant that modifications of a certain page must always go to a certain log file, I cannot see any obvious correctness problem. We would only need an additional field "total number of log files" for the mini-transaction, so that recovery can reject a mini-transaction that was not fully written to all log files. What if this is combined with physical replication? We should only replicate each mini-transaction log up to the latest log flush. Here it could be tricky to guarantee the atomicity of the replicated mini-transactions without flushing all logs up to our mini-transaction commit LSN.

            This architecture would seem to require some scatter-gather operation or partitioning of the local mini-transaction buffer, so that the log records for pages are written to the appropriate log files. Maybe the easiest way to arrange that would be by copying log snippets from local mini-transaction buffers to a per-log global buffer. This would also imply that the mini-transaction log for a given mini-transaction (identified by logical LSN) can exist in multiple log files.

            For maximum commit concurrency, I think that in this scheme, there should be 1 redo log for each rollback segment (now that with MDEV-15132 and MDEV-15158 commit only writes to the rollback segment header and undo log header pages, not the TRX_SYS page). For maximal concurrency, we could experiment with dedicated redo logs for transaction metadata, and separate redo logs for data file changes.

            This is definitely worth trying. I think that we should prototype and benchmark both approaches before committing to a solution.

            The back-pointer to the last record that changed the page sounds like a good idea to me, and certainly useful for troubleshooting. When the full log is archived, this could allow faster point-in-time recovery. In your suggested scheme, this could be a byte offset from the start of the file. In my original scheme, it should probably be LSN, and some searching would be needed to find the record among the log files.

            marko Marko Mäkelä added a comment
            micai Minshen Cai added a comment -

            In my opinion, there are issues on Inaam's idea. The below is a use case.
            1. Mini-transaction m1 updates page A via log file 1, and updates page B via log file 2.
            2. Mini-transaction m2 updates page A via log file 1.
            3. All write to log file 1 completes.
            4. m2 is in the user transaction tr2. tr2 commits successfully.
            5. But at this time, the write to log file 2 for m1 hasn't finished.
            6. The system is killed.
            7. Recovery sees the commit of m2. The change of page A made by m2 is replayed.
            8. Recovery doesn't see the complete log events of m1 in log file 2. m1 isn't complete, so m1 is discarded. The change to page A made by m1 is ignored. As a result, page A may be inconsistent.

            In short, there might be uncommitted transactions that modify the same pages as our transaction. The mini-transactions of such uncommitted transactions may write to log files other than the ones written by our transaction. So, to avoid the above issue, when committing our transaction we need to flush not only the redo log files the current transaction writes, but also all redo log files written by such uncommitted transactions.

            Because of this, Inaam's idea might not perform well in practice.


            I agree with micai that likely the only practical solution for preventing the scenario is to flush all redo log files whenever a state change needs to be made durable in the database. This would seem to remove any performance benefits of partitioning the log file.

            inaamrana’s idea should still work in special cases, such as when each mini-transaction modifies its private set of pages, or if a (short) user transaction keeps page locks for the whole duration of the transaction. Maybe we could consider having separate groups of undo-redo-log or rseg-redo-log files, and allowing some level of partial ordering among related mini-transaction commits.

            I was also thinking about extending the checkpoint information. Perhaps we should store all checkpoints in a separate sequential log file, pointing to the individual log files that contain changes since the start of the checkpoint. Perhaps all MLOG_FILE_ entries should be written to the checkpoint log file, while the log record files would only contain page-level log.

            marko Marko Mäkelä added a comment
            inaamrana Inaam Rana added a comment -

            micai, you are right. I haven't thought it through enough. Back to the drawing board.

            marko Marko Mäkelä added a comment - edited

            Regarding the preallocate.c vs append.c, today I found out that there still exist file systems where writing to a preallocated file is faster than appending to a file. This means that we should continue to offer an option where the log file is preallocated and written in circular fashion, instead of being written in append-only mode.

            I also ran a modified version of preallocate.c that would omit the O_CREAT flag and the posix_fallocate() call. On ext4, fallocate -l 2g file completed in virtually no time, and using the modified test program to write to the preallocated file took around 25 seconds on the same hardware where I previously reported 24.2 seconds.


            MDEV-15914 (and its main fix) showed that a small change to the redo log volume can have a huge impact on performance.
            I think that the redo log record format must allow multiple byte strings to be written to the same page without repeating the tablespace identifier or page number.

            marko Marko Mäkelä added a comment

            The VCDIFF format implemented in Xdelta could be a good starting point for a new redo log format.

            marko Marko Mäkelä added a comment

            The bsdiff format has a different design goal: using a lot of RAM and CPU, create a minimal "binary patch" that can be distributed to a large number of clients. In InnoDB, there usually is at most 1 "consumer" of the redo log: the InnoDB crash recovery.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            Related to this work, MDEV-18115 will stop creating a fil_space_t object and using the fil_io() and fil_flush(SRV_LOG_SPACE_FIRST_ID) interfaces for writing to the redo log files. This should reduce contention on fil_system.mutex.


            While implementing this, we should ensure that the latest redo log block is never being overwritten. InnoDB is currently doing that, and mariabackup compensates for it by re-reading the latest redo log block if it was not completely filled. Rewriting the latest redo log block feels crash-unsafe: if the server is killed during the write, you could end up with a corrupted log block and lose a few redo log records. In the worst case, you would lose an already durable transaction commit, or some pages would already have been flushed with the LSN of a mini-transaction that was lost because of the log block corruption.

            marko Marko Mäkelä added a comment

            I think that it should be simplest to exclusively use synchronous I/O for the redo log. Currently, the log checkpoint write uses asynchronous I/O.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            My reading of man 2 write suggests that whether or not the individual redo log files are append-only or written in a circular fashion, we should not need any mutex to guard concurrent writes from multiple threads to a file through a shared file descriptor:

            For a seekable file (i.e., one to which lseek(2) may be applied, for example, a regular file) writing takes place at the file offset, and the file offset is incremented by the number of bytes actually written. If the file was open(2)ed with O_APPEND, the file offset is first set to the end of the file before writing. The adjustment of the file offset and the write operation are performed as an atomic step.

            BUGS

            According to POSIX.1-2008/SUSv4 Section XSI 2.9.7 ("Thread Interactions with Regular File Operations"):

            "All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: ..."

            Among the APIs subsequently listed are write() and writev(2). And among the effects that should be atomic across threads (and processes) are updates of the file offset. However, on Linux before version 3.14, this was not the case: if two processes that share an open file description (see open(2)) perform a write() (or writev(2)) at the same time, then the I/O operations were not atomic with respect to updating the file offset, with the result that the blocks of data output by the two processes might (incorrectly) overlap. This problem was fixed in Linux 3.14.

            The mentioned Linux kernel bug should not affect InnoDB, because InnoDB would be writing to the log files from multiple threads of the same process.

            The key seems to be to invoke write() or a similar function that uses and updates the current position of the file descriptor. pwrite() would require the caller to keep track of a position, and we do not want that. All the log of a mini-transaction should be written with a single system call. We will probably want some framing with explicit length and checksum around each mini-transaction log snippet, instead of forcing the log to be structured as blocks.

            Edit: An open problem with multiple concurrent threads writing to a file is that each write can be truncated into a partial write if the write is interrupted by a signal. If such an interrupting signal can be sent only to a subset of the file-writing threads, the log could easily be corrupted. Having the very last write truncated due to the server being killed is tolerable, but continuing writes to the log after a truncated write is not.

            Hence, I believe that it could be cleaner to have O_DIRECT synchronous write requests from a single thread, writing full, aligned blocks. Partly filled blocks would only be written on log_write_up_to(), just like it is now. The main difference would be that log could be written into multiple files in parallel.

            We might also employ some form of Lempel-Ziv compression on the log data that is going to be written. This would require identifying some ‘restart points’ for parsing the log.
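
            A minimal sketch of such a single-writer block write, assuming the log file was opened with O_DIRECT|O_DSYNC and that BLOCK is the detected physical block size; posix_memalign() satisfies the memory-alignment requirement of O_DIRECT:

            #include <cstdlib>
            #include <cstring>
            #include <unistd.h>

            constexpr size_t BLOCK = 4096;  // assumed physical block size

            // Write one full, aligned block at an explicit offset. pwrite()
            // avoids any shared file-offset state, so only the single writer
            // thread issues log writes.
            static ssize_t write_block(int fd, const void *src, size_t len,
                                       off_t offset)
            {
              void *buf;
              if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
                return -1;
              memset(buf, 0, BLOCK);   // NUL-fill the unused tail of the block
              memcpy(buf, src, len);   // len <= BLOCK: the partly filled case
              const ssize_t ret = pwrite(fd, buf, BLOCK, offset);
              free(buf);
              return ret;
            }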


            MDEV-12353 will improve the format of individual redo log records. This task will implement some framing around them, such as compression, division into blocks, LSN assignment. I think that for now, we will write the entire log of a single mini-transaction into a single log file.

            marko Marko Mäkelä added a comment

            Some ideas for improving the redo log block format:

            • Use a variable block size, so that a mini-transaction will never be split between blocks. Write the size at the start of the block and the checksum at the end. In this way, we can make the collection of log records point directly to the parse buffer, and also remove recv_data_copy_to_buf(). (If log blocks are compressed, decompression would output to the parse buffer.)
            • Each time a new block is seen, it will mark the start of a new mini-transaction. In this way, the log blocks can avoid encoding the LSN (which would grow by one per mini-transaction commit).
            • Allow NUL-padding of short blocks to the physical block size, so that if a log flush is needed, read-modify-write on the file system can be avoided.

            When it comes to file operations and log checkpoints, I think that it could be worthwhile to have a separate sequential log file that keeps a number of latest checkpoints as well as all the file operations. The checkpoint information in the checkpoint log file would point to the data log files, which would only contain page-level redo log records. This would remove the need to call the equivalent of fil_names_clear() at log checkpoint. Only when emptying or rotating the checkpoint log file we would write the equivalent of MLOG_FILE_NAME records to the new checkpoint log file.

            marko Marko Mäkelä added a comment

            I think that we must abandon the idea of partitioning the redo log. I would go with 3 files:

            • mostly dummy ib_logfile0 to identify the file format
            • an append-only, binlog-style-rotated file for checkpoints and file-level operations (create, delete, rename, modify)
            • a page-level redo log file, 2 variants: block-oriented circular, or append-only
            • LSN will be logical, incremented by 1 on mtr_t::commit() when redo log records were generated.

            Even in the circular log file format, page-level redo log records will never be interrupted by log block trailers or headers. That is, we will write variable-size blocks, with the LSN at the start of the block (to detect the end of "new" log).

            In the byte-oriented append-only log file format, if a persistent write is requested (on user transaction commit), we will write an extra record that contains a checksum of the bytes that were written since the previous persistent write. The payload of the stream could even be encrypted.

            To alleviate the log_sys.mutex and log_sys.write_mutex bottleneck, we will introduce a dedicated log writing task, which will:

            • Collect mtr_t::log records and encode them into a private buffer of this task
            • In case of a circular log file, issue a log checkpoint if the log tail would overwrite the head. (Other calls to log_free_check() can be removed!)
            • Before a log checkpoint is issued, any background crash recovery (MDEV-14481) must be finished.
            • If persistence is requested, issue a synchronous write of the data to the log file.

            Because page-level log records of any single mini-transaction will be continuous streams of bytes in the log file, on recovery we can avoid copying log record snippets to recv_sys.pages. Instead, we can simply attach pointers to log_sys.buf that was read from the page-level redo log file. This should automatically fix MDEV-19176 when using the MDEV-12353 format. Note: compressing the log record stream could prevent such optimization, so we will not introduce any compression for now.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment - edited

            Currently, on recovery, redo log records are being copied twice from memory to memory:

            1. from redo log file blocks to contiguous strings of bytes
            2. from contiguous strings of bytes to recv_t (in limited-size chunks that are allocated from the buffer pool)

            We have a 2MiB recv_sys.buf for the initial buffering. The minimum size of log_sys.buf would be 16MiB, and that buffer should be practically unused during recovery. If the buffer pool size is measured in gigabytes, it would indeed make sense to use the buffer pool for the recovered log records.

            I updated MDEV-19176 with a suggested design how to improve the buffer pool utilization during crash recovery. It is independent of the redo log record or file format.


            When innodb_scrub_log=ON, log checkpoint should clear the unused part of the redo log, simply by invoking fallocate(fd, FALLOC_FL_PUNCH_HOLE, offset, len) or its Windows equivalent. There is no need for a separate log_scrub_thread. The hole-punching should work with both circular and append-only log files.

            Removing the log_scrub_thread should fix MDEV-18370, MDEV-20474, MDEV-20475.
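
            A minimal sketch of the hole-punching call; note that on Linux, FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE so that the apparent file size stays unchanged:

            #include <fcntl.h>          // fallocate() (Linux-specific)
            #include <linux/falloc.h>   // FALLOC_FL_PUNCH_HOLE, FALLOC_FL_KEEP_SIZE

            // Deallocate (zero out) the unused [offset, offset+len) range of
            // the log file at checkpoint time, keeping the file size intact.
            static int scrub_log_range(int fd, off_t offset, off_t len)
            {
              return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                               offset, len);
            }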

            marko Marko Mäkelä added a comment

            kevg Eugene Kosov (Inactive) added a comment -

            This is my test program, which was created to test the performance of writing to a circular file versus appending to a file. I've run the test on my laptop with an HDD and ext4.

            Results without fsync():

            File size 524288 Kb
            Writing 4194304 Kb to it
             
            Simple cyclic file
            Took 6 seconds 714 milliseconds
             
            Mmapped cyclic file
            Took 0 seconds 532 milliseconds
             
            O_APPEND append file
            Took 25 seconds 260 milliseconds
             
            Simple append file
            Took 24 seconds 623 milliseconds
            

            Results with fsync():

            File size 1024 Kb
            Writing 4096 Kb to it
             
            Simple cyclic file
            Took 59 seconds 111 milliseconds
             
            O_DSYNC cyclic file
            Took 59 seconds 267 milliseconds
             
            Mmapped cyclic file
            Took 58 seconds 516 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 107 seconds 6 milliseconds
             
            O_APPEND append file
            Took 202 seconds 492 milliseconds
             
            Simple append file
            Took 249 seconds 301 milliseconds
            

            And the program itself:

            #include <sys/types.h>
            #include <sys/mman.h>
            #include <sys/stat.h>
            #include <fcntl.h>
            #include <unistd.h>
             
            #include <cerrno>
            #include <cassert>
            #include <cstdlib>
            #include <cstring>
            #include <cstdint> /* uintptr_t, used below for page alignment */
             
            #include <array>
            #include <chrono>
            #include <string>
            #include <iostream>
             
            static const size_t kWriteTotal = 4 * 1024 * 1024;
            static const size_t kFileSize = 1 * 1024 * 1024;
            static const std::array<unsigned char, 1024> kBuf = { 33 };
            alignas(512) static const std::array<unsigned char, 1024> kAlignedBuf = { 33 };
            static_assert(kFileSize % kBuf.size() == 0, "");
            static const std::string kPath = "test_file";
             
            void ShowErrnoAndExit(std::string function_name) {
              std::cerr << function_name << " returned errno " << strerror(errno) << "\n";
              std::exit(EXIT_FAILURE);
            }
             
            struct File {
              File(std::string path, int additional_flags = 0) : path_(std::move(path)) {
                fd_ = open(path_.c_str(),
            	       O_CREAT | O_TRUNC | O_RDWR | additional_flags,
            	       S_IRUSR | S_IWUSR);
                if (fd_ == -1)
                  ShowErrnoAndExit("open()");
              }
             
              void Resize(off_t length) {
                assert(fd_ != -1);
             
                if (int ret = posix_fallocate(fd_, 0, length)) {
                  errno = ret;
                  ShowErrnoAndExit("posix_fallocate()");
                }
              }
             
              void Write(const void *buf, size_t count) {
                assert(fd_ != -1);
             
                ssize_t ret = write(fd_, buf, count);
                if (ret == -1)
                  ShowErrnoAndExit("write()");
             
                if (static_cast<size_t>(ret) != count) {
                  std::cerr << "write() partial write\n";
                  std::exit(EXIT_FAILURE);
                }
              }
             
              void PWrite(const void *buf, size_t count, off_t offset) {
                assert(fd_ != -1);
             
                ssize_t ret = pwrite(fd_, buf, count, offset);
                if (ret == -1)
                  ShowErrnoAndExit("pwrite()");
             
                if (static_cast<size_t>(ret) != count) {
                  std::cerr << "pwrite() partial write\n";
                  std::exit(EXIT_FAILURE);
                }
              }
             
              void Fsync() {
                assert(fd_ != -1);
             
                if (fsync(fd_) == -1)
                  ShowErrnoAndExit("fsync()");
              }
             
              void Fdatasync() {
                assert(fd_ != -1);
             
                if (fdatasync(fd_) == -1)
                  ShowErrnoAndExit("fdatasync()");
              }
             
              size_t Size() {
              assert(fd_ != -1);
             
                struct stat s;
                if (fstat(fd_, &s) == -1)
                  ShowErrnoAndExit("fstat()");
             
                return s.st_size;
              }
             
              void Mmap() {
                assert(fd_ != -1);
             
                size_t size = Size();
                mapped_ = mmap(nullptr, size,
            		   PROT_READ | PROT_WRITE,
            		   MAP_SHARED | MAP_POPULATE,
            		   fd_, 0);
                if (mapped_ == MAP_FAILED)
                  ShowErrnoAndExit("mmap()");
              }
             
              unsigned char *GetMmappedRegion() {
                assert(fd_ != -1);
                assert(mapped_);
             
                return static_cast<unsigned char *>(mapped_);
              }
             
              void Msync(void *addr, size_t len) {
                if (msync(addr, len, MS_SYNC) == -1) {
                  ShowErrnoAndExit("msync()");
                }
              }
             
              void Munmap() {
                assert(fd_ != -1);
                assert(mapped_ != nullptr);
             
                if (munmap(mapped_, Size()) == -1)
                  ShowErrnoAndExit("munmap()");
             
                mapped_ = nullptr;
              }
             
              ~File() {
                if (mapped_)
                  Munmap();
             
                if (fd_ != -1 && close(fd_) == -1)
                  ShowErrnoAndExit("close()");
             
             
                if (unlink(path_.c_str()) == -1)
                  ShowErrnoAndExit("unlink()");
              }
             
              int fd_{-1};
              void *mapped_{nullptr};
              std::string path_;
            };
             
            struct SimpleCyclicWriter {
              SimpleCyclicWriter(std::string path) : file_(path) {
                file_.Resize(kFileSize);
                file_.Fsync();
              }
             
              static std::string Name()  { return "Simple cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                file_.PWrite(kBuf.data(), kBuf.size(), offset);
                offset_ += kBuf.size();
              }
             
              void Flush() { file_.Fdatasync(); }
             
              File file_;
              off_t offset_{0};
            };
             
            struct DSyncCyclicWriter {
              DSyncCyclicWriter(std::string path) : file_(path, O_DSYNC) {
                file_.Resize(kFileSize);
                file_.Fsync();
              }
             
              static std::string Name()  { return "O_DSYNC cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                file_.PWrite(kBuf.data(), kBuf.size(), offset);
                offset_ += kBuf.size();
              }
             
              void Flush() { }
             
              File file_;
              off_t offset_{0};
            };
             
            struct ODirectODsyncCyclicWriter {
              ODirectODsyncCyclicWriter(std::string path) :
                  file_(path, O_DIRECT | O_DSYNC) {
                file_.Resize(kFileSize);
                file_.Fsync();
              }
             
              static std::string Name()  { return "O_DIRECT|O_DSYNC cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                file_.PWrite(kAlignedBuf.data(), kAlignedBuf.size(), offset);
                offset_ += kAlignedBuf.size();
              }
             
              void Flush() { }
             
              File file_;
              off_t offset_{0};
            };
             
            struct MmappedCyclicWriter {
              MmappedCyclicWriter(std::string path) : file_(path) {
                file_.Resize(kFileSize);
                file_.Fsync();
                file_.Mmap();
              }
             
              static std::string Name()  { return "Mmapped cyclic file"; }
             
              void Write() {
                const off_t offset = offset_ % kFileSize;
                memcpy(file_.GetMmappedRegion() + offset, kBuf.data(), kBuf.size());
                prev_offset_ = offset;
                offset_ += kBuf.size();
              }
             
              void Flush() {
                auto *start = file_.GetMmappedRegion() + prev_offset_;
                auto *end = start + kBuf.size();
                start = start - reinterpret_cast<std::uintptr_t>(start) % page_size_;
             
                assert(start >= file_.GetMmappedRegion());
                assert(end <= file_.GetMmappedRegion() + file_.Size());
                file_.Msync(start, end - start);
              }
             
              File file_;
              uintptr_t page_size_{static_cast<uintptr_t>(sysconf(_SC_PAGE_SIZE))};
              off_t prev_offset_{0};
              off_t offset_{0};
            };
             
            struct OAppendAppendWriter {
              OAppendAppendWriter(std::string path) : file_(path, O_APPEND) {}
             
              static std::string Name()  { return "O_APPEND append file"; }
             
              void Write() { file_.Write(kBuf.data(), kBuf.size()); }
              void Flush() { file_.Fsync(); }
             
              File file_;
            };
             
            struct SimpleAppendWriter {
              SimpleAppendWriter(std::string path) : file_(path) {}
             
              static std::string Name()  { return "Simple append file"; }
             
              void Write() { file_.Write(kBuf.data(), kBuf.size()); }
              void Flush() { file_.Fsync(); }
             
              File file_;
            };
             
            struct Timer {
              using Clock = std::chrono::steady_clock;
             
              ~Timer() {
                auto duration = Clock::now() - now_;
                auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(duration);
                auto s = std::chrono::duration_cast<std::chrono::seconds>(ms);
                ms = ms - s;
                std::cout << "Took " << s.count() << " seconds " << ms.count() <<
                    " milliseconds\n";
              }
             
              Clock::time_point now_ = Clock::now();
            };
             
            template <class Writer>
            void Test() {
              std::cout << Writer::Name() << "\n";
              Writer file_(kPath);
             
              {
                Timer timer;
                size_t write_total = kWriteTotal;
                while (write_total) {
                  file_.Write();
                  file_.Flush();
                  write_total -= kBuf.size();
                }
              }
             
              std::cout << "\n";
            }
             
            int main() {
              std::cout << "File size " << kFileSize / 1024 << " Kb\n";
              std::cout << "Writing " << kWriteTotal / 1024 << " Kb to it\n";
              std::cout << "\n";
             
              Test<SimpleCyclicWriter>();
              Test<DSyncCyclicWriter>();
              Test<MmappedCyclicWriter>();
              Test<ODirectODsyncCyclicWriter>();
              Test<OAppendAppendWriter>();
              Test<SimpleAppendWriter>();
             
              return EXIT_SUCCESS;
            }
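
            For reference, the program builds with any C++11 compiler; assuming the source is saved as write_test.cc:

            g++ -O2 -std=c++11 write_test.cc -o write_test && ./write_test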
            

            So, writing to a circular file is clearly much faster than appending to a file.


            kevg Eugene Kosov (Inactive) added a comment -

            junsu danblack could you please run my program on some server hardware? I have no access to anything but my not-so-up-to-date laptop.

            The biggest question is to decide what's faster: appending to a file or writing to a circular file. Sorry, no CLI interface for my program, but you can tweak the file size and the amount of data to write by changing the globals.

            baotiao zongzhi chen added a comment -

            Hey guys. I have done almost the same work: changing the redo log from a circular file to appending to a new file.
            Of course, writing to a circular file is much faster than appending to a file, since appending needs to modify the inode and allocate an extent for the file, which takes about 8 times as long as writing to a circular file. I have shown the results in this slide https://www.slideshare.net/baotiao/polardb-percona19 from page 18.
            However, writing to a circular file needs to solve the "read-on-write" issue, while appending to a file does not.

            So the way we use it in our environment is: when there is no stale redo log file, we allocate a new redo log file and fill it with zeroes. In the background, when some stale redo log file is no longer needed, we don't delete it directly; we rename the stale file to become a new redo log file. In InnoDB, we also pad writes to 4k to avoid the "read-on-write" issue. (A sketch of this scheme follows below.)
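
            A hedged sketch of that recycling scheme (hypothetical file names and helpers, not the actual PolarDB code): pre-zero a log file once so that later appends only overwrite already-allocated extents, and rename stale logs into place instead of creating fresh ones:

            #include <fcntl.h>
            #include <unistd.h>
            #include <cstdio>    /* rename() */
            #include <algorithm>
            #include <vector>
             
            /* Create a fully zero-filled log file, paying the extent-allocation
               cost once up front instead of on every append. */
            static int create_prezeroed_log(const char *path, size_t size)
            {
              int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
              if (fd == -1) return -1;
              std::vector<char> zeroes(1 << 20, 0);
              for (size_t written = 0; written < size; ) {
                ssize_t n = write(fd, zeroes.data(),
                                  std::min(zeroes.size(), size - written));
                if (n <= 0) { close(fd); return -1; }
                written += size_t(n);
              }
              if (fsync(fd) || close(fd)) return -1;
              return 0;
            }
             
            /* Recycle a stale, no-longer-needed log file as the next log file:
               its extents are already allocated, so no zero-filling is needed. */
            static int recycle_log(const char *stale_path, const char *next_path)
            {
              return rename(stale_path, next_path);
            }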

            danblack Daniel Black added a comment -

            'expanded size'

            static const size_t kFileSize = 1 * 128 * 1024 * 1024;
             
            static const size_t kWriteTotal = 4 * kFileSize;
            

            'nvme, ext4 (rw,relatime), kernel 5.3.0 - POWER9'

            $ ~/write_test
            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
            Took 132 seconds 538 milliseconds
             
            O_DSYNC cyclic file
            Took 61 seconds 86 milliseconds
             
            Mmapped cyclic file
            Took 248 seconds 543 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            pwrite() returned errno Invalid argument
             
            O_APPEND append file
            Took 959 seconds 463 milliseconds
             
            Simple append file
            Took 770 seconds 324 milliseconds
            

            '2x 12G SAS disks, raid0 - ServeRAID M5210, lvm (1 linear continuous map), xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota), 5.3.0-24-generic (ubuntu), x86_64'

            ./write_test 
            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
            Took 98 seconds 359 milliseconds
             
            O_DSYNC cyclic file
            Took 99 seconds 659 milliseconds
             
            Mmapped cyclic file
            Took 100 seconds 107 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 42 seconds 208 milliseconds
             
            O_APPEND append file
            Took 300 seconds 383 milliseconds
             
            Simple append file
            Took 303 seconds 216 milliseconds
            

            Looking why NVMe was slow - seems to be 4k LBA:

            'smartctl -a /dev/nvme0n1'

            smartctl 6.6 2016-05-31 r4324 [ppc64le-linux-5.3.0] (local build)
            Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
             
            === START OF INFORMATION SECTION ===
            Model Number:                       PCIe3 1.6TB NVMe Flash Adapter
            Serial Number:                      CJH0010003EE
            Firmware Version:                   KMIPP107
            PCI Vendor ID:                      0x1c58
            PCI Vendor Subsystem ID:            0x1014
            IEEE OUI Identifier:                0x000cca
            Controller ID:                      1269
            Number of Namespaces:               1
            Namespace 1 Size/Capacity:          1,600,321,314,816 [1.60 TB]
            Namespace 1 Formatted LBA Size:     4096
            Local Time is:                      Wed Jan  8 13:22:03 2020 AEDT
            Firmware Updates (0x08):            4 Slots
            Optional Admin Commands (0x0006):   Format Frmw_DL
            Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
             
            Supported Power States
            St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
             0 +    25.00W       -        -    0  0  0  0    15000   15000
             1 +    20.00W       -        -    1  1  1  1    15000   15000
             2 +    15.00W       -        -    2  2  2  2    15000   15000
             3 +    10.00W       -        -    3  3  3  3    15000   15000
             4 -    10.00W       -        -    3  3  3  3    15000   15000
             
            Supported LBA Sizes (NSID 0x1)
            Id Fmt  Data  Metadt  Rel_Perf
             0 +    4096       0         0
             1 -    4096       8         1
             
            === START OF SMART DATA SECTION ===
            SMART overall-health self-assessment test result: PASSED
             
            SMART/Health Information (NVMe Log 0x02, NSID 0x1)
            Critical Warning:                   0x00
            Temperature:                        36 Celsius
            Available Spare:                    100%
            Available Spare Threshold:          10%
            Percentage Used:                    0%
            Data Units Read:                    3,643,068 [1.86 TB]
            Data Units Written:                 6,647,820 [3.40 TB]
            Host Read Commands:                 83,729,047
            Host Write Commands:                81,415,498
            Controller Busy Time:               841
            Power Cycles:                       1,569
            Power On Hours:                     22,345
            Unsafe Shutdowns:                   516
            Media and Data Integrity Errors:    0
            Error Information Log Entries:      0
             
            Error Information (NVMe Log 0x01, max 63 entries)
            No Errors Logged
            

            'dumpe2fs'

            sudo dumpe2fs -h /dev/nvme0n1p1 | more
            dumpe2fs 1.44.1 (24-Mar-2018)
            Filesystem volume name:   scratch
            Last mounted on:          /scratch
            Filesystem UUID:          356af640-5d8e-4256-9dce-a0983c9e0e43
            Filesystem magic number:  0xEF53
            Filesystem revision #:    1 (dynamic)
            Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
            Filesystem flags:         unsigned_directory_hash 
            Default mount options:    user_xattr acl
            Filesystem state:         clean
            Errors behavior:          Continue
            Filesystem OS type:       Linux
            Inode count:              65052672
            Block count:              260208384
            Reserved block count:     13010419
            Free blocks:              132171164
            Free inodes:              62683469
            First block:              0
            Block size:               4096
            Fragment size:            4096
            Reserved GDT blocks:      961
            Blocks per group:         32768
            Fragments per group:      32768
            Inodes per group:         8192
            Inode blocks per group:   512
            Flex block group size:    16
            Filesystem created:       Wed Nov 22 15:34:07 2017
            Last mount time:          Tue Jan  7 20:38:12 2020
            Last write time:          Tue Jan  7 20:38:12 2020
            Mount count:              293
            Maximum mount count:      -1
            Last checked:             Wed Nov 22 15:34:07 2017
            Check interval:           0 (<none>)
            Lifetime writes:          20 TB
            Reserved blocks uid:      0 (user root)
            Reserved blocks gid:      0 (group root)
            First inode:              11
            Inode size:	          256
            Required extra isize:     32
            Desired extra isize:      32
            Journal inode:            8
            Default directory hash:   half_md4
            Directory Hash Seed:      b68c7dc9-1e4b-46d2-b6d2-6f2d0128b12a
            Journal backup:           inode blocks
            Checksum type:            crc32c
            Checksum:                 0xb0abf812
            Journal features:         journal_incompat_revoke journal_checksum_v3
            Journal size:             1024M
            Journal length:           262144
            Journal sequence:         0x0050f6ac
            Journal start:            153997
            Journal checksum type:    crc32c
            Journal checksum:         0xc52e4213
            

            '4k test'

            static const size_t kFileSize = 1 * 128 * 1024 * 1024; 
             
            static const size_t kWriteTotal = 4 * kFileSize;
             
            static const size_t kBuffSize = 4096;
             
            static const std::array<unsigned char, kBuffSize> kBuf = { 33 };
             
            static const std::array<unsigned char, kBuffSize> kAlignedBuf alignas(kBuffSize) = { 33 };
            

            'nvme, ext4 (rw,relatime), kernel 5.3.0 - POWER9' - 4k writes

            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
             
            Took 49 seconds 300 milliseconds
             
            O_DSYNC cyclic file
            Took 57 seconds 536 milliseconds
             
            Mmapped cyclic file
            Took 75 seconds 627 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 64 seconds 600 milliseconds
             
            O_APPEND append file
            Took 231 seconds 47 milliseconds
             
            Simple append file
            Took 229 seconds 604 milliseconds
            
            

            '2x 12G SAS disks, raid0 - ServeRAID M5210, lvm (1 linear continuous map), xfs (rw,noatime,attr2,inode64,logbufs=8,logbsize=32k,noquota), 5.3.0-24-generic (ubuntu), x86_64' - 4K writes

            File size 131072 Kb
            Writing 524288 Kb to it
             
            Simple cyclic file
            Took 16 seconds 5 milliseconds
             
            O_DSYNC cyclic file
            Took 15 seconds 664 milliseconds
             
            Mmapped cyclic file
            Took 17 seconds 846 milliseconds
             
            O_DIRECT|O_DSYNC cyclic file
            Took 14 seconds 823 milliseconds
             
            O_APPEND append file
            Took 48 seconds 328 milliseconds
             
            Simple append file
            Took 48 seconds 207 milliseconds
            
            

            'xfs_info'

            xfs_info /var
            meta-data=/dev/mapper/ka4_disks-ka4_var isize=512    agcount=4, agsize=73119744 blks
                     =                       sectsz=4096  attr=2, projid32bit=1
                     =                       crc=1        finobt=1 spinodes=0 rmapbt=0
                     =                       reflink=0
            data     =                       bsize=4096   blocks=292478976, imaxpct=5
                     =                       sunit=0      swidth=0 blks
            naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
            log      =internal               bsize=4096   blocks=142812, version=2
                     =                       sectsz=4096  sunit=1 blks, lazy-count=1
            realtime =none                   extsz=4096   blocks=0, rtextents=0
            

            Seems this was set up for 4k too.

            baotiao zongzhi chen added a comment -

            @marko why did you abandon the work on partitioning the redo log? We know that AWS Aurora must have done this work; otherwise they could not partition the data by space_id:page_id across multiple storage nodes. The redo log that modifies a page must stay in the same partition as the page. This is the basic design that lets them apply the redo log and run crash recovery in parallel.

            I really think that a partitioned redo log is a good idea: in a compute-storage separation architecture, InnoDB needs to support much larger data sizes, such as 20T or 100T. POLARDB has met this case. If we don't partition the redo log and the data pages, we can't parallelize well.


            kevg Eugene Kosov (Inactive) added a comment -

            baotiao thank you for your answer! I think that to solve the `read-on-write` issue we can use posix_fadvise() or posix_madvise(). Did you try that? As I understand it, you now fill the file with zeroes to force the OS to cache it. Can you instead pre-read it somehow? At first glance that looks less invasive than writing.

            baotiao zongzhi chen added a comment -

            No. The root cause is that if the write size isn't aligned to the 4k block, the OS needs to read the whole 4k block and then modify the data you want; the write operation requires an extra read operation.

            Filling the file with zeroes solves the allocation of extents when appending to a file: if an address in the file hasn't been written before, an extent needs to be allocated from the filesystem.
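
            In other words, padding every log write up to the device block size turns a read-modify-write into a plain overwrite. A minimal sketch of that rounding (assuming a 4096-byte block size):

            #include <cstddef>
             
            static const size_t kBlockSize = 4096;
             
            /* Round a write length up to the device block size so that a
               direct write never touches a partial block and therefore
               never triggers a read-on-write. */
            static size_t pad_to_block(size_t len)
            {
              return (len + kBlockSize - 1) & ~(kBlockSize - 1);
            }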


            kevg Eugene Kosov (Inactive) added a comment -

            danblack hi. We have become interested in the FUA write optimization (https://bobsql.com/sql-server-on-linux-forced-unit-access-fua-internals/). It's implemented at least on XFS. And you performed benchmarks where `O_DIRECT|O_DSYNC` was the fastest option. Do you have FUA enabled? You can probably check it like this:

            $ dmesg | grep -i fua
            [    1.549434] sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
            

            O_DIRECT|O_DSYNC cyclic file
            pwrite() returned errno Invalid argument
            

            Do you think it was a bug in the testing program?

            Also, maybe you know what `fdatasync()` does for file descriptors opened with `O_DSYNC`? In my understanding it would be a no-op.

            danblack Daniel Black added a comment - - edited

            pwrite() returned errno Invalid argument - I assume this was writing a 512-byte-aligned block when the underlying layer was 4k, as changing to 4k writes got a result for this. Calling fstat on the file and using `st_blksize` probably avoids this (see the sketch after this comment).

            FUA seems to be a standard part of NVMe (which I used), and the Linux NVMe driver has some concept of it based on its codebase.

            There appears to be nothing special about `O_DSYNC` (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/sync.c?h=v5.5#n196); nothing seemingly special in the ext4/xfs implementations either.
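
            A small sketch of the st_blksize probe suggested above (the test program could call this instead of hard-coding the buffer size; the fallback value is an assumption):

            #include <sys/stat.h>
            #include <cstddef>
             
            /* Return the filesystem's preferred I/O block size for an open
               descriptor; direct writes should be a multiple of this. */
            static size_t PreferredBlockSize(int fd)
            {
              struct stat s;
              return fstat(fd, &s) == 0 ? size_t(s.st_blksize) : 4096;
            }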


            marko Marko Mäkelä added a comment -

            baotiao, sorry, I missed your question.

            I think that if we eliminate the undo log pages and the TRX_SYS page and write all user transaction data only to the redo log, we will remove an artificial synchronization point between independent transactions. I have been toying with the idea ever since junsu challenged me to think about how to make more efficient use of NVDIMM or PMEM (byte-addressable persistent storage). The idea would be to write undo log records into the redo log, as many databases do. The MDEV-12353 redo log format does allow this easily. We could even do memory-mapped I/O and let the DB_ROLL_PTR be a direct pointer into the redo log, to speed up MVCC and ROLLBACK. I have not come up with any good solution for redo log checkpointing, though: in this scheme, an old read view or an active transaction can prevent a log checkpoint from being made. (Alternatively, we would have to append old undo log information to the redo log and patch all the DB_ROLL_PTR that point to them.)

            If transactions are truly independent due to not sharing any undo log pages, then I think that the partitioned log should work.

            marko Marko Mäkelä added a comment - - edited

            The current design idea is as follows:

            Checkpoint information & file operations

            There will be a separate file that contains information about log checkpoints and data file names. This file can contain information about multiple checkpoints. (The old ib_logfile0 only has room for 2 checkpoints.)

            The checkpoint log file makes it possible to construct the mapping between numeric tablespace identifiers and file names.
            The checkpoint log file is never encrypted. This allows mariabackup --backup to work without having access to the encryption keys. Because file names are not encrypted in the file system either, and because LSNs appear unencrypted in the diagnostic output, encrypting this file would not offer any security benefit.

            A checkpoint log record comprises the checkpoint LSN and a byte offset in the circular log file, pointing to the log right after the LSN. It will also include the value of the sequence_bit that is described below.
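
            Put concretely, a checkpoint record would carry something like the following fields (an illustrative layout only; the field names are not from the actual design):

            #include <cstdint>
             
            /* Illustrative shape of a checkpoint log record: recovery starts
               scanning the circular log file at file_offset, which corresponds
               to lsn, expecting sequence_bit at that position. */
            struct log_checkpoint_record
            {
              uint64_t lsn;          /* checkpoint LSN (counts mini-transactions) */
              uint64_t file_offset;  /* byte offset in the circular log file */
              bool     sequence_bit; /* expected sequence bit right after the LSN */
            };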

            The circular log

            If any parameters of the log change, the redo log will be rebuilt:

            • innodb_log_file_size
            • innodb_encrypt_log, or the encryption key (key rotation will require the log to be rebuilt)
            • innodb_log_checksums (if we choose to revive this deprecated parameter and implement a variant that lacks checksums)

            There will be no fixed block structure in the circular log file. The LSN will count InnoDB mini-transactions, not bytes. This allows some flexibility: For example, a future version of mariabackup --backup --incremental could inject records to the backup of the main log file, instead of writing separate .delta files.

            The circular log file will consist of length-tagged sequences of bytes:

            byte *append_log(byte *log, const void *payload, size_t size, bool skip_bit, bool sequence_bit)
            {
              size_t length= size;
              if (!skip_bit && innodb_log_checksums)
                length+= 4; /* CRC-32C at the end of the payload */
              byte * const start= log;
              log= mlog_encode_varint(log, length << 2 | skip_bit << 1 | sequence_bit);
              if (!skip_bit)
                memcpy(log, payload, size);
              log+= size;
              if (!skip_bit && innodb_log_checksums)
              {
                /* Always compute the checksum without the sequence_bit. */
                log[-size - 1]&= 0xfe;
                mach_write_to_4(log, ut_crc32(start, log - start));
                log[-size - 1]|= sequence_bit;
                log+= 4;
              }
              return log;
            }
            

            Explanation:

            • If encryption is enabled, the payload will have been encrypted before the log is written. The length and the checksum will not be encrypted.
            • The sequence_bit will be toggled whenever the write position jumps from the end of the circular log file to the beginning.
            • The skip_bit allows us to write a partially filled log block of any size. If we need to persist the log (due to user transaction commit) and we are L bytes into a N-byte block (this depends on the underlying storage!), we can write a special record to say ‘skip the next N-L bytes’. There is no need to initialize (memset()) any skipped garbage bytes.
            • We assume that the log ends when we get a CRC-32C mismatch or the sequence_bit of the next record differs from what we expect. (The last log record could end exactly at a byte offset where a log record before the last wrap-around had been stored, and that record would have a valid checksum.)
            • Note: due to the skip_bit and the lack of memset(), it may be necessary to always store checksums, to reliably detect the end of the log.
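
            For illustration, the reader side could detect the logical end of the log roughly as follows. This is a sketch under the same assumptions as append_log() above: mlog_decode_varint is the hypothetical counterpart of mlog_encode_varint, and the CRC-32C is validated with the sequence bit masked out of the last varint byte, mirroring the writer.

            /* Returns a pointer past one record, or nullptr at the logical end of
               the log (unexpected sequence bit, truncation, or CRC-32C mismatch). */
            const byte *read_record(const byte *log, const byte *end, bool sequence_bit)
            {
              byte *const start= const_cast<byte*>(log);
              uint64_t tag;
              log= mlog_decode_varint(log, &tag); /* hypothetical decoder */
              if (!log || bool(tag & 1) != sequence_bit)
                return nullptr;                  /* wrapped into pre-wrap-around log */
              const bool skip_bit= tag & 2;
              const uint64_t length= tag >> 2;   /* payload, plus CRC if present */
              if (log + length > end)
                return nullptr;                  /* truncated record */
              if (!skip_bit && innodb_log_checksums)
              {
                if (length < 4)
                  return nullptr;
                /* The writer computed the checksum with the sequence bit cleared
                   in the last varint byte; mirror that here. */
                byte *const seq= const_cast<byte*>(log) - 1;
                *seq&= 0xfe;
                const uint32_t crc= ut_crc32(start, (log + length - 4) - start);
                *seq|= sequence_bit;
                if (crc != mach_read_from_4(log + length - 4))
                  return nullptr;                /* logical end of the log */
              }
              return log + length;               /* start of the next record */
            }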
            baotiao zongzhi chen added a comment -

            @marko Marko Mäkelä

            The design right now is that the partitioning is still based on space_id:page_no, and if an mtr modifies multiple pages, then we need to flush all the redo logs that it modified. Is that right? Please correct me if I am misunderstanding something.
            Even though this solution looks rough, I think it is a practical one, since most mtrs modify only a single page; the mtrs that modify multiple pages are mostly SMO operations and undo operations. We could write the undo log into the redo log directly, so that the undo log stays in the same file as the redo log. And the SMO operations are rare.

            However, how can we keep the operation of flushing multiple files atomic?


            kevg Eugene Kosov (Inactive) added a comment -

            > However, how can we keep the operation of flushing multiple files atomic?

            I don't think we have such a problem at all. Only one file with redo data will exist. Writing to it looks like this: the thread which owns the mtr_t prepares a buffer to write to a file (prepends its size, computes a CRC-32C and appends it to the end), then takes the log mutex to write() the buffer, releases the mutex and performs fsync(). And that's it. Then, after some LSN has been flushed to the redo log, we can write(O_DIRECT|O_APPEND) the corresponding checkpoint to another file. Writes to the log file and the checkpoint file are not required to be atomic; thus, it is safe to crash right after flushing the redo log and before writing a checkpoint.
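            A minimal sketch of that commit path, assuming a hypothetical log_fd and log_mutex and a CRC-32C routine my_crc32c() similar to the one InnoDB uses; error handling is omitted:

                #include <cstdint>
                #include <cstring>
                #include <mutex>
                #include <vector>
                #include <unistd.h>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                static std::mutex log_mutex; // hypothetical stand-in for log_sys.mutex
                static int log_fd;           // the single file with redo data

                // Durably append the log of one mini-transaction: length prefix, payload,
                // CRC-32C suffix. Only the write() is serialized; the fsync() is not.
                void commit_mtr(const void *payload, uint32_t size)
                {
                  std::vector<uint8_t> buf(4 + size + 4);
                  memcpy(buf.data(), &size, 4);            // prepend the size
                  memcpy(buf.data() + 4, payload, size);   // the mtr_t log records
                  const uint32_t crc = my_crc32c(0, buf.data(), 4 + size);
                  memcpy(buf.data() + 4 + size, &crc, 4);  // append the checksum
                  {
                    std::lock_guard<std::mutex> guard(log_mutex);
                    write(log_fd, buf.data(), buf.size()); // serialized append
                  }
                  fsync(log_fd);                           // durability, outside the mutex
                }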

            baotiao zongzhi chen added a comment -

            No. If the mtr is an SMO operation, then it will include multi-page operations.
            When such an mtr commits, it needs to change more than one data page, and those pages may live in two different redo files if the log is partitioned by space_id:page_no. When we commit the trx, we then need to guarantee that the write to the two redo logs is atomic. In the original version we only write to one file, so it is easy to guarantee that the write is atomic and we don't have this problem.


            kevg Eugene Kosov (Inactive) added a comment -

            Sorry, I think you have some misunderstanding. There are not several redo log files partitioned by space_id:page_no. Before 10.5 it was possible to have several redo log files, but they were used as one logical circular file. In current 10.5 it is already impossible to have several log files. And the current design for the new file format still assumes just one circular redo log file.

            baotiao zongzhi chen added a comment -

            Sorry, I saw the design document
            "The idea: Partition the log into append-only, truncate-the-start files"

            so I supposed we were talking about partitioning the redo log into multiple redo log files.


            kevg Eugene Kosov (Inactive) added a comment -

            baotiao, this initial design was changed. You can find the recent version in the comments on this issue. We decided to do the simplest possible thing: a metadata file with the creator version and similar info is one separate file, checkpoints + file operations (create, delete, rename) go into a separate append-only file, and the actual redo log data is in a third, separate circular file.

            baotiao zongzhi chen added a comment -

            @Eugene Kosov OK, I got it.

            Let me summarize; the changes are:
            1. separate the checkpoint information from the ib_logfile
            2. add the file operations to the checkpoint file
            3. store the undo data in the redo log, as commented by marko
            4. the LSN counts the number of mtrs, not bytes

            Is that right?

            However, I am really interested in separating the data into multiple redo log files, since in architectures like Aurora and PolarDB there will exist B-trees of about 100 TB; if we don't partition the log, there will be only one redo log file, and we cannot make full use of the underlying storage of the file system.

            marko Marko Mäkelä added a comment - - edited

            baotiao, I have been thinking of the following format:

            • Maintain a 512-byte header in ib_logfile0. Add information needed by innodb_encrypt_log there.
            • After the first 512 bytes, write an append-only log consisting only of file name records (similar to the MDEV-12353 FILE_ records) and fixed-length checkpoint records. Each record will be followed by a CRC-32C checksum. The ib_logfile0 will never be encrypted. (File names and LSNs appear unencrypted in the file system and in the logs anyway.)
            • We write a separate circular file ib_logdata that may be encrypted. The checkpoint records point to a byte offset within this file. This file can support any underlying physical sector size.
            • LSN will be in bytes, just like before. But, encryption might no longer use LSN as part of the initialization vector, so that we can encrypt mtr_t::m_log before acquiring any mutex.
            • Rebuilding the redo log file will not affect the LSN.
            • In the future, mariabackup --backup --incremental could get rid of .delta files and instead write the information to the ib_logfile0 file. Any amount of data can be written to that file without affecting the LSN.

            The circular log file could technically be split into multiple files, but we did not see a need for that. I think 128 TiB should suffice for quite some time in the future. The log checkpoint record would be 1+8+6+4=19 bytes.

            At this point, we will not write undo log data to the redo log. I do not know if we will ever do that. I only mentioned the idea and challenges around it.
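            For illustration, a sketch of how such a 19-byte checkpoint record could be encoded, assuming a hypothetical type tag byte and big-endian fields (the 1+8+6+4 split above: tag, LSN, byte offset, CRC-32C):

                #include <cstdint>
                #include <cstring>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                static void write_be(uint8_t *p, uint64_t v, int n) // big-endian helper
                {
                  for (int i = n; i--; v >>= 8) p[i] = uint8_t(v);
                }

                // Encode a 1+8+6+4 = 19-byte checkpoint record for the append-only log.
                size_t encode_checkpoint(uint8_t *rec, uint64_t lsn, uint64_t data_file_offset)
                {
                  rec[0] = 0x01;                          // hypothetical "checkpoint" tag byte
                  write_be(rec + 1, lsn, 8);              // checkpoint LSN
                  write_be(rec + 9, data_file_offset, 6); // byte offset in the circular file
                  write_be(rec + 15, my_crc32c(0, rec, 15), 4); // CRC-32C over the first 15 bytes
                  return 19;
                }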


            kevg Eugene Kosov (Inactive) added a comment -

            I did some research on writing to a file from different threads. My testing code is in https://github.com/kevgs/redo/
            Here are the results for a single thread, for commit 0400fe0cc3829177c05341413e555c1f09f81b54, on my weak laptop with an HDD:

            File size: 134217728, threads: 1, duration: 20s
             
            Circular file:
            RedoSyncTLSBuffer handled 841 commits
            RedoSync handled 368 commits
            RedoSyncBuffer handled 801 commits
            RedoODirectSparse handled 390 commits
            RedoODirectBuffer handled 513 commits
            RedoODirectTwoBuffers handled 747 commits
            RedoOverlappedFsync handled 670 commits
            RedoOverlappedMsync handled 3235 commits
            RedoGroupCommit handled 744 commits
             
            Append-only file:
            RedoSyncTLSBuffer handled 758 commits
            RedoSync handled 361 commits
            RedoSyncBuffer handled 823 commits
            RedoODirectSparse handled 365 commits
            RedoODirectBuffer handled 756 commits
            RedoODirectTwoBuffers handled 753 commits
            RedoOverlappedFsync handled 786 commits
            RedoGroupCommit handled 766 commits
            

            And for 64 threads:

            File size: 134217728, threads: 64, duration: 20s
             
            Circular file:
            RedoSyncTLSBuffer handled 825 commits
            RedoSync handled 434 commits
            RedoSyncBuffer handled 823 commits
            RedoODirectSparse handled 487 commits
            RedoODirectBuffer handled 851 commits
            RedoODirectTwoBuffers handled 863 commits
            RedoOverlappedFsync handled 11242 commits
            RedoOverlappedMsync handled 58714 commits
            RedoGroupCommit handled 1035 commits
             
            Append-only file:
            RedoSyncTLSBuffer handled 879 commits
            RedoSync handled 514 commits
            RedoSyncBuffer handled 855 commits
            RedoODirectSparse handled 477 commits
            RedoODirectBuffer handled 830 commits
            RedoODirectTwoBuffers handled 896 commits
            RedoOverlappedFsync handled 11318 commits
            RedoGroupCommit handled 1361 commits
            

            In a multithreaded environment, the append-only file has the same performance as the circular one.


            kevg Eugene Kosov (Inactive) added a comment -

            Actually, with fdatasync() instead of fsync(), the picture is different now:

            File size: 134217728, threads: 64, duration: 20s
             
            Circular file:
            RedoOverlappedFsync handled 50247 commits
             
            Append-only file:
            RedoOverlappedFsync handled 14399 commits
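            A plausible explanation is that a preallocated, fixed-size circular file never changes its length, so fdatasync() does not have to flush file-size metadata on every commit, while an append-only file grows on every write. A minimal sketch of a circular append that benefits from this (hypothetical helper; error handling omitted):

                #include <cstdint>
                #include <unistd.h>

                // Append len bytes at logical position *pos of a fixed-size circular file,
                // splitting the write at the wrap-around point, then persist with fdatasync().
                void circular_append(int fd, const uint8_t *buf, size_t len,
                                     uint64_t *pos, uint64_t file_size)
                {
                  uint64_t off = *pos % file_size;
                  size_t first = len;
                  if (off + len > file_size)
                    first = size_t(file_size - off);         // the part before the wrap
                  pwrite(fd, buf, first, off_t(off));
                  if (first < len)
                    pwrite(fd, buf + first, len - first, 0); // the wrapped remainder
                  *pos += len;
                  fdatasync(fd); // the file size is constant, so no metadata flush is needed
                }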
            
            

            nunop Nuno added a comment -

            Hi,

            If " innodb_log_files_in_group " is being removed/ignored, you may want to update the documentation at:
            https://mariadb.com/kb/en/innodb-redo-log/

            Which advices to configure that variable as required.

            Thank you.

            greenman Ian Gilfillan added a comment -

            Thanks nunop, the documentation has been expanded to mention the 10.5 change in each place where it could be relevant, so hopefully it's not misleading any longer.


            marko Marko Mäkelä added a comment -

            I am sorry, but it does not look like this can be completed in the 10.6 release. At the time when I was finally ready to resume this, we were already close to a feature freeze, and I was reluctant to start changing the file format and rewriting the recovery code that late. So, instead I spent the time on addressing another bottleneck: lock_sys.mutex. With MDEV-23855, MDEV-20612 and MDEV-24738 completed, the major remaining scalability bottleneck in InnoDB is log_sys.mutex, which will be addressed by this task, hopefully very early during the 10.7 development cycle.

            marko Marko Mäkelä added a comment - - edited

            I experimented with whether it makes sense to eliminate FILE_MODIFY and similar records from the normal redo log. The idea was to introduce a separate append-only file exclusively for checkpoint and file name information. I ran Sysbench oltp_update_index using 80×10,000 rows and innodb_log_file_size=2G before and after the change, with the server process pinned to a single Intel® Xeon® E5-2630 processor. During the benchmark, the LSN grew to about 12.6 GiB, that is, 6 times the log file size.
            I observed the following numbers of transactions per second with different numbers of concurrent connections:

            server         10/tps  20/tps  30/tps
            10.7           100909  157561  159615
            10.7-modified  100806  159263  160353

            We would seem to need a run with significantly more log checkpoints, because log checkpoints are where I would expect the FILE_MODIFY bookkeeping (the fil_system.named_spaces) to make the most difference. Here is another test with innodb_log_file_size=256M (1/8 of the original log file size):

            server         10/tps  20/tps  30/tps
            10.7            97272  151572  153554
            10.7-modified   99020  154903  157719

            The 2-minute benchmark runs are probably too short for us to draw any conclusions.

            If this change does not appear to consistently lead to a significant improvement, then I think that it would be best to keep the single circular log file, with a slightly changed structure that I think should be friendly for both persistent memory (PMEM) and computational storage drives (such as those by ScaleFlux):

            • 2 checkpoint (and file format) information blocks of 4096 bytes each
            • Circular log file, with arbitrary block size (64 to 4096 bytes); padded with NUL bytes that are not encrypted nor checksummed

            marko Marko Mäkelä added a comment -

            Another benchmark for assessing the impact of eliminating the FILE_MODIFY records showed some improvement at 32 concurrent connections, and virtually no improvement at 16 concurrent connections.

            It might still turn out that changing the log block format and switching to asynchronous O_DIRECT|O_DSYNC writes of log blocks will reduce contention on log_sys.mutex so much that eliminating the FILE_MODIFY records would not bring significant additional benefit. For this reason, it may be wise to first develop the concurrency-friendlier log block format and then test the removal of the FILE_MODIFY records on top of that.

            marko Marko Mäkelä added a comment - - edited

            I am currently debugging a prototype that will retain the FILE_MODIFY records and a single ib_logfile0, to keep backups and log resizing simple.

            Changes to log block format

            1. innodb_encrypt_log metadata (encryption key information) will be moved to the 512-byte log file header block.
            2. The 2 checkpoint blocks will move to 64 bytes at the start of 4096-byte blocks at offsets 4096 and 8192. This should allow O_DIRECT writes in all file systems as well as allow efficient writing of checkpoints on PMEM.
            3. Redo log record data will start at byte offset 12288, right after the 2 checkpoint blocks.
            4. The 512-byte log block structure will be eliminated. Basically, every mini-transaction will be an arbitrary-sized log block.

            It will be easier to read and write the log file, because each mini-transaction will be a contiguous stream of bytes (except when the mini-transaction wraps around from the last byte of ib_logfile0 to byte offset 12288 at the start of the file). See the sketch below for how an LSN maps to a file offset.
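            A sketch of that mapping, assuming the layout above: record data starting at byte offset 12288, LSNs still counting bytes, and start_lsn being the LSN of the first record written at offset 12288:

                #include <cstdint>

                constexpr uint64_t START_OFFSET = 12288; // first byte of log record data

                // Map an LSN to its byte offset in ib_logfile0. The area from offset
                // 12288 to the end of the file is treated as one circular buffer.
                uint64_t lsn_to_offset(uint64_t lsn, uint64_t start_lsn, uint64_t file_size)
                {
                  return START_OFFSET + (lsn - start_lsn) % (file_size - START_OFFSET);
                }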

            The checkpoint block format

            Bytes 4096 to 12287 will be filled with NUL bytes, except for the 64-byte checkpoint blocks, which will contain the following information (a sketch of filling such a block follows the list):

            1. 64-bit checkpoint log sequence number (LSN)
            2. 64-bit LSN of log with optional FILE_MODIFY records and a FILE_CHECKPOINT record pointing to the checkpoint
            3. 64-bit offset in ib_logfile0, pointing to the log record at the checkpoint LSN
            4. 36 bytes of NUL (reserved for future extension)
            5. 32-bit CRC-32C checksum of the 64-byte block
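            A sketch of filling one such 64-byte block, assuming big-endian fields and that the trailing CRC-32C covers the first 60 bytes of the block:

                #include <cstdint>
                #include <cstring>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                static void write_be(uint8_t *p, uint64_t v, int n) // big-endian helper
                {
                  for (int i = n; i--; v >>= 8) p[i] = uint8_t(v);
                }

                // Fill a 64-byte checkpoint block: 8+8+8 bytes of payload, 36 reserved
                // NUL bytes, and a 4-byte CRC-32C.
                void make_checkpoint_block(uint8_t block[64], uint64_t checkpoint_lsn,
                                           uint64_t file_checkpoint_lsn, uint64_t offset)
                {
                  memset(block, 0, 64);                        // covers the reserved bytes
                  write_be(block, checkpoint_lsn, 8);          // checkpoint LSN
                  write_be(block + 8, file_checkpoint_lsn, 8); // LSN of the FILE_CHECKPOINT record
                  write_be(block + 16, offset, 8);             // offset of the checkpoint LSN record
                  write_be(block + 60, my_crc32c(0, block, 60), 4);
                }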

            Changes to log record format

            The 1-byte mini-transaction trailer (a NUL byte) will be replaced with the following:

            1. A byte 0 or 1, corresponding to a "sequence bit" that replaces the 31-bit LOG_BLOCK_HDR_NO field of the log header.
              Each time the log wraps around from the end to offset 12288, this bit will be toggled. The value of the bit is computed based on the log header field LOG_HEADER_START_LSN, which is the LSN of the very first record that was written to the file, at offset 12288.
            2. 32 bits of CRC-32C checksums from the start of the mini-transaction to the end, excluding the sequence bit.
            3. Only for innodb_encrypt_log=ON: 64 bits of nonce that is used as part of the initialization vector.

            This format will allow us to simply memcpy() log records and the checksum to the log buffer, or directly to a redo log that resides in PMEM, and thus reduce contention on log_sys.mutex. In the old format with 512-byte blocks, some memset(), my_crc32c() and encryption_crypt() calls are executed while holding log_sys.mutex. A sketch of writing the new trailer follows.
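            A sketch of appending that trailer to a buffer holding one (possibly already encrypted) mini-transaction, assuming big-endian byte order for the checksum and nonce:

                #include <cstdint>

                extern uint32_t my_crc32c(uint32_t crc, const void *data, size_t len); // assumed

                // Append the trailer: 1 sequence-bit byte, 4 bytes of CRC-32C computed
                // from the start of the mini-transaction up to (but excluding) the
                // sequence bit, and, with innodb_encrypt_log=ON, an 8-byte nonce.
                // Returns the new end of the buffer.
                uint8_t *append_trailer(uint8_t *begin, uint8_t *end, bool sequence_bit,
                                        bool encrypted, uint64_t nonce)
                {
                  *end++ = sequence_bit;
                  const uint32_t crc = my_crc32c(0, begin, size_t(end - 1 - begin));
                  for (int i = 4; i--; )
                    *end++ = uint8_t(crc >> (i * 8));
                  if (encrypted)
                    for (int i = 8; i--; )
                      *end++ = uint8_t(nonce >> (i * 8));
                  return end;
                }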

            Encryption

            To allow the data to be backed up without requiring any decryption, encryption will be limited to the payload bytes of page-level redo log records. That is, checkpoint information, file names, tablespace identifiers and page numbers will be in clear text. It can be argued that this information is already mostly available in clear text even in encrypted data files. Neither the file names in the file system nor the FIL_PAGE_LSN in data pages was ever encrypted. Also the tablespace ID is stored in clear text in the first page of each data file.

            This means that INIT_PAGE and FREE_PAGE records (which lack any payload) will be entirely unencrypted. For applying backed up log to backed up data files, the ability to decrypt the log will be needed.

            Padding

            When we want to write an incomplete log block, we can pad the log to the desired block size (be it 64 or 4096 or any other number of bytes) by writing special FILE_CHECKPOINT records whose payload is filled with NUL bytes. The minimum padding size would be 7 bytes: 0xf1, sequence bit, checksum. If innodb_encrypt_log=ON, each record will be 8 bytes longer, due to a "nonce" being added to each mini-transaction. Normal FILE_CHECKPOINT records cannot be confused with these, because the checkpoint payload will never be 0.

            Because the padding records have to be written while holding log_sys.mutex, we will use pre-computed checksums. To minimize the cache impact, we will use 15 distinct record sizes. For example, 22 bytes could be padded using 2 records when innodb_encrypt_log=OFF and the value of the sequence bit is 1:

            f1 00 01 a6 59 c1 db
            f9 00 00 00 00 00 00 00 00 00 01 ba 73 b2 a3
            

            When innodb_encrypt_log=ON, each record would be 8 bytes longer, and the pad record sizes will range from 15 to 29 bytes. Thus, 22 bytes would be padded using a single record:

            f8 00 00 00 00 00 00 00 00 01 eb 20 12 33 00 00 00 00 00 00 00 00
            

            The log parser will handle any pad record size up to 65536 bytes, but we do not want to compute checksums on pad records while holding log_sys.mutex, and it could be detrimental to performance to have a larger checksum lookup table. A sketch of the size arithmetic follows.
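            A sketch of that size arithmetic, assuming (consistent with the examples above) that the 15 precomputed sizes are 7 to 21 bytes with innodb_encrypt_log=OFF and 15 to 29 bytes with it ON:

                #include <cstddef>

                // Split a gap of `gap` bytes into pad record sizes, using the 15
                // precomputed sizes: 7..21 bytes (innodb_encrypt_log=OFF) or
                // 15..29 bytes (ON). Returns the number of records, or 0 if the
                // gap cannot be represented.
                size_t pad_record_sizes(size_t gap, bool encrypted,
                                        size_t *sizes_out, size_t max_out)
                {
                  const size_t min_rec = encrypted ? 15 : 7;
                  const size_t max_rec = min_rec + 14;   // 15 distinct sizes
                  size_t n = 0;
                  while (gap && n < max_out)
                  {
                    size_t rec = gap > max_rec ? max_rec : gap;
                    if (rec < min_rec)
                      return 0;                          // gap smaller than the minimum record
                    if (gap != rec && gap - rec < min_rec)
                      rec = gap - min_rec;               // keep the remainder representable
                    sizes_out[n++] = rec;
                    gap -= rec;
                  }
                  return gap ? 0 : n;
                }

            For a 22-byte gap this yields record sizes {15, 7} unencrypted and a single 22-byte record encrypted, matching the two examples above.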


            marko Marko Mäkelä added a comment -

            A few mariadb-backup tests are disabled for now (in particular all tests that use --incremental), and there are some crash recovery bugs, for which rr traces are needed. Also, it is possible (and intended) to improve performance later, based on what this format change allows.

            InnoDB will refuse to start up without ib_logfile0, unless innodb_force_recovery=6 is set. This allows MDEV-27199 to stop the inherently risky updates of the field FIL_PAGE_FILE_FLUSH_LSN in the first page of the system tablespace file.


            marko Marko Mäkelä added a comment -

            Most incremental backup tests work now. The final issue was that after the incremental log apply, the dummy log file was being created in the wrong directory (not --target-dir) and thus the backup was being restored with too old a log sequence number in the dummy log file.

            axel Axel Schwenke added a comment - - edited

            preview-10.8-MDEV-14425 commit fe030b137f4 looks promising


            marko Marko Mäkelä added a comment -

            axel, thank you. The branch preview-10.8-MDEV-14425-innodb was updated at least twice since you tested it, to fix some mariadb-backup tests as well as an issue with innodb_encrypt_log: I had made a wrong assumption that a string may be encrypted piecewise in multiple calls to encryption_crypt().

            There still remains some room for performance improvement. In particular, we are writing unaligned data to the ib_logfile0 and never padding it. We have some test failures on Microsoft Windows, possibly related to that. I think that we should enable O_DIRECT writes wherever possible, and write data that is aligned to the physical sector size (or 4096 bytes if the physical sector size cannot be determined); see the sketch below.

            Furthermore, I did not have time to simplify the PMEM interface yet. On PMEM, the physical sector size would be CPU_LEVEL1_DCACHE_LINESIZE (64 bytes on AMD64).
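            The alignment itself is simple arithmetic; a sketch, with the pad records described earlier filling the rounded-up tail:

                #include <cstdint>

                // Round the end of a log write up to the physical sector size (a power
                // of two, e.g. 512 or 4096), as O_DIRECT requires; the gap up to the
                // aligned boundary would be filled with the pad records described earlier.
                uint64_t align_up(uint64_t end_offset, uint64_t sector_size)
                {
                  return (end_offset + sector_size - 1) & ~(sector_size - 1);
                }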


            marko Marko Mäkelä added a comment -

            We never claimed to support (or test) downgrades between major versions. Users who are desperate to downgrade could try the following:

            1. Perform a clean shutdown. Note the log sequence number.
            2. Ensure that the last LSN in the new-format ib_logfile0 matches the one in the shutdown message.
            3. Back up the data directory.
            4. Write that LSN to the system tablespace (see the test mariabackup.huge_lsn for how to do that) and delete the ib_logfile0 file.
            5. Start the older version of MariaDB.

            A mandatory step in the above is that the LSN in the first page of the system tablespace needs to be updated. If that is neglected, the old server will start with a too old LSN (see MDEV-27199), and all InnoDB files that will be modified may become corrupted. An indicator of that are messages like "Page … log sequence number … is in the future".

            mleich Matthias Leich added a comment - - edited

            Preliminary results of RQG testing on origin/preview-10.8-MDEV-14425-innodb 23849209738153bed4ea60f39830305840ee4025 2021-12-19T17:28:12+02:00
            1. The ASAN failures around crc32 (seen on previous tree) have disappeared.
            2. Failure pattern TBR-1310
                kill DB server when being under load, the restart attempt fails with
                mysqld: storage/innobase/rem/rem0rec.cc:304: void rec_init_offsets_comp_ordinary(const rec_t*, ...): Assertion `n_fields <= ulint(index->n_fields) + 1' failed.
                sdp:/data/results/1639941539/TBR-1310/dev/shm/rqg/1639941539/181/1/rr
                The rr trace (mysqld-1), which runs until the injected SIGSEGV, has trouble around its end.
                /data/results/1639941539/TBR-1310/dev/shm/rqg/1639941539/181/1/data_copy
                          Copy of the data dir before restart attempt.
            3. Failure pattern TBR-1311
                kill DB server when being under load, restart with success, SELECT ... FROM ... FORCE INDEX .... harvests 1030,
                [ERROR] InnoDB indexes are inconsistent with what defined in .frm for table ./test/t4
                sdp:/data/results/1639941539/TBR-1311/dev/shm/rqg/1639941539/48/1/rr
                Both rr traces work well.
                /data/results/1639941539/TBR-1311/dev/shm/rqg/1639941539/48/1/data_copy/
                          Copy of the data dir before restart attempt.
            4. Failure pattern TBR-1312
                kill DB server when being under load, the restart attempt fails with
                [ERROR] [FATAL] InnoDB: Page 3242543642:134 name ./test/t6.ibd page_type 32770 key_version 1 lsn 78387097 compressed_len 55514
                sdp:/data/results/1639941539/TBR-1312
                gdb -c dev/shm/rqg/1639941539/158/1/data/core /data/Server_bin/preview-10.8-MDEV-14425-innodbA_asan/bin/mysqld
                           Core at end of restart attempt.
                /data/results/1639941539/TBR-1312/dev/shm/rqg/1639941539/158/1/data_copy/
                           Copy of the data dir before restart attempt.
            5. Most if not all other failures observed occur on the actual main trees 10.6 - 10.8 too
             
            Upgrade (stop is initiated by SIGTERM) from
            10.5.14 origin/10.5 2776635cb98d35867447d375fdc04a44ef11a697 2021-12-16
            to
            10.8.0 origin/preview-10.8-MDEV-14425-innodb 23849209738153bed4ea60f39830305840ee4025 2021-12-19
            Failure patterns (TBR-1313 - TBR-1315)
            1. The restart with preview-10.8... fails with trouble like
                 - InnoDB: Background Page read failed to read, uncompress, or decrypt
                 - InnoDB: Failed to read page ... from file ....: Table is compressed or encrypted but uncompress or decrypt failed
             2. The mysql_upgrade script fails like
                  - MariaDB tried to use the .{1,10} compression, but its provider plugin is not loaded
                  or
                  - Table ... is compressed with ..., which is not currently loaded. Please ... the bzip2 provider plugin to open the table'
                  or
                  -  # ERROR 2013 (HY000) at line 795: Lost connection to server during query
                     # ERROR 2006 (HY000) at line 796: Server has gone away
                     # ERROR: AddressSanitizer: heap-buffer-overflow on address ...
                     # READ of size 19 at 0x602000010bb7 thread T16
                    #0 0x7f9cd1091cff  (/usr/lib/x86_64-linux-gnu/libasan.so.5+0xdacff)
                    #1 0x557eb5408a05 in cmp_data(unsigned long, unsigned long, unsigned char const*, unsigned long, unsigned char const*, unsigned long) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/rem/rem0cmp.cc:322
                #2 0x557eb5404137 in cmp_data_data(unsigned long, unsigned long, unsigned char const*, unsigned long, unsigned char const*, unsigned long) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/rem/rem0cmp.cc:378
                #3 0x557eb57fa0e8 in cmp_dfield_dfield /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/include/rem0cmp.ic:49
                #4 0x557eb57fb12a in eval_cmp(func_node_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/eval/eval0eval.cc:183
                #5 0x557eb57fc3c4 in eval_func(func_node_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/eval/eval0eval.cc:595
                #6 0x557eb57fd04e in eval_exp /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/include/eval0eval.ic:117
                #7 0x557eb57fd522 in if_step(que_thr_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/eval/eval0proc.cc:48
                #8 0x557eb53f497a in que_thr_step /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:611
                #9 0x557eb53f50e1 in que_run_threads_low /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:709
                #10 0x557eb53f5283 in que_run_threads(que_thr_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:729
                #11 0x557eb53f55a9 in que_eval_sql(pars_info_t*, char const*, trx_t*) /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/que/que0que.cc:768
                #12 0x557eb511c76c in innodb_drop_database /data/Server/preview-10.8-MDEV-14425-innodbA/storage/innobase/handler/ha_innodb.cc:1506
                # Query (0x62900004b2d0): DROP DATABASE IF EXISTS performance_schema
                ==> MDEV-27336
            3. There are other failures too, like schema or data content mismatches between the state
                 before and after upgrade. But these were observed on main trees too.
             
            Pseudoupgrade preview-10.8-MDEV-14425-innodb -> preview-10.8-MDEV-14425-innodb
            (origin/preview-10.8-MDEV-14425-innodb 23849209738153bed4ea60f39830305840ee4025)
            1. The failures seen when running this pseudoupgrade on a previous version of a MDEV-14425 development tree are gone.
            2. Other failures observed are known for the main trees too.
             
            
            


            marko Marko Mäkelä added a comment -

            mleich, I would expect the upgrade failures for page_compressed tables to be related to MDEV-12933, and to affect already an upgrade to 10.7. Can you check that? This branch only includes changes to the log file format, not to any data page encryption or compression. However, it is theoretically possible that if recovery fails to find and process some INIT_PAGE records due to a wrongly detected EOF, we would attempt to read a corrupted page that was not supposed to be read during recovery (MDEV-19738).

            Today, I did some cleanup and enabled O_DIRECT access to the log file on Linux when the physical block size is 512 bytes. After my Sysbench based test, the Linux file system cache no longer grew to the size of the ib_logfile0, like it used to do. We really should replace the constant log_sys.BLOCK_SIZE with a variable that we will determine from the operating system.


            mleich Matthias Leich added a comment -

            Upgrade (stop is initiated by SIGTERM) from
            10.5.14 origin/10.5 2776635cb98d35867447d375fdc04a44ef11a697 2021-12-16
            to
            10.7.2 origin/10.7 92a4e76a2c1c15fb44dc0cb05e06d5aa408a8e35 2021-12-14
            The failure patterns TBR-1313 - TBR-1315 were observed.
            This suggests that there are no preview-10.8-MDEV-14425-innodb specific upgrade failures.
            


            marko Marko Mäkelä added a comment -

            I made one more change today, which missed the preview releases. On Linux and Microsoft Windows, we will bypass the file system cache for the redo log if the physical block size is 64 to 4096 bytes. The environments where it was tested had 512-byte or 4096-byte sectors. When the cache is bypassed, you would see a message like this in the server message log:

            2021-12-21 14:02:49 0 [Note] InnoDB: File system buffers for log disabled (block size=4096 bytes)
            

            A final change that I plan to implement is a more efficient PMEM interface, to make log_sys.buf point directly to the persistent memory.


            marko Marko Mäkelä added a comment -

            There now is a new PMEM (MDEV-25090) interface that I have tested on Linux. On Linux, it is also used if innodb_log_group_home_dir (or datadir) points to /dev/shm. A start-up message will identify this interface as follows:

            2022-01-05  7:44:52 0 [Note] InnoDB: Memory-mapped log (10485760 bytes)
            

            It is still possible to avoid using mmap() on tmpfs if you use any other tmpfs mount point, such as --innodb-log-group-home-dir=/run/user/$UID.

            The Linux mmap() based interface for PMEM will only work if the file system has been mounted with -o dax. If the option is missing, conventional file I/O will be used. In this case, I saw a start-up message like this:

            2021-12-21 14:02:49 0 [Note] InnoDB: File system buffers for log disabled (block size=4096 bytes)
            


            marko Marko Mäkelä added a comment -

            A user-visible change is that this is bundled with MDEV-27199. We will require the ib_logfile0 to always exist. Previously, if the file was empty or missing, InnoDB would create a new log file, assuming that all data files are clean and that the field FIL_PAGE_FILE_FLUSH_LSN in the first page of the system tablespace (ibdata1) contains the most recent log sequence number. mariadb-backup --prepare will create a minimal ib_logfile0 file.

            See also my previous note about downgrades to earlier versions (which we do not support). Because with MDEV-27199, we would no longer update the FIL_PAGE_FILE_FLUSH_LSN in the InnoDB system tablespace on shutdown, a simple approach of removing ib_logfile0 and starting up an older version would likely result in a disaster, caused by a rewind of the log sequence number.


             marko Marko Mäkelä added a comment -

             Related to MDEV-27437, I realized that a backup restored from an older version would include an ib_logfile0 whose size is 0 bytes. We must allow upgrade straight from a backup. In that case, we will recover the log sequence number from the FIL_PAGE_FILE_FLUSH_LSN field. If that field contains 0 (like it will after MDEV-27199), we will refuse to start up.
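
             For illustration only, a minimal sketch (not the server's recovery code) of reading that field: FIL_PAGE_FILE_FLUSH_LSN is the 8-byte big-endian value at byte offset 26 of the first page of the system tablespace.

                 #include <cstdio>
                 #include <cstdint>

                 int main()
                 {
                   std::FILE *f= std::fopen("ibdata1", "rb");
                   if (!f)
                     return 1;
                   unsigned char hdr[34]; /* covers offset 26 + 8 bytes */
                   if (std::fread(hdr, 1, sizeof hdr, f) != sizeof hdr)
                     return 1;
                   uint64_t lsn= 0;
                   for (int i= 0; i < 8; i++)
                     lsn= lsn << 8 | hdr[26 + i]; /* big-endian decode */
                   std::printf("FIL_PAGE_FILE_FLUSH_LSN=%llu\n",
                               (unsigned long long) lsn);
                   std::fclose(f);
                   return 0;
                 }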


             marko Marko Mäkelä added a comment -

             After the preview release, in preparation for the PMEM interface, I had removed flush_lock and had unconditionally enabled O_DSYNC on the redo log. This can result in a performance regression on some drives. So, we will only attempt to use O_DIRECT on the log file (if a compatible physical block size is detected on Linux or Windows).

             However, I’d change innodb_flush_method=O_DSYNC to enable O_DIRECT on data files as well. I see no reason to disable O_DIRECT.
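
             As a sketch of the intended logic (the helper function and its fallback are illustrative assumptions, not the actual server code): O_DIRECT is attempted only when a compatible physical block size was detected, with a fallback to buffered I/O.

                 #define _GNU_SOURCE /* for O_DIRECT on Linux with a C compiler */
                 #include <fcntl.h>

                 /* blk_size: the physical block size detected for the device;
                    O_DIRECT requires all I/O to be aligned to it. */
                 int open_log(const char *path, unsigned blk_size)
                 {
                   if (blk_size >= 64 && blk_size <= 4096)
                   {
                     int fd= open(path, O_RDWR | O_DIRECT);
                     if (fd != -1)
                       return fd; /* file system cache bypassed */
                   }
                   return open(path, O_RDWR); /* buffered I/O */
                 }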

            Some counters related to os_file_flush() (fsync(), fdatasync() or similar) will be cleaned up. I do not think that it makes sense to have a counter of pending log fsync operations, or a separate counter of log flush operations.

            marko Marko Mäkelä added a comment - - edited

            There was a performance problem with the mmap() based interface when the redo log is located in /dev/shm or a mount -o dax PMEM device:

                if (log_sys.buf_free >= log_sys.max_buf_free)
                  log_sys.set_check_flush_or_checkpoint();
            

            The field log_sys.max_buf_free is only applicable to the pwrite() based interface. That code must not be executed for the mmap() based log, because it will cause other threads to acquire log_sys.mutex very frequently, to ensure that a pwrite() will be issued.
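
             A sketch of the corresponding fix (is_pmem() is an assumed name for however the mmap() based log is distinguished, not necessarily the actual member):

                 /* Only the pwrite() based log has a meaningful max_buf_free;
                    skip the check entirely for the memory-mapped log. */
                 if (!log_sys.is_pmem() &&
                     log_sys.buf_free >= log_sys.max_buf_free)
                   log_sys.set_check_flush_or_checkpoint();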

             In a quick local test on /dev/shm, the mmap() interface (including the possible overhead of pmem_persist()) gave about 5% better throughput than the pwrite() and fdatasync() based log.

            On the PMEM device that I tested, the pmem_deep_persist() introduced a slowdown of several orders of magnitude, compared to pmem_persist(), which did not incur any significant overhead. The old code used pmem_memcpy_persist(), which would seem to pair with pmem_persist(). Both should be fine if the PMEM device guarantees durable writes in the event of a sudden power loss.
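
             For reference, a sketch of the libpmem idioms discussed above (log_buf, offset, rec and len are illustrative names):

                 /* Copy, then flush the CPU caches for the written range. */
                 memcpy(log_buf + offset, rec, len);
                 pmem_persist(log_buf + offset, len);

                 /* The same in a single call, as the old code did. */
                 pmem_memcpy_persist(log_buf + offset, rec, len);

                 /* pmem_deep_persist(log_buf + offset, len) would additionally
                    flush to the deepest power-fail safe domain; on the tested
                    device it was orders of magnitude slower. */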

            axel Axel Schwenke added a comment -

             Benchmark results for commit 81cf92e9471 (81cf92e9471.pdf) show good results for the UPDATE workload. The 90:10 numbers show an anomaly that I cannot explain yet: the baseline (vanilla 10.8, commit a81c75f5a96) behaves better at 128 and 256 threads. That was not the case for previous 10.8 commits.


             marko Marko Mäkelä added a comment -

             axel, thank you. The anomaly that you observed could be because O_DIRECT was enabled on the log file. The preview release, which you had tested earlier, did not enable O_DIRECT, and it was issuing unaligned writes to the log file.

             On my system, with Linux kernel 5.15.5 and ext4fs on an NVMe drive with 512-byte block size, I observed yesterday that O_DIRECT|O_DSYNC is slightly faster than O_DIRECT, and that the fastest option was to open the log file without O_DIRECT and to issue an explicit fdatasync() for durability. On Microsoft Windows, we have used FILE_FLAG_NO_BUFFERING (the equivalent of O_DIRECT) already since MDEV-16264 (10.5) if the physical block size is 512 bytes.

            Later yesterday, I updated the branch to only use O_DIRECT on the log file together with O_DSYNC, that is, when innodb_flush_method=O_DSYNC is specified. In this branch, that setting will also enable O_DIRECT (along with O_DSYNC) for data files. I hope that this will fix the anomaly for you as well.

            wlad Vladislav Vaintroub added a comment - - edited

             A correction: we always used FILE_FLAG_NO_BUFFERING on Windows, on the redo log only, for innodb_flush_log_at_trx_commit=1, until 10.8; MySQL did that, too. MDEV-16264 did not change any logic in that regard.

             Now, O_DIRECT|O_DSYNC might appear faster because it only actually flushes the hardware disk buffers in rare cases. The rare cases are "FUA"-capable hardware, which is probably not the case for marko's setup, or hardware where disk write buffering is disabled. So, always flushing should be the way to go, unless the user explicitly sets innodb_flush_method=O_DSYNC, in which case we assume the user knows what they are doing. Otherwise, ACID durability can well be compromised.

             I'd also like to ask axel to benchmark the thread pool under heavy write workloads, mostly because there was a big improvement for that case in 10.6, and I'd like to see on premium hardware that it was not nullified by the patch.


             stephane@skysql.com VAROQUI Stephane added a comment -

             Is there a way to load data into an empty table and have the tablespace page-compressed right away, to save space on disk? From experimentation on 10.6, this only works if I set the maximum dirty page percentage to 0, so that all dirty pages are flushed right away. I remember something about minimal flushing activity to disk when I/O bandwidth is available; is there an MDEV for that feature?


             marko Marko Mäkelä added a comment -

             stephane@skysql.com, data page flushing is not directly related to these changes, other than the fact that I would change innodb_flush_method=O_DSYNC to behave like innodb_flush_method=O_DIRECT for data files, only adding the O_DSYNC attribute. axel tested page_compressed some time ago; I thought it was related to MDEV-11068, but I did not find the graphs. In the end, we found that writing uncompressed tables to a thinly provisioned SSD (ScaleFlux computational storage device) was not only fastest, but also resulted in the best compression. Related to page flushing when the server is idle, you might want to check MDEV-24949.


             marko Marko Mäkelä added a comment -

             If the performance regression occurs because both buf_pool.mutex and log_sys.mutex are heavily contended at a large number of concurrent connections, it could help to disable the adaptive spinning (MY_MUTEX_INIT_FAST) for buf_pool.mutex. We currently enable it on log_sys.mutex only on ARMv8 (see MDEV-26855). I tried enabling the spinning for log_sys.mutex on my AMD64 system a couple of days ago, and the throughput nearly halved. So, I would suggest disabling the spinning:

            diff --git a/storage/innobase/buf/buf0buf.cc b/storage/innobase/buf/buf0buf.cc
            index e7fc3264d60..1127a191f7e 100644
            --- a/storage/innobase/buf/buf0buf.cc
            +++ b/storage/innobase/buf/buf0buf.cc
            @@ -1175,7 +1175,7 @@ bool buf_pool_t::create()
               while (++chunk < chunks + n_chunks);
             
               ut_ad(is_initialised());
            -  mysql_mutex_init(buf_pool_mutex_key, &mutex, MY_MUTEX_INIT_FAST);
            +  mysql_mutex_init(buf_pool_mutex_key, &mutex, nullptr);
             
               UT_LIST_INIT(LRU, &buf_page_t::LRU);
               UT_LIST_INIT(withdraw, &buf_page_t::list);
            

             I believe that this format change could enable scalability improvements, such as asynchronous log writes, or interleaving of flush_lock and write_lock. Such attempts did not help in the past, possibly because the old format requires log_sys.mutex to be held during the memset() or my_crc32c() of each 512-byte log block.


             marko Marko Mäkelä added a comment -

             In addition to the buf_pool.mutex spinloop removal, I wanted to see whether applying MDEV-26827 on top would help. MDEV-26827 is expected to reduce contention on buf_pool.mutex, but in the past it caused a performance regression. According to axel's tests, that still seems to be the case.


             mleich Matthias Leich added a comment -

             The tree
             origin/bb-10.8-MDEV-14425 614e46b89ffe7357e5b72ea0d0fd3f490567a384 2022-01-13T20:32:56+02:00
             behaved well in RQG testing. The bad effects observed exist in the main trees too and are known. Test batteries used:
             - InnoDB standard test battery, covering a broad range of functionality
             - upgrade test battery (10.5 -> bb-10.8-MDEV-14425)
             - test battery for crash recovery


             marko Marko Mäkelä added a comment -

             mleich, thank you. I have since then rebased the tree for final testing.

             • The two commits of MDEV-26827 are omitted. (Including them had been an attempt to see if performance would improve.)
             • No change to the buf_pool.mutex initialization is made (the adaptive spinloop will be allowed).
             • The redo log will be opened in O_DIRECT mode when the physical block size can be determined.
             • Some fixes of 10.5 or 10.6 bugs that were found during testing are included.
            axel Axel Schwenke added a comment -

             Added latest benchmark results:

             • using both NUMA nodes: NUMA_2.pdf
             • using only one NUMA node: NUMA_1.pdf
             • comparison between the two: NUMA_1vs2.pdf

            All runs with PFS disabled and using SSD storage.

             The graphs show a significant difference in variance. The 2-NUMA-node results are in general nearer to each other; this could mean that the test was maxing out the SSD storage (two SATA SSDs in RAID 0). With 1 NUMA node, the differences between commits were bigger. Recommended final configuration: commit 81cf92e9471 with spinning on buf_pool.mutex disabled (the light blue line).

             When comparing 2-NUMA-node performance to 1 NUMA node, scalability looks good. The second NUMA domain adds ~60% (write-only) to ~80% (all other workloads) to performance. Only when operating in AUTO_COMMIT mode does it fail to scale.


             mleich Matthias Leich added a comment -

             The tree
             origin/bb-10.8-MDEV-14425 7f75466f539b61d3dc8696e72a2d715c59aa04d6 2022-01-14T19:52:40+02:00
             behaved well in RQG testing. The bad effects observed happen on other trees too.


             marko Marko Mäkelä added a comment -

             I rebased the bb-10.8-MDEV-14425 branch once more, so that krunalbauskar can test it. Previously, his tests were contaminated by MDEV-27499.


             marko Marko Mäkelä added a comment -

             An observation was made during testing: rr record mariadb-backup --backup cannot work reliably if the redo log file was opened via mmap(). The reason is that rr assumes that mmap()ed file contents may only be changed by the traced process(es). To avoid bogus failures while running backup under rr, there are a few possible solutions:

            • Build without libpmem: rm CMakeCache.txt; cmake -DCMAKE_DISABLE_FIND_PACKAGE_PMEM=1 /path/to/source
            • Place the server’s redo log somewhere else than /dev/shm or a PMEM device mounted with -o dax.
             • Patch log_t::attach() so that mmap() will not be attempted if srv_operation == SRV_OPERATION_BACKUP (see the sketch after this list).
            • Implement server-side backup (MDEV-14992).
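
             A minimal sketch of the third option (the placement inside log_t::attach() and the local flag are assumptions for illustration):

                 /* In log_t::attach(), before attempting to mmap() the log file: */
                 bool use_mmap= mmap_is_available; /* hypothetical condition */
                 if (srv_operation == SRV_OPERATION_BACKUP)
                   use_mmap= false; /* keep rr-traced backup on regular file I/O */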

             mleich Matthias Leich added a comment -

             The tree
                 origin/bb-10.8-MDEV-14425 14eaeb68e60626a9b1e4f4b611f0bc23a79f7202 2022-01-20T13:35:18+02:00
             performed sufficiently well in RQG testing focused on Mariabackup.
             There was a surprising number of bad effects (unknown to me, but maybe already in JIRA).
             But the same test battery applied to
                 origin/10.8 baef53a70c675da6d19ac3c7f23c7b8b4ed8458c 2022-01-20T16:01:10+01:00
             showed nearly the same bad effects, and in sum not fewer.

             Hence I stop testing now and vote for integrating MDEV-14425 into 10.8 if
             the corresponding MTR tests pass.


             marko Marko Mäkelä added a comment -

             Thank you to everyone who tested this and provided feedback.

            As recommended by axel, spinning on buf_pool.mutex was disabled, except on ARMv8.

             Based on performance tests on 512-byte block devices by myself and krunalbauskar, we will not enable O_DIRECT on the redo log on Linux by default. With the setting innodb_flush_method=O_DSYNC we will enable O_DIRECT on the log file as well as on data files. On Microsoft Windows, buffering had already been disabled for the redo log.

             On operating systems other than Linux and Microsoft Windows, writes to the log will keep using a block size of 512 bytes and will not bypass any file system cache. Changing that would require implementing a way to detect the physical block size.
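
             For comparison, a sketch of how the detection can work on Linux for a block device file descriptor (the 512-byte fallback is an assumption for illustration):

                 #include <sys/ioctl.h>
                 #include <linux/fs.h>

                 /* Physical block size of the block device behind fd,
                    or 512 if the ioctl is unavailable. */
                 unsigned phys_block_size(int fd)
                 {
                   unsigned size;
                   return ioctl(fd, BLKPBSZGET, &size) ? 512 : size;
                 }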


             People

               Assignee: marko Marko Mäkelä
               Reporter: marko Marko Mäkelä
               Votes: 8
               Watchers: 33
