I conducted some more tests, also comparing innodb_flush_method=O_DSYNC to this revised innodb_flush_method=O_DIRECT in a Sysbench oltp_update_non_index workload that almost completely avoids log checkpoints and concentrates on the transaction commit latency with innodb_flush_log_at_trx_commit=1.
On the 3 devices that I tested with the Linux 5.16.14 kernel, the ext4 file system, and io_uring, O_DSYNC was slightly faster on an NVMe drive as well as a SATA SSD (both with a 512-byte physical block size), and slightly slower on a SATA 3.0 HDD (with a 4096-byte physical block size). None of the devices supports FUA mode, which according to https://lwn.net/Articles/400541/ implies that each O_DSYNC write may have to execute both a write and a cache flush inside the kernel. Apparently, on the solid-state drives that I tested, this extra flush ends up costing less time than what is saved by skipping the fdatasync() system calls.
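To make the comparison concrete, here is a minimal C sketch (my own illustration, not MariaDB code) of the two commit paths: a write to a file opened with O_DIRECT|O_DSYNC, which must be durable when the write returns, versus a plain O_DIRECT write followed by an explicit fdatasync(). The file names and the 4096-byte block size are made-up example values.

```c
/* Minimal sketch, not MariaDB source: contrasts the two redo log commit
 * paths compared above.  The file names and the 4096-byte block size are
 * illustrative assumptions; with O_DIRECT, the buffer address and the
 * write size must be aligned to the device block size. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

enum { BLOCK = 4096 };

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0)
        return 1;
    memset(buf, 0, BLOCK);

    /* Path 1: O_DIRECT|O_DSYNC -- the write is durable when it returns.
     * Without FUA support, the kernel may have to issue both the write
     * and a device cache flush for every such write. */
    int fd = open("ib_logfile0.dsync",
                  O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0600);
    if (fd != -1) {
        if (pwrite(fd, buf, BLOCK, 0) != BLOCK)
            perror("pwrite");
        /* no separate fdatasync() needed on this path */
        close(fd);
    }

    /* Path 2: plain O_DIRECT -- an explicit fdatasync() per commit is
     * needed to flush the device write cache. */
    fd = open("ib_logfile0.direct", O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd != -1) {
        if (pwrite(fd, buf, BLOCK, 0) != BLOCK)
            perror("pwrite");
        fdatasync(fd);
        close(fd);
    }

    free(buf);
    return 0;
}
```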
On the SATA SSD, less than a minute into the benchmark, the write speed dropped to about a third. I assume that the drive had run out of free flash erase blocks (the volume was rather full) and had to start throttling writes. This occurred at about the same time both with and without O_DSYNC.
Apparently, until some time before 2010, Linux could wrongly return from an O_DSYNC write as soon as the data had been written to the drive's write cache, that is, the flush may have been skipped: https://linux-scsi.vger.kernel.narkive.com/yNnBRBPn/o-direct-and-barriers
Before MDEV-14425, writes to the InnoDB redo log ib_logfile0 were always buffered on Linux. In MDEV-14425, the setting innodb_flush_method=O_DSYNC enabled O_DIRECT on InnoDB log and data files. With this change, we will also enable O_DIRECT on the InnoDB log for the following settings:
innodb_flush_method=O_DIRECT_NO_FSYNC
innodb_flush_method=O_DIRECT (the default setting since MDEV-24854).
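For illustration, either of the affected settings would be declared in the server configuration roughly as follows (the section name and file layout are only an example, not taken from this change):

```ini
[mariadb]
# With this change, either setting opens the InnoDB log ib_logfile0 with O_DIRECT.
innodb_flush_method = O_DIRECT
#innodb_flush_method = O_DIRECT_NO_FSYNC
```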