[MDEV-24854] Change innodb_flush_method=O_DIRECT by default Created: 2021-02-12 Updated: 2024-02-05 Resolved: 2021-02-20 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Fix Version/s: | 10.6.0 |
| Type: | Task | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | performance | ||
| Issue Links: |
|
| Description |
|
We have had innodb_use_native_aio=ON by default since the introduction of that parameter in MariaDB 5.5. However, to really benefit from the setting, the files should be opened in O_DIRECT mode, to bypass the file system cache. In this way, the reads and writes can be submitted with DMA, using the InnoDB buffer pool directly, and no processor cycles need to be used for copying data.

The setting O_DIRECT should be equivalent to the old default innodb_flush_method=fsync in other aspects. Only the file system cache will be bypassed.

Note: innodb_flush_method=O_DIRECT in combination with a tiny innodb_buffer_pool_size may cause a significant performance regression, because we will no longer be able to take advantage of the file system cache of the operating system kernel. The InnoDB buffer pool will completely replace it. Affected users should configure innodb_flush_method=fsync.

This change will not affect Microsoft Windows. The default there is innodb_flush_method=unbuffered, which is roughly equivalent to O_DIRECT. |
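The write path described above can be sketched in C as follows. This is a minimal illustration for Linux, not InnoDB code: the helper name direct_write_page and the 4096-byte alignment are assumptions made for the example, and the alignment actually required depends on the device and file system.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT on Linux/glibc */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not InnoDB code: write one 4096-byte page through
   O_DIRECT and make it durable. Returns bytes written, or -errno. */
long direct_write_page(const char *path)
{
    enum { PAGE = 4096 };     /* assumed alignment; real devices may differ */
    void *buf = NULL;
    long result;

    int fd = open(path, O_CREAT | O_WRONLY | O_DIRECT, 0600);
    if (fd < 0)
        return -errno;        /* e.g. EINVAL where O_DIRECT is unsupported */

    /* O_DIRECT needs an aligned buffer, file offset, and length. */
    int rc = posix_memalign(&buf, PAGE, PAGE);
    if (rc != 0) {
        close(fd);
        return -rc;
    }
    memset(buf, 0xAB, PAGE);

    ssize_t n = pwrite(fd, buf, PAGE, 0);
    result = (n < 0) ? -errno : (long) n;

    /* As with innodb_flush_method=O_DIRECT: data write, then fdatasync(). */
    if (n >= 0 && fdatasync(fd) != 0)
        result = -errno;

    free(buf);
    close(fd);
    return result;
}
```

A caller would treat a negative return value as a hint to fall back to buffered I/O, which corresponds to configuring innodb_flush_method=fsync.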
| Comments |
| Comment by Marko Mäkelä [ 2021-02-19 ] |
|
I was planning to change the semantics of O_DIRECT on Linux, so that we would switch the file to O_DSYNC mode and avoid the fdatasync() calls. Alas, this would require larger code changes, because fcntl() cannot be used to set the O_SYNC or O_DSYNC flags on an already open file descriptor.

On my system (Linux 5.10 kernel and ext4 file system), I did not notice any performance difference between O_DIRECT_NO_FSYNC and O_DIRECT. I tested performance using a small redo log so that log checkpoints will cause frequent page flushing.

With the previous default innodb_flush_method=fsync the throughput would vary a lot, and be slightly lower on average, on both libaio and liburing. |
| Comment by Marko Mäkelä [ 2021-02-20 ] |
|
I suspect that innodb_flush_method=O_DIRECT_NO_FSYNC could be safe for most cases (overwriting a non-sparse block in file). When extending data files, InnoDB should always invoke explicit fdatasync(). The setting innodb_flush_method=O_DIRECT_NO_FSYNC could be unsafe when using sparse files (page_compressed tables), because as far as I understand, such writes may require updating file system metadata. |
| Comment by Marko Mäkelä [ 2021-12-21 ] |
|
I found a plausible claim regarding when fdatasync() is needed after an O_DIRECT write:
These are rather rare cases, so the overhead of a no-op fdatasync() call should be relatively small.

The InnoDB implementation in MariaDB does attempt to extend files using fallocate(), and falls back to actually writing NUL bytes if that operation fails. In both cases, some metadata will have to be updated after the write of the data, so that in case the operating system is killed and restarted, the file will be recovered with the correct length or contents of the block. File system metadata such as the length and the sector mapping of the file will not be updated as part of the data write. I assume that when fallocate() is supported, it will quickly update the metadata to say that the data has not been initialized yet, and must be 'read' as NUL bytes. |
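The extend-then-sync pattern described in this comment could look roughly like the following. It is a simplified sketch, not the InnoDB implementation: extend_file is a hypothetical helper, the 4096-byte chunk size is arbitrary, and interrupted system calls (EINTR) are not retried.

```c
#define _GNU_SOURCE           /* exposes fallocate() on Linux/glibc */
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: grow fd to new_size bytes and persist the new
   length. Returns 0 on success, -1 on error. */
int extend_file(int fd, off_t new_size)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (new_size <= st.st_size)
        return 0;             /* nothing to extend */

    if (fallocate(fd, 0, st.st_size, new_size - st.st_size) != 0) {
        /* Fallback: actually write NUL bytes up to new_size. */
        static const char zeros[4096];
        off_t pos = st.st_size;
        while (pos < new_size) {
            off_t left = new_size - pos;
            size_t chunk = left < (off_t) sizeof zeros
                ? (size_t) left : sizeof zeros;
            ssize_t n = pwrite(fd, zeros, chunk, pos);
            if (n <= 0)
                return -1;
            pos += n;
        }
    }

    /* Either way the file length changed, so flush the metadata too. */
    return fdatasync(fd);
}
```

The explicit fdatasync() at the end is the point of the example: the file length is metadata, so it is not made durable by the data write alone.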
| Comment by Marko Mäkelä [ 2022-01-21 ] |
|
Starting with wlad commented that on devices that do not support FUA (Force Unit Access), writes may be cached and could disappear in the event of a sudden power loss. A separate command needs to be issued to make them durable. That separate command should be part of fdatasync() or fsync(). In other words, innodb_flush_method=O_DSYNC might not be safe to use on some storage that does not support FUA.

elenst reported that on a Fedora Rawhide system that uses btrfs by default, InnoDB fails to start up because of data page corruption. It would appear to work correctly when started up with innodb_flush_method=fsync. This would seem to suggest that there is a bug in the O_DIRECT implementation of btrfs. I cannot imagine any theoretical reason why O_DIRECT would not work on a copy-on-write file system like btrfs or xfs or zfs. Even if there were some technical reason, then btrfs could simply return EINVAL to the fcntl() system call, like tmpfs does. |
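The EINVAL behaviour mentioned at the end of this comment suggests a simple fallback pattern. The sketch below is an illustration under that assumption, with enable_direct_io as a hypothetical name: it tries to switch an open descriptor to O_DIRECT via fcntl() and reports whether the file system accepted the request.

```c
#define _GNU_SOURCE           /* exposes O_DIRECT on Linux/glibc */
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper: returns 1 if O_DIRECT was enabled, 0 if the file
   system rejected it with EINVAL, -1 on any other error. */
int enable_direct_io(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;

    if (fcntl(fd, F_SETFL, flags | O_DIRECT) == 0)
        return 1;

    /* EINVAL: keep using buffered I/O on this file. */
    return errno == EINVAL ? 0 : -1;
}
```

A return value of 0 leaves the descriptor in buffered mode, so the caller still needs fsync() or fdatasync() for durability, as with innodb_flush_method=fsync.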
| Comment by Daniel Black [ 2022-01-24 ] |
|
| Comment by Daniel Black [ 2022-01-24 ] |
|
btrfs in the above test also succeeded without direct I/O
survived reinstall and sysbench prepare and innodb_flush_method=O_DIRECT |
| Comment by Marko Mäkelä [ 2022-03-18 ] |
|
For In my test, O_DSYNC was slower on the HDD, and slightly faster on the SSD and NVMe drives.

According to https://lwn.net/Articles/400541/ Linux should work correctly on devices that lack FUA support. The unsafety claim that wlad made matches the situation before 2010: https://linux-scsi.vger.kernel.narkive.com/yNnBRBPn/o-direct-and-barriers |
| Comment by Marko Mäkelä [ 2022-09-01 ] |
|
|