[MDEV-24854] Change innodb_flush_method=O_DIRECT by default Created: 2021-02-12  Updated: 2024-02-05  Resolved: 2021-02-20

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.6.0

Type: Task Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: performance

Issue Links:
Problem/Incident
causes MDEV-33095 innodb_flush_method=O_DIRECT creates ... Closed
Relates
relates to MDEV-26040 os_file_set_size() may not work on O_... Closed
relates to MDEV-27772 Performance regression with default c... Open
relates to MDEV-28766 MDEV-28111 breaks innodb_flush_log_at... Closed
relates to MDEV-28909 Write performance not scale to NVMe SSD Stalled
relates to MDEV-30136 Map innodb_flush_method to new settab... Closed
relates to MDEV-30986 Slow full index scan in 10.6 vs 10.5 ... Closed
relates to MDEV-32067 InnoDB linear read ahead had better b... Open
relates to MDEV-32357 Make innodb_flush_method dynamic Open
relates to MDEV-33379 innodb_log_file_buffering=OFF causes ... In Progress
relates to MDEV-16526 Overhaul the InnoDB page flushing Closed
relates to MDEV-23855 InnoDB log checkpointing causes regre... Closed
relates to MDEV-27900 aio returning partial reads/writes lo... Closed

 Description   

We have innodb_use_native_aio=ON by default since the introduction of that parameter in MariaDB 5.5. However, to really benefit from the setting, the files should be opened in O_DIRECT mode, to bypass the file system cache. In this way, the reads and writes can be submitted with DMA, using the InnoDB buffer pool directly, and no processor cycles need to be used for copying data.

The setting O_DIRECT should be equivalent to the old default innodb_flush_method=fsync in other aspects. Only the file system cache will be bypassed.

Note: innodb_flush_method=O_DIRECT in combination with a tiny innodb_buffer_pool_size may cause a significant performance regression, because we will no longer be able to take advantage of the file system cache of the operating system kernel. The InnoDB buffer pool will completely replace it. Affected users should configure innodb_flush_method=fsync.

This change will not affect Microsoft Windows. The default there is innodb_flush_method=unbuffered, which is roughly equivalent to O_DIRECT.



 Comments   
Comment by Marko Mäkelä [ 2021-02-19 ]

I was planning to change the semantics of O_DIRECT on Linux, so that we would switch the file to O_DSYNC mode and avoid the fdatasync() calls. Alas, this would require larger code changes, because fcntl() cannot be used to set the O_SYNC or O_DSYNC flags on an already open file descriptor.

On my system (Linux 5.10 kernel and ext4 file system), I did not notice any performance difference between O_DIRECT_NO_FSYNC and O_DIRECT. I tested performance using a small redo log so that log checkpoints would cause frequent page flushing. With the previous default innodb_flush_method=fsync, the throughput varied a lot and was slightly lower on average, with both libaio and liburing.

Comment by Marko Mäkelä [ 2021-02-20 ]

I suspect that innodb_flush_method=O_DIRECT_NO_FSYNC could be safe for most cases (overwriting a non-sparse block in a file). When extending data files, InnoDB should always invoke an explicit fdatasync().

The setting innodb_flush_method=O_DIRECT_NO_FSYNC could be unsafe when using sparse files (page_compressed tables), because as far as I understand, such writes may require updating file system metadata.

Comment by Marko Mäkelä [ 2021-12-21 ]

I found a plausible claim regarding when fdatasync() is needed after an O_DIRECT write:

  • If the file is being extended as part of the write.
  • If this is the first write after the space had been allocated with fallocate().

These are rather rare cases, so the overhead of a no-op fdatasync() call should be relatively small. The InnoDB implementation in MariaDB does attempt to extend files using fallocate(), and falls back to actually writing NUL bytes if that operation fails.

In both cases, some metadata will have to be updated after the write of the data, so that in case the operating system is killed and restarted, the file will be recovered with the correct length or contents of the block. File system metadata such as the length and the sector mapping of the file will not be updated as part of the data write. I assume that when fallocate() is supported, it will quickly update the metadata to say that the data has not been initialized yet, and must be 'read' as NUL bytes.

Comment by Marko Mäkelä [ 2022-01-21 ]

Starting with MDEV-14425 (MariaDB Server 10.8.1), the setting innodb_flush_method=O_DSYNC will enable O_DIRECT on data files (whose page size is at least 4096 bytes) and log files (if the physical block size can be determined to be between 64 and 4096 bytes).

wlad commented that on devices that do not support FUA (Force Unit Access), writes may be cached and could disappear in the event of a sudden power loss. A separate command needs to be issued to make them durable. That separate command should be part of fdatasync() or fsync(). In other words, innodb_flush_method=O_DSYNC might not be safe to use on some storage that does not support FUA.

elenst reported that on a Fedora Rawhide system that uses btrfs by default, InnoDB fails to start up because of data page corruption. It would appear to work correctly when started up with innodb_flush_method=fsync. This would seem to suggest that there is a bug in the O_DIRECT implementation of btrfs. I cannot imagine any theoretical reason why O_DIRECT would not work on a copy-on-write filesystem like btrfs or xfs or zfs. Even if there were some technical reason, then btrfs could simply return EINVAL to the fcntl() system call, like tmpfs does.

Comment by Daniel Black [ 2022-01-24 ]

btrfs test 5.15.14-200.fc35.x86_64

$ dd if=/dev/zero  of=../btrfs.blk bs=1M count=2K
2048+0 records in
2048+0 records out
2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.50241 s, 1.4 GB/s
 
$ sudo losetup --direct-io=on  -f ../btrfs.blk 
 
$ sudo mkfs.btrfs /dev/loop6
btrfs-progs v5.15.1 
See http://btrfs.wiki.kernel.org for more information.
 
Performing full device TRIM /dev/loop6 (2.00GiB) ...
NOTE: several default settings have changed in version 5.15, please make sure
      this does not affect your deployments:
      - DUP for metadata (-m dup)
      - enabled no-holes (-O no-holes)
      - enabled free-space-tree (-R free-space-tree)
 
Label:              (null)
UUID:               c364f0a2-b9a9-4b13-a21a-7639fd896765
Node size:          16384
Sector size:        4096
Filesystem size:    2.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP             102.38MiB
  System:           DUP               8.00MiB
SSD detected:       yes
Zoned device:       no
Incompat features:  extref, skinny-metadata, no-holes
Runtime features:   free-space-tree
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1     2.00GiB  /dev/loop6
 
$ sudo mount /dev/loop6  /mnt/
 
$ sudo chown dan: /mnt/
 
$ scripts/mysql_install_db --no-defaults --srcdir=$OLDPWD --builddir=$PWD --datadir=/mnt/dd
Installing MariaDB/MySQL system tables in '/mnt/dd' ...
OK
 
$   sql/mysqld --no-defaults --skip-networking --datadir=/mnt/dd  --verbose
2022-01-24 14:35:49 0 [Note] sql/mysqld (server 10.8.0-MariaDB) starting as process 41741 ...
2022-01-24 14:35:49 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2022-01-24 14:35:49 0 [Note] InnoDB: Number of transaction pools: 1
2022-01-24 14:35:49 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2022-01-24 14:35:49 0 [Note] InnoDB: Using liburing
2022-01-24 14:35:49 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2022-01-24 14:35:49 0 [Note] InnoDB: Completed initialization of buffer pool
2022-01-24 14:35:49 0 [Note] InnoDB: 128 rollback segments are active.
2022-01-24 14:35:49 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2022-01-24 14:35:49 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2022-01-24 14:35:49 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2022-01-24 14:35:49 0 [Note] InnoDB: 10.8.0 started; log sequence number 42173; transaction id 14
2022-01-24 14:35:49 0 [Note] InnoDB: Loading buffer pool(s) from /mnt/dd/ib_buffer_pool
2022-01-24 14:35:49 0 [Note] Plugin 'FEEDBACK' is disabled.
2022-01-24 14:35:49 0 [Note] InnoDB: Buffer pool(s) load completed at 220124 14:35:49
2022-01-24 14:35:49 0 [Note] sql/mysqld: ready for connections.
Version: '10.8.0-MariaDB'  socket: '/tmp/mysql.sock'  port: 0  Source distribution
 
$ client/mariadb -S /tmp/mysql.sock 
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 10.8.0-MariaDB Source distribution
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> show global variables like 'innodb_flush_method%';
+---------------------+----------+
| Variable_name       | Value    |
+---------------------+----------+
| innodb_flush_method | O_DIRECT |
+---------------------+----------+
1 row in set (0.002 sec)
 
$ echo $SYSBENCH
sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket=/tmp/mysql.sock --mysql-user=dan --mysql-db=test --percentile=99 --tables=2 --table_size=2000000
 
$ $SYSBENCH prepare
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
 
Creating table 'sbtest1'...
Inserting 2000000 records into 'sbtest1'
Creating a secondary index on 'sbtest1'...
Creating table 'sbtest2'...
Inserting 2000000 records into 'sbtest2'
Creating a secondary index on 'sbtest2'...
 
$   $SYSBENCH --rand-seed=42 --rand-type=uniform --max-requests=0 --time=60 --report-interval=5 --threads=2 run
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
 
Running the test with following options:
Number of threads: 2
Report intermediate results every 5 second(s)
Initializing random number generator from seed (42).
 
 
Initializing worker threads...
 
Threads started!
 
[ 5s ] thds: 2 tps: 99.77 qps: 99.77 (r/w/o: 0.00/99.77/0.00) lat (ms,99%): 30.26 err/s: 0.00 reconn/s: 0.00
[ 10s ] thds: 2 tps: 90.14 qps: 90.14 (r/w/o: 0.00/90.14/0.00) lat (ms,99%): 31.94 err/s: 0.00 reconn/s: 0.00
[ 15s ] thds: 2 tps: 98.88 qps: 98.88 (r/w/o: 0.00/98.88/0.00) lat (ms,99%): 48.34 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 2 tps: 62.60 qps: 62.60 (r/w/o: 0.00/62.60/0.00) lat (ms,99%): 89.16 err/s: 0.00 reconn/s: 0.00
 
[ 25s ] thds: 2 tps: 32.98 qps: 32.98 (r/w/o: 0.00/32.98/0.00) lat (ms,99%): 186.54 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 2 tps: 61.20 qps: 61.20 (r/w/o: 0.00/61.20/0.00) lat (ms,99%): 54.83 err/s: 0.00 reconn/s: 0.00
[ 35s ] thds: 2 tps: 65.19 qps: 65.19 (r/w/o: 0.00/65.19/0.00) lat (ms,99%): 44.98 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 2 tps: 72.01 qps: 72.01 (r/w/o: 0.00/72.01/0.00) lat (ms,99%): 41.85 err/s: 0.00 reconn/s: 0.00
[ 45s ] thds: 2 tps: 65.19 qps: 65.19 (r/w/o: 0.00/65.19/0.00) lat (ms,99%): 40.37 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 2 tps: 54.21 qps: 54.21 (r/w/o: 0.00/54.21/0.00) lat (ms,99%): 211.60 err/s: 0.00 reconn/s: 0.00
[ 55s ] thds: 2 tps: 78.00 qps: 78.00 (r/w/o: 0.00/78.00/0.00) lat (ms,99%): 38.94 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 2 tps: 64.59 qps: 64.59 (r/w/o: 0.00/64.59/0.00) lat (ms,99%): 62.19 err/s: 0.00 reconn/s: 0.00
SQL statistics:
    queries performed:
        read:                            0
        write:                           4226
        other:                           0
        total:                           4226
    transactions:                        4226   (70.39 per sec.)
    queries:                             4226   (70.39 per sec.)
    ignored errors:                      0      (0.00 per sec.)
    reconnects:                          0      (0.00 per sec.)
 
General statistics:
    total time:                          60.0297s
    total number of events:              4226
 
Latency (ms):
         min:                                    7.73
         avg:                                   28.40
         max:                                  481.22
         99th percentile:                       95.81
         sum:                               120025.38
 
Threads fairness:
    events (avg/stddev):           2113.0000/0.00
    execution time (avg/stddev):   60.0127/0.01

Comment by Daniel Black [ 2022-01-24 ]

btrfs in the above test also succeeded without direct I/O

without direct io

$ sudo losetup --direct-io=off  -f ../btrfs.blk 
 
$ sudo mount /dev/loop6  /mnt/
 
$ losetup -a
/dev/loop1: []: (/var/lib/snapd/snaps/multipass_5317.snap)
/dev/loop6: []: (/home/dan/repos/btrfs.blk)

It survived a reinstall and sysbench prepare with innodb_flush_method=O_DIRECT.

Comment by Marko Mäkelä [ 2022-03-18 ]

For MDEV-28111 (MariaDB Server 10.8.3), I tested O_DIRECT writes of the InnoDB redo log ib_logfile0 with and without O_DSYNC on 3 non-FUA devices today: a SATA HDD, SSD, and a PCIe NVMe drive.

In my test, O_DSYNC was slower on the HDD, and slightly faster on the SSD and NVMe drives. According to https://lwn.net/Articles/400541/ Linux should work correctly on devices that lack FUA support. The unsafety claim that wlad made matches the situation before 2010: https://linux-scsi.vger.kernel.narkive.com/yNnBRBPn/o-direct-and-barriers

Comment by Marko Mäkelä [ 2022-09-01 ]

MDEV-28766 (starting with MariaDB Server 10.8.4) allows O_DIRECT to be enabled or disabled on the InnoDB write-ahead log file ib_logfile0 on Linux and Microsoft Windows:

SET GLOBAL innodb_log_file_buffering=OFF;
SET GLOBAL innodb_log_file_buffering=ON;

Generated at Thu Feb 08 09:33:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.