  MariaDB Server
  MDEV-24854

Change innodb_flush_method=O_DIRECT by default

Details

    Description

      We have had innodb_use_native_aio=ON by default since the introduction of that parameter in MariaDB 5.5. However, to really benefit from the setting, the files should be opened in O_DIRECT mode, bypassing the file system cache. In this way, reads and writes can be submitted with DMA, using the InnoDB buffer pool directly, and no processor cycles need to be spent on copying data.

      The setting innodb_flush_method=O_DIRECT should be equivalent to the old default innodb_flush_method=fsync in all other respects; only the file system cache will be bypassed.

      Note: innodb_flush_method=O_DIRECT in combination with a tiny innodb_buffer_pool_size may cause a significant performance regression, because we will no longer be able to take advantage of the file system cache of the operating system kernel. The InnoDB buffer pool will completely replace it. Affected users should configure innodb_flush_method=fsync.

      This change will not affect Microsoft Windows. The default there is innodb_flush_method=unbuffered, which is roughly equivalent to O_DIRECT.
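
      For affected users, a minimal sketch of how the old behaviour can be kept (innodb_flush_method is a startup-only parameter, so it has to be set on the command line or in an option file; the binary name and datadir below are illustrative):

      # Keep the file system cache in use on a host with a very small buffer pool.
      mariadbd --innodb-flush-method=fsync --datadir=/var/lib/mysql
      # Equivalent option-file entry:
      #   [mariadbd]
      #   innodb_flush_method = fsync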

          Activity

            marko Marko Mäkelä created issue -

            marko Marko Mäkelä added a comment -

            I was planning to change the semantics of O_DIRECT on Linux, so that we would switch the file to O_DSYNC mode and avoid the fdatasync() calls. Alas, this would require larger code changes, because fcntl() cannot be used to set the O_SYNC or O_DSYNC flags on an already open file descriptor.

            On my system (Linux 5.10 kernel and ext4 file system), I did not notice any performance difference between O_DIRECT_NO_FSYNC and O_DIRECT. I tested performance using a small redo log, so that log checkpoints cause frequent page flushing. With the previous default innodb_flush_method=fsync, the throughput varied a lot and was slightly lower on average, on both libaio and liburing.
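
            As a rough sketch of the kind of test setup described above (the redo log size, datadir path and option spelling are illustrative, not the exact values that were used):

            # A deliberately small redo log forces frequent checkpoints and page flushing.
            mariadbd --no-defaults --datadir=/path/to/datadir \
              --innodb-log-file-size=96M \
              --innodb-flush-method=O_DIRECT \
              --innodb-use-native-aio=ON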

            marko Marko Mäkelä added a comment -

            I suspect that innodb_flush_method=O_DIRECT_NO_FSYNC could be safe for most cases (overwriting a non-sparse block in a file). When extending data files, InnoDB should always invoke an explicit fdatasync().

            The setting innodb_flush_method=O_DIRECT_NO_FSYNC could be unsafe when using sparse files (page_compressed tables), because, as far as I understand, such writes may require updating file system metadata.
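
            To illustrate why page_compressed tables are relevant here: their data files are sparse, so writes may involve extent metadata updates. A hedged sketch (the table, the workload and the /var/lib/mysql path are made up for the example; the SEQUENCE engine provides seq_1_to_10000):

            # Create a page_compressed table and compare apparent size with allocated size;
            # a large gap means holes were punched and the .ibd file is sparse.
            mariadb -e "CREATE TABLE test.t_pc (a INT PRIMARY KEY, b TEXT) ENGINE=InnoDB PAGE_COMPRESSED=1"
            mariadb -e "INSERT INTO test.t_pc SELECT seq, REPEAT('x', 8000) FROM seq_1_to_10000"
            du --apparent-size --block-size=1 /var/lib/mysql/test/t_pc.ibd
            du --block-size=1 /var/lib/mysql/test/t_pc.ibd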
            marko Marko Mäkelä made changes -
            Resolution Date 2021-02-20 10:05:40
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 119126 ] MariaDB v4 [ 134390 ]

            marko Marko Mäkelä added a comment -

            I found a plausible claim regarding when fdatasync() is needed after an O_DIRECT write:

            • If the file is being extended as part of the write.
            • If this is the first write after the space had been allocated with fallocate().

            These are rather rare cases, so the overhead of a no-op fdatasync() call should be relatively small. The InnoDB implementation in MariaDB does attempt to extend files using fallocate(), and falls back to actually writing NUL bytes if that operation fails.

            In both cases, some metadata has to be updated after the write of the data, so that if the operating system is killed and restarted, the file will be recovered with the correct length or block contents. File system metadata such as the length and the sector mapping of the file is not updated as part of the data write itself. I assume that when fallocate() is supported, it will quickly update the metadata to say that the data has not been initialized yet and must be 'read' as NUL bytes.
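
            The fallocate() case can be observed from the shell: on ext4, preallocated space is recorded in the metadata as "unwritten" extents until it is first written, and that state change is what still needs an fdatasync() to become durable. A small illustration (file name and size are arbitrary):

            # Preallocate space without writing data, then list the extents.
            fallocate -l 1M /tmp/prealloc.bin
            # On ext4, filefrag marks the preallocated range with the "unwritten" flag.
            filefrag -v /tmp/prealloc.bin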

            marko Marko Mäkelä added a comment -

            Starting with MDEV-14425 (MariaDB Server 10.8.1), the setting innodb_flush_method=O_DSYNC will enable O_DIRECT on data files (whose page size is at least 4096 bytes) and on log files (if the physical block size can be determined to be between 64 and 4096 bytes).

            wlad commented that on devices that do not support FUA (Force Unit Access), writes may be cached and could disappear in the event of a sudden power loss. A separate command needs to be issued to make them durable; that command should be part of fdatasync() or fsync(). In other words, innodb_flush_method=O_DSYNC might not be safe to use on some storage that does not support FUA.

            elenst reported that on a Fedora Rawhide system that uses btrfs by default, InnoDB fails to start up because of data page corruption. It appears to work correctly when started with innodb_flush_method=fsync. This would suggest a bug in the O_DIRECT implementation of btrfs. I cannot imagine any theoretical reason why O_DIRECT would not work on a copy-on-write file system like btrfs, xfs or zfs. Even if there were some technical reason, btrfs could simply return EINVAL to the fcntl() system call, like tmpfs does.
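
            For what it is worth, whether a block device advertises FUA support, and its physical block size, can be read from sysfs on reasonably recent kernels (the device name sda is only an example):

            # 1 means the device supports Force Unit Access writes, 0 means it does not.
            cat /sys/block/sda/queue/fua
            # Physical block size in bytes, relevant to the 64..4096 byte range above.
            cat /sys/block/sda/queue/physical_block_size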
            danblack Daniel Black added a comment -

            btrfs test 5.15.14-200.fc35.x86_64

            $ dd if=/dev/zero  of=../btrfs.blk bs=1M count=2K
            2048+0 records in
            2048+0 records out
            2147483648 bytes (2.1 GB, 2.0 GiB) copied, 1.50241 s, 1.4 GB/s
             
            $ sudo losetup --direct-io=on  -f ../btrfs.blk 
             
            $ sudo mkfs.btrfs /dev/loop6
            btrfs-progs v5.15.1 
            See http://btrfs.wiki.kernel.org for more information.
             
            Performing full device TRIM /dev/loop6 (2.00GiB) ...
            NOTE: several default settings have changed in version 5.15, please make sure
                  this does not affect your deployments:
                  - DUP for metadata (-m dup)
                  - enabled no-holes (-O no-holes)
                  - enabled free-space-tree (-R free-space-tree)
             
            Label:              (null)
            UUID:               c364f0a2-b9a9-4b13-a21a-7639fd896765
            Node size:          16384
            Sector size:        4096
            Filesystem size:    2.00GiB
            Block group profiles:
              Data:             single            8.00MiB
              Metadata:         DUP             102.38MiB
              System:           DUP               8.00MiB
            SSD detected:       yes
            Zoned device:       no
            Incompat features:  extref, skinny-metadata, no-holes
            Runtime features:   free-space-tree
            Checksum:           crc32c
            Number of devices:  1
            Devices:
               ID        SIZE  PATH
                1     2.00GiB  /dev/loop6
             
            $ sudo mount /dev/loop6  /mnt/
             
            $ sudo chown dan: /mnt/
             
            $ scripts/mysql_install_db --no-defaults --srcdir=$OLDPWD --builddir=$PWD --datadir=/mnt/dd
            Installing MariaDB/MySQL system tables in '/mnt/dd' ...
            OK
             
            $   sql/mysqld --no-defaults --skip-networking --datadir=/mnt/dd  --verbose
            2022-01-24 14:35:49 0 [Note] sql/mysqld (server 10.8.0-MariaDB) starting as process 41741 ...
            2022-01-24 14:35:49 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
            2022-01-24 14:35:49 0 [Note] InnoDB: Number of transaction pools: 1
            2022-01-24 14:35:49 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
            2022-01-24 14:35:49 0 [Note] InnoDB: Using liburing
            2022-01-24 14:35:49 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
            2022-01-24 14:35:49 0 [Note] InnoDB: Completed initialization of buffer pool
            2022-01-24 14:35:49 0 [Note] InnoDB: 128 rollback segments are active.
            2022-01-24 14:35:49 0 [Note] InnoDB: Creating shared tablespace for temporary tables
            2022-01-24 14:35:49 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
            2022-01-24 14:35:49 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
            2022-01-24 14:35:49 0 [Note] InnoDB: 10.8.0 started; log sequence number 42173; transaction id 14
            2022-01-24 14:35:49 0 [Note] InnoDB: Loading buffer pool(s) from /mnt/dd/ib_buffer_pool
            2022-01-24 14:35:49 0 [Note] Plugin 'FEEDBACK' is disabled.
            2022-01-24 14:35:49 0 [Note] InnoDB: Buffer pool(s) load completed at 220124 14:35:49
            2022-01-24 14:35:49 0 [Note] sql/mysqld: ready for connections.
            Version: '10.8.0-MariaDB'  socket: '/tmp/mysql.sock'  port: 0  Source distribution
             
            $ client/mariadb -S /tmp/mysql.sock 
            Welcome to the MariaDB monitor.  Commands end with ; or \g.
            Your MariaDB connection id is 3
            Server version: 10.8.0-MariaDB Source distribution
             
            Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
             
            Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
             
            MariaDB [(none)]> show global variables like 'innodb_flush_method%';
            +---------------------+----------+
            | Variable_name       | Value    |
            +---------------------+----------+
            | innodb_flush_method | O_DIRECT |
            +---------------------+----------+
            1 row in set (0.002 sec)
             
            $ echo $SYSBENCH
            sysbench /usr/share/sysbench/oltp_update_index.lua --mysql-socket=/tmp/mysql.sock --mysql-user=dan --mysql-db=test --percentile=99 --tables=2 --table_size=2000000
             
            $ $SYSBENCH prepare
            sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
             
            Creating table 'sbtest1'...
            Inserting 2000000 records into 'sbtest1'
            Creating a secondary index on 'sbtest1'...
            Creating table 'sbtest2'...
            Inserting 2000000 records into 'sbtest2'
            Creating a secondary index on 'sbtest2'...
             
            $   $SYSBENCH --rand-seed=42 --rand-type=uniform --max-requests=0 --time=60 --report-interval=5 --threads=2 run
            sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)
             
            Running the test with following options:
            Number of threads: 2
            Report intermediate results every 5 second(s)
            Initializing random number generator from seed (42).
             
             
            Initializing worker threads...
             
            Threads started!
             
            [ 5s ] thds: 2 tps: 99.77 qps: 99.77 (r/w/o: 0.00/99.77/0.00) lat (ms,99%): 30.26 err/s: 0.00 reconn/s: 0.00
            [ 10s ] thds: 2 tps: 90.14 qps: 90.14 (r/w/o: 0.00/90.14/0.00) lat (ms,99%): 31.94 err/s: 0.00 reconn/s: 0.00
            [ 15s ] thds: 2 tps: 98.88 qps: 98.88 (r/w/o: 0.00/98.88/0.00) lat (ms,99%): 48.34 err/s: 0.00 reconn/s: 0.00
            [ 20s ] thds: 2 tps: 62.60 qps: 62.60 (r/w/o: 0.00/62.60/0.00) lat (ms,99%): 89.16 err/s: 0.00 reconn/s: 0.00
             
            [ 25s ] thds: 2 tps: 32.98 qps: 32.98 (r/w/o: 0.00/32.98/0.00) lat (ms,99%): 186.54 err/s: 0.00 reconn/s: 0.00
            [ 30s ] thds: 2 tps: 61.20 qps: 61.20 (r/w/o: 0.00/61.20/0.00) lat (ms,99%): 54.83 err/s: 0.00 reconn/s: 0.00
            [ 35s ] thds: 2 tps: 65.19 qps: 65.19 (r/w/o: 0.00/65.19/0.00) lat (ms,99%): 44.98 err/s: 0.00 reconn/s: 0.00
            [ 40s ] thds: 2 tps: 72.01 qps: 72.01 (r/w/o: 0.00/72.01/0.00) lat (ms,99%): 41.85 err/s: 0.00 reconn/s: 0.00
            [ 45s ] thds: 2 tps: 65.19 qps: 65.19 (r/w/o: 0.00/65.19/0.00) lat (ms,99%): 40.37 err/s: 0.00 reconn/s: 0.00
            [ 50s ] thds: 2 tps: 54.21 qps: 54.21 (r/w/o: 0.00/54.21/0.00) lat (ms,99%): 211.60 err/s: 0.00 reconn/s: 0.00
            [ 55s ] thds: 2 tps: 78.00 qps: 78.00 (r/w/o: 0.00/78.00/0.00) lat (ms,99%): 38.94 err/s: 0.00 reconn/s: 0.00
            [ 60s ] thds: 2 tps: 64.59 qps: 64.59 (r/w/o: 0.00/64.59/0.00) lat (ms,99%): 62.19 err/s: 0.00 reconn/s: 0.00
            SQL statistics:
                queries performed:
                    read:                            0
                    write:                           4226
                    other:                           0
                    total:                           4226
                transactions:                        4226   (70.39 per sec.)
                queries:                             4226   (70.39 per sec.)
                ignored errors:                      0      (0.00 per sec.)
                reconnects:                          0      (0.00 per sec.)
             
            General statistics:
                total time:                          60.0297s
                total number of events:              4226
             
            Latency (ms):
                     min:                                    7.73
                     avg:                                   28.40
                     max:                                  481.22
                     99th percentile:                       95.81
                     sum:                               120025.38
             
            Threads fairness:
                events (avg/stddev):           2113.0000/0.00
                execution time (avg/stddev):   60.0127/0.01
            

            danblack Daniel Black added a comment -

            btrfs in the above test also succeeded without direct I/O on the loop device (losetup --direct-io=off).

            Without direct I/O:

            $ sudo losetup --direct-io=off  -f ../btrfs.blk 
             
            $ sudo mount /dev/loop6  /mnt/
             
            $ losetup -a
            /dev/loop1: []: (/var/lib/snapd/snaps/multipass_5317.snap)
            /dev/loop6: []: (/home/dan/repos/btrfs.blk)
            

            The setup survived a reinstall and sysbench prepare with innodb_flush_method=O_DIRECT.


            marko Marko Mäkelä added a comment -

            For MDEV-28111 (MariaDB Server 10.8.3), I tested O_DIRECT writes of the InnoDB redo log ib_logfile0 with and without O_DSYNC on 3 non-FUA devices today: a SATA HDD, an SSD, and a PCIe NVMe drive.

            In my test, O_DSYNC was slower on the HDD, and slightly faster on the SSD and NVMe drives. According to https://lwn.net/Articles/400541/ Linux should work correctly on devices that lack FUA support. The unsafety claim that wlad made matches the situation before 2010: https://linux-scsi.vger.kernel.narkive.com/yNnBRBPn/o-direct-and-barriers
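
            A rough way to reproduce this kind of comparison outside the server is with fio; the invocations below are only an illustrative sketch, not the benchmark that was actually run, and the file path is made up:

            # O_DIRECT writes with an fdatasync() after every write:
            fio --name=direct-fdatasync --filename=/path/to/testfile --size=1G \
                --rw=write --bs=4k --direct=1 --fdatasync=1 --runtime=60 --time_based
            # O_DIRECT combined with O_DSYNC (recent fio accepts --sync=dsync; older
            # versions only support --sync=1, which uses O_SYNC instead):
            fio --name=direct-dsync --filename=/path/to/testfile --size=1G \
                --rw=write --bs=4k --direct=1 --sync=dsync --runtime=60 --time_based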

            marko Marko Mäkelä added a comment -

            MDEV-28766 (starting with MariaDB Server 10.8.4) allows O_DIRECT to be enabled or disabled on the InnoDB write-ahead log file ib_logfile0 on Linux and Microsoft Windows:

            SET GLOBAL innodb_log_file_buffering=OFF;
            SET GLOBAL innodb_log_file_buffering=ON;
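
            Whether the setting took effect can be checked from the variable itself; on Linux the open flags of ib_logfile0 are also visible in /proc, where the O_DIRECT bit is 0040000 (octal) on x86-64. A sketch, assuming the server binary is named mariadbd:

            mariadb -e "SHOW GLOBAL VARIABLES LIKE 'innodb_log_file_buffering'"
            # Locate the file descriptor of ib_logfile0 and inspect its open flags.
            pid=$(pidof mariadbd)
            ls -l /proc/$pid/fd | grep ib_logfile0
            # Then look at the "flags:" line of /proc/$pid/fdinfo/<fd-number-from-above>.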

            marko Marko Mäkelä added a comment -

            At least on bcachefs, the use of fcntl(fd, F_SETFL, O_DIRECT) can lead to data corruption. In MDEV-33379 (to be released as part of MariaDB Server 10.11.8 and 11.0.6) the code was refactored so that instead of invoking fcntl(2), we pass the O_DIRECT flag to the open(2) system call.
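
            One way to observe the difference is to trace the relevant system calls during startup: before the change the O_DIRECT flag was applied to already open files with fcntl(F_SETFL), while afterwards it appears directly in the openat() calls. The invocation below is only an illustrative sketch with a made-up datadir:

            strace -f -e trace=openat,fcntl -o /tmp/mariadbd.strace \
                mariadbd --no-defaults --datadir=/path/to/datadir
            # Data files and the redo log should now show O_DIRECT at open time.
            grep -E 'ib_logfile0|\.ibd' /tmp/mariadbd.strace | grep O_DIRECT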

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes: 0
              Watchers: 3

