[MDEV-28909] Write performance not scale to NVMe SSD Created: 2022-06-20  Updated: 2023-10-05

Status: Stalled
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.5
Fix Version/s: 10.5

Type: Bug Priority: Major
Reporter: Tim He Assignee: Axel Schwenke
Resolution: Unresolved Votes: 0
Labels: performance
Environment:

Linux kernel 5.4.0


Attachments: PNG File NVMe worse than SATA.png     PNG File fio-benchmark.png    
Issue Links:
Relates
relates to MDEV-29343 MariaDB 10.6.x slower mysqldump etc. Closed
relates to MDEV-30136 Map innodb_flush_method to new settab... Closed
relates to MDEV-24854 Change innodb_flush_method=O_DIRECT b... Closed

 Description   

The figure "NVMe worse than SATA.png" shows that, with the default configuration, TPC-C benchmark performance is even worse on the Samsung 980 pro than on the Samsung 960 evo.

innodb_flush_method = fsync   |   O_DIRECT_NO_FSYNC
innodb_doublewrite  = on      |   off 
                       ↑      |    ↑
                    purple    |  yellow

After reducing the frequency of fsync calls, performance returns to normal (yellow). I then debugged the issue and found that the slowness does not originate at the application level. First, I used fio to benchmark the ideal limit, as shown in the figure "fio-benchmark.png":

fio --filename=/dev/nvme2n1 --size=50g  --ioengine=[sync/libaio] --iodepth=[1/32] --numjobs=16 --rw=randwrite --buffered=0 --direct=1 --fsync=[1/0] --bs=[4k/128k] --sync=[none/sync]
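
The bracketed alternatives in the command above describe a matrix of runs. As a rough sketch, the following Python expands that matrix into concrete command lines. It uses only the options and values shown in the report; whether every combination was actually run is an assumption (for example, iodepth=32 has little meaning with the sync engine). The script only prints the commands and does not execute them, because a random-write job against a raw device destroys the data on it.

```python
from itertools import product

# Options that stay fixed in the command from the report.
FIXED = [
    "--filename=/dev/nvme2n1", "--size=50g", "--numjobs=16",
    "--rw=randwrite", "--buffered=0", "--direct=1",
]

# Options the report varies, with the alternatives listed in brackets.
VARIED = {
    "--ioengine": ["sync", "libaio"],
    "--iodepth": ["1", "32"],
    "--fsync": ["1", "0"],
    "--bs": ["4k", "128k"],
    "--sync": ["none", "sync"],
}


def fio_commands():
    """Return one fio command line (as a string) per combination."""
    names = list(VARIED)
    commands = []
    for values in product(*(VARIED[name] for name in names)):
        varied = [f"{name}={value}" for name, value in zip(names, values)]
        commands.append(" ".join(["fio", *FIXED, *varied]))
    return commands


if __name__ == "__main__":
    # Print only; running these against a raw device overwrites its contents.
    for command in fio_commands():
        print(command)
```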

Then I used blktrace to debug further:

--bs=4k --fsync=1 --ioengine=libaio --iodepth=32
 
==================== Device Overhead ====================
 
       DEV |       Q2G       G2I       Q2M       I2D       D2C <------ time the I/O is “active” in the driver and on the device
---------- | --------- --------- --------- --------- ---------
 (259, 12) |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
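
To make the reading of this table explicit, here is a small Python sketch that parses one data row and reports the dominant phase. The column names are taken from the header above; the helper name `parse_overhead_row` is hypothetical and the parser assumes exactly the layout shown here.

```python
def parse_overhead_row(line):
    """Parse one data row of the 'Device Overhead' table into a dict."""
    columns = ["Q2G", "G2I", "Q2M", "I2D", "D2C"]
    device, _, rest = line.partition("|")
    # Each remaining token is a percentage such as '92.6055%'.
    values = [float(token.rstrip("%")) for token in rest.split()]
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} percentages, got {len(values)}")
    return device.strip(), dict(zip(columns, values))


row = " (259, 12) |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%"
device, overhead = parse_overhead_row(row)
dominant = max(overhead, key=overhead.get)
print(device, dominant, overhead[dominant])
```

For the row above, the dominant phase is D2C, the time the request spends in the driver and on the device, which matches the report's point that the time is not being spent in the application.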

Using libaio with fsync severely degrades random write performance. This is a well-known problem: fsync can make libaio fall back to synchronous I/O. However, the figure "fio-benchmark.png" confirms that using O_SYNC works around my problem, yet innodb_flush_method surprisingly does not support O_SYNC (even though it supports O_DSYNC). Would it be possible to add such an option for this parameter in a future version? Thanks!
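
The two durability strategies being compared can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses plain synchronous write() calls on a temporary file, with neither libaio nor O_DIRECT, so its timings are not comparable to the fio results. It shows only the difference between calling fsync() after each write and opening the file with O_SYNC. The helper names and the block count are hypothetical.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 4096  # 4 KiB, matching --bs=4k in the report


def write_with_fsync(path, blocks):
    """Plain open, then one fsync() after every write()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(blocks):
            os.write(fd, BLOCK)
            os.fsync(fd)
    finally:
        os.close(fd)


def write_with_o_sync(path, blocks):
    """Open with O_SYNC so that each write() returns only once it is durable."""
    # Assumption: on a platform without O_SYNC the flag falls back to 0,
    # which degrades to plain writes that are NOT durable.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        for _ in range(blocks):
            os.write(fd, BLOCK)
    finally:
        os.close(fd)


def timed(fn, blocks=64):
    """Run fn against a temporary file; return (seconds, bytes on disk)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "probe.bin")
        start = time.perf_counter()
        fn(path, blocks)
        elapsed = time.perf_counter() - start
        return elapsed, os.path.getsize(path)


if __name__ == "__main__":
    for fn in (write_with_fsync, write_with_o_sync):
        seconds, size = timed(fn)
        print(f"{fn.__name__}: {size} bytes in {seconds:.4f}s")
```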

Another request: since "devices get extremely fast, interrupt-driven work is no longer as efficient as polling for completions — a common theme that underlies the architecture of performance-oriented I/O systems", why not plan to move to io_uring in a future version? I observed the NVMe driver queues and found that with libaio the queues are almost empty, and are sometimes even used in a serialized way.



 Comments   
Comment by Marko Mäkelä [ 2022-06-20 ]

Tim He, can you please state the MariaDB Server version that you are using? Is it older than 10.6, where MDEV-24854 changed innodb_flush_method=O_DIRECT to be the default? If it is not older than 10.6, was it linked with liburing (MDEV-24883) or with the older libaio?

What are your other InnoDB configuration parameters?

Note: innodb_doublewrite=off is only safe if writes of innodb_page_size are known to be atomic. I have not seen any Linux documentation on this, but I might assume that on an SSD with a physical block size of 4096 bytes, innodb_page_size=4k could be safe to use with innodb_doublewrite=off.
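
As a hedged illustration of that assumption, the Python sketch below reads the physical block size that Linux exposes in sysfs and compares it with innodb_page_size. The function names are hypothetical, and the equality check merely restates the assumption in the note above; it does not prove that the device writes pages atomically.

```python
from pathlib import Path


def physical_block_size(device, sysfs_root="/sys/block"):
    """Read the physical block size in bytes that Linux reports for a device."""
    path = Path(sysfs_root) / device / "queue" / "physical_block_size"
    return int(path.read_text().strip())


def doublewrite_off_might_be_safe(block_size, innodb_page_size):
    """Heuristic only: the page size equals the physical block size.

    This encodes the assumption from the note above and is not a guarantee
    that page writes are atomic on any particular device.
    """
    return block_size == innodb_page_size


if __name__ == "__main__":
    # 'nvme2n1' is the device name used in the fio command of this report.
    size = physical_block_size("nvme2n1")
    print(size, doublewrite_off_might_be_safe(size, 4096))
```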

Also note: With 4096-byte physical block size, MariaDB Server 10.8 should yield better write performance than earlier versions. See also MDEV-28766.

Comment by Tim He [ 2022-06-22 ]

Version 10.5.13. Other parameters are unchanged.
Nice to see that io_uring has already been adopted. In version 10.5, however, using fsync(2) + libaio can cause performance problems on my SSDs (see "fio-benchmark.png"; I have confirmed this issue on two NVMe SSDs). Since O_SYNC + libaio can also guarantee safety, could MariaDB consider adding it as an option for innodb_flush_method?

Comment by Marko Mäkelä [ 2022-08-26 ]

Tim He, were you able to test the performance of MariaDB Server 10.6 or 10.8? Also, did you test a setup where the ib_logfile0 and the InnoDB data files reside on separate devices? Which writes and fsync() are we talking about?

I had written some notes about O_DIRECT and fdatasync() or fsync() in MDEV-24854.

In 10.8, you may also want to check MDEV-28766. In some cases, enabling O_DIRECT on the log file would reduce performance.

Comment by Marko Mäkelä [ 2022-11-01 ]

I think that it is worth running some performance tests, beyond what was already done in MDEV-24854. I think that we must cover various parameters (different working set sizes, innodb_buffer_pool_size and innodb_log_file_size) so that there will be different scenarios covering all the reasons for page writes.

The allowed innodb_flush_method values are as follows:

  • fsync (0, SRV_FSYNC): the default before MDEV-24854: use file system cache and explicit fdatasync() or fsync()
  • O_DSYNC (1, SRV_O_DSYNC): O_DIRECT on data files, and enable O_DSYNC on the redo log
  • littlesync (2, SRV_LITTLESYNC): like O_DIRECT_NO_FSYNC, but using the file system cache (unsafe!)
  • nosync (3, SRV_NOSYNC): like littlesync, but do not invoke fsync() or fdatasync() on the log file
  • O_DIRECT (4, SRV_O_DIRECT): like fsync but bypassing the file system cache for data files (default)
  • O_DIRECT_NO_FSYNC (5, SRV_O_DIRECT_NO_FSYNC): like O_DIRECT, but do not call fsync() or fdatasync() on the data files (unsafe in some cases; see this comment in MDEV-24854)
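
For quick reference, the list above can be restated as a small Python lookup table. This is only a reading of the list, not code from the server: the class and field names are hypothetical, and the value None marks a property that the list does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlushMethod:
    number: int
    enum_name: str
    data_files_bypass_cache: bool       # True means O_DIRECT on data files
    data_files_synced: Optional[bool]   # None: not stated in the list above


FLUSH_METHODS = {
    "fsync": FlushMethod(0, "SRV_FSYNC", False, True),
    "O_DSYNC": FlushMethod(1, "SRV_O_DSYNC", True, None),
    "littlesync": FlushMethod(2, "SRV_LITTLESYNC", False, False),
    "nosync": FlushMethod(3, "SRV_NOSYNC", False, False),
    "O_DIRECT": FlushMethod(4, "SRV_O_DIRECT", True, True),
    "O_DIRECT_NO_FSYNC": FlushMethod(5, "SRV_O_DIRECT_NO_FSYNC", True, False),
}


def methods_without_data_sync():
    """Names of the values that skip fsync()/fdatasync() on the data files."""
    return sorted(
        name
        for name, method in FLUSH_METHODS.items()
        if method.data_files_synced is False
    )


if __name__ == "__main__":
    print(methods_without_data_sync())
```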

Starting with MDEV-28766 in 10.8, O_DIRECT can be enabled on the redo log:

SET GLOBAL innodb_log_file_buffering=OFF;

I think that we should deprecate the confusing parameter innodb_flush_method and map the existing values to some combinations of new parameters, which could be changed with SET GLOBAL while the server is running:

  • innodb_data_file_buffering (OFF (default), ON): whether to use the file system cache for data files
  • innodb_write_sync (OFF (default), ON): whether to open the log and the persistent files in O_DSYNC mode to avoid the need for explicit fdatasync() after writes

I do not think that we need to continue supporting innodb_flush_method=nosync on the redo log. A similar effect can already be achieved with non-default values of innodb_flush_log_at_trx_commit.

Comment by Marko Mäkelä [ 2023-01-05 ]

Running performance tests should be easier by using the 4 settable Boolean parameters that were introduced in MDEV-30136, along with deprecating innodb_flush_method. That ticket describes how the 6 values of innodb_flush_method are mapped to or related to the 4 new Boolean parameters and 1 pre-existing Boolean parameter.
