MariaDB Server / MDEV-28909

Write performance does not scale to NVMe SSD

Details

    Description

      The figure "NVMe worse than SATA.png" shows that, with the default configuration, TPC-C benchmark performance is even worse on a Samsung 980 PRO than on a Samsung 960 EVO.

      series in figure     |  purple  |  yellow
      innodb_flush_method  |  fsync   |  O_DIRECT_NO_FSYNC
      innodb_doublewrite   |  on      |  off
      

      After reducing the frequency of fsync calls, performance returns to normal (yellow). I then debugged the issue and found that the slowdown does not originate at the application level. First, I used fio to measure the ideal limit, as shown in the figure "fio-benchmark.png".

      fio --filename=/dev/nvme2n1 --size=50g  --ioengine=[sync/libaio] --iodepth=[1/32] --numjobs=16 --rw=randwrite --buffered=0 --direct=1 --fsync=[1/0] --bs=[4k/128k] --sync=[none/sync]
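
The bracketed alternatives in the command above describe a matrix of runs. The report does not say which combinations were actually executed, so the following Python sketch simply enumerates all of them. The names `BASE`, `MATRIX` and `fio_commands` are made up for this illustration. The commands are only printed, not run, because they write to a raw block device.

```python
import itertools

# Fixed part of the command line, copied from the report.
BASE = ("fio --filename=/dev/nvme2n1 --size=50g --numjobs=16 "
        "--rw=randwrite --buffered=0 --direct=1")

# Alternatives taken from the bracketed options in the report.
# Assumption: every combination is of interest.
MATRIX = {
    "ioengine": ["sync", "libaio"],
    "iodepth": ["1", "32"],
    "fsync": ["1", "0"],
    "bs": ["4k", "128k"],
    "sync": ["none", "sync"],
}


def fio_commands(base=BASE, matrix=MATRIX):
    """Yield one fio command line per combination of option values."""
    keys = list(matrix)
    for values in itertools.product(*(matrix[k] for k in keys)):
        opts = " ".join(f"--{k}={v}" for k, v in zip(keys, values))
        yield f"{base} {opts}"


if __name__ == "__main__":
    for cmd in fio_commands():
        print(cmd)
```

The blktrace excerpt in the report corresponds to one cell of this matrix (`--bs=4k --fsync=1 --ioengine=libaio --iodepth=32`).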
      

      Then I used blktrace to debug further:

      --bs=4k --fsync=1 --ioengine=libaio --iodepth=32
       
      ==================== Device Overhead ====================
       
             DEV |       Q2G       G2I       Q2M       I2D       D2C <------ time the I/O is “active” in the driver and on the device
      ---------- | --------- --------- --------- --------- ---------
       (259, 12) |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
      ---------- | --------- --------- --------- --------- ---------
         Overall |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
      

      Using libaio with fsync severely degrades random write performance. This is a well-known problem: fsync can make libaio fall back to synchronous I/O. However, the figure "fio-benchmark.png" confirms that using O_SYNC works around my problem, yet innodb_flush_method surprisingly does not support O_SYNC (although it does support O_DSYNC). Would it be possible to add O_SYNC as another option for this parameter in a future version? Thanks!
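
To make the two durability styles being compared concrete, here is a minimal Python sketch that times "write, then fsync" against writes on a descriptor opened with O_SYNC. It is an illustration only: it uses a regular temporary file, with no O_DIRECT and no libaio, so it does not reproduce the fio measurement. If the temporary directory is on tmpfs, both variants will look equally fast. The helper name `timed_writes` is invented for this example.

```python
import os
import tempfile
import time

BLOCK = b"\0" * 4096  # one 4 KiB write, matching --bs=4k


def timed_writes(path, count, extra_flags=0, call_fsync=False):
    """Write `count` 4 KiB blocks to `path` and return the elapsed seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra_flags, 0o600)
    try:
        start = time.perf_counter()
        for i in range(count):
            os.pwrite(fd, BLOCK, i * len(BLOCK))
            if call_fsync:
                os.fsync(fd)  # separate flush request after every write
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "probe.dat")
        t_fsync = timed_writes(path, 256, call_fsync=True)
        t_osync = timed_writes(path, 256, extra_flags=os.O_SYNC)
        print(f"write+fsync: {t_fsync:.3f}s  O_SYNC: {t_osync:.3f}s")
```

With O_SYNC the durability request travels with each write, so no separate fsync call is issued after it.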

      Another request: since "devices get extremely fast, interrupt-driven work is no longer as efficient as polling for completions — a common theme that underlies the architecture of performance-oriented I/O systems", why not plan to move to io_uring in a future version? I observed the NVMe driver queues and found that with libaio the queues are almost empty, and are sometimes even used in a serialized way.

          Activity

            Tim He, can you please state the MariaDB Server version that you are using? Is it older than 10.6, where MDEV-24854 changed innodb_flush_method=O_DIRECT to be the default? If it is not older than 10.6, was it linked with liburing (MDEV-24883) or with the older libaio?

            What are your other InnoDB configuration parameters?

            Note: innodb_doublewrite=off is only safe if writes of innodb_page_size are known to be atomic. I have not seen any Linux documentation on this, but I might assume that on an SSD with a physical block size of 4096 bytes, innodb_page_size=4k could be safe to use with innodb_doublewrite=off.
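
As a rough way to check the precondition mentioned in this note, the following Python sketch reads the physical block size that Linux reports in sysfs and tests whether an aligned InnoDB page would fit inside a single physical block. The function names are invented for this illustration, and the device name `nvme2n1` is taken from the fio command in the description. Passing this check does not prove that page writes are atomic; it only mirrors the assumption stated in the note.

```python
from pathlib import Path


def physical_block_size(device, sysfs_root="/sys/block"):
    """Return the physical block size in bytes that Linux reports for `device`."""
    path = Path(sysfs_root) / device / "queue" / "physical_block_size"
    return int(path.read_text().strip())


def page_within_one_block(innodb_page_size, block_size):
    """True if a page-aligned InnoDB page never spans two physical blocks."""
    return block_size % innodb_page_size == 0


if __name__ == "__main__":
    size = physical_block_size("nvme2n1")  # device name is an assumption
    print(size, page_within_one_block(4096, size))
```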

            Also note: With 4096-byte physical block size, MariaDB Server 10.8 should yield better write performance than earlier versions. See also MDEV-28766.

            marko (Marko Mäkelä) added the comment above.
            Tim He added a comment:

            Version 10.5.13. Other parameters are unchanged.
            Nice to see that io_uring has already been adopted. In version 10.5, however, using fsync(2) with libaio causes a performance problem on my SSDs (see "fio-benchmark.png"; I have confirmed this issue on two NVMe SSDs). Since O_SYNC with libaio can also guarantee safety, could MariaDB consider adding it as an option for innodb_flush_method?


            Tim He, were you able to test the performance of MariaDB Server 10.6 or 10.8? Also, did you test a setup where the ib_logfile0 and the InnoDB data files reside on separate devices? Which writes and fsync() are we talking about?

            I had written some notes about O_DIRECT and fdatasync() or fsync() in MDEV-24854.

            In 10.8, you may also want to check MDEV-28766. In some cases, enabling O_DIRECT on the log file would reduce performance.

            marko (Marko Mäkelä) added the comment above.

            I think that it is worth running some performance tests, beyond what was already done in MDEV-24854. I think that we must cover various parameters (different working set sizes, innodb_buffer_pool_size and innodb_log_file_size) so that there will be different scenarios covering all the reasons for page writes.

            The allowed innodb_flush_method values are as follows:

            • fsync (0, SRV_FSYNC): the default before MDEV-24854: use file system cache and explicit fdatasync() or fsync()
            • O_DSYNC (1, SRV_O_DSYNC): O_DIRECT on data files, and enable O_DSYNC on the redo log
            • littlesync (2, SRV_LITTLESYNC): like O_DIRECT_NO_FSYNC, but using the file system cache (unsafe!)
            • nosync (3, SRV_NOSYNC): like littlesync, but do not invoke fsync() or fdatasync() on the log file
            • O_DIRECT (4, SRV_O_DIRECT): like fsync but bypassing the file system cache for data files (default)
            • O_DIRECT_NO_FSYNC (5, SRV_O_DIRECT_NO_FSYNC): like O_DIRECT, but do not call fsync() or fdatasync() on the data files (unsafe in some cases; see this comment in MDEV-24854)
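
The list above can be captured as a small lookup table, which may help when scripting test runs over all values. This Python sketch encodes only what the list states: the numeric codes and whether data files bypass the file system cache. The names `FlushMethod`, `DATA_FILES_BYPASS_CACHE` and `methods_using_o_direct` are invented for this illustration and are not server identifiers.

```python
from enum import IntEnum


class FlushMethod(IntEnum):
    """innodb_flush_method values with the numeric codes from the list above."""
    fsync = 0
    O_DSYNC = 1
    littlesync = 2
    nosync = 3
    O_DIRECT = 4
    O_DIRECT_NO_FSYNC = 5


# Whether data files bypass the file system cache, as read from the list above.
DATA_FILES_BYPASS_CACHE = {
    FlushMethod.fsync: False,
    FlushMethod.O_DSYNC: True,
    FlushMethod.littlesync: False,
    FlushMethod.nosync: False,
    FlushMethod.O_DIRECT: True,
    FlushMethod.O_DIRECT_NO_FSYNC: True,
}


def methods_using_o_direct():
    """Names of the values that use O_DIRECT on data files, in numeric order."""
    return [m.name for m in FlushMethod if DATA_FILES_BYPASS_CACHE[m]]


if __name__ == "__main__":
    print(methods_using_o_direct())
```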

            Starting with MDEV-28766 in 10.8, O_DIRECT can be enabled on the redo log:

            SET GLOBAL innodb_log_file_buffering=OFF;
            

            I think that we should deprecate the confusing parameter innodb_flush_method and map the existing values to some combinations of new parameters, which could be changed with SET GLOBAL while the server is running:

            • innodb_data_file_buffering (OFF (default), ON): whether to use the file system cache for data files
            • innodb_write_sync (OFF (default), ON): whether to open the log and the persistent files in O_DSYNC mode to avoid the need for explicit fdatasync() after writes

            I do not think that we need to continue supporting innodb_flush_method=nosync on the redo log. A similar effect can already be achieved with non-default values of innodb_flush_log_at_trx_commit.

            marko (Marko Mäkelä) added the comment above.

            Running performance tests should be easier by using the 4 settable Boolean parameters that were introduced in MDEV-30136, along with deprecating innodb_flush_method. That ticket describes how the 6 values of innodb_flush_method are mapped to or related to the 4 new Boolean parameters and 1 pre-existing Boolean parameter.

            marko (Marko Mäkelä) added the comment above.

            People

              axel Axel Schwenke
              Tim He Tim He
