MariaDB Server / MDEV-28909

Write performance does not scale to NVMe SSDs


Details

    Description

      Figure "NVMe worse than SATA.png" shows that, with the default configuration, TPC-C benchmark performance is even worse on a Samsung 980 PRO than on a Samsung 960 EVO.

      Setting               purple curve   yellow curve
      innodb_flush_method   fsync          O_DIRECT_NO_FSYNC
      innodb_doublewrite    on             off
      

      After reducing how often fsync is called, performance returns to normal (yellow). I then debugged the issue and found that the slowness does not originate at the application level. First, I used fio to measure the ideal limit, shown in figure "fio-benchmark.png".

      fio --filename=/dev/nvme2n1 --size=50g  --ioengine=[sync/libaio] --iodepth=[1/32] --numjobs=16 --rw=randwrite --buffered=0 --direct=1 --fsync=[1/0] --bs=[4k/128k] --sync=[none/sync]
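To make the two durability variants of the fio command concrete, here is a minimal Python sketch (Linux/Unix only) of the write patterns being compared: `--fsync=1` issues an fsync after every write, while `--sync=sync` opens the target with O_SYNC so each write is synchronous by itself. Assumptions: it writes to a small scratch file rather than the raw device, uses buffered I/O (no O_DIRECT, which needs aligned buffers), and the function name `random_sync_writes` is hypothetical. It is an illustration of the access pattern, not a replacement for fio.

```python
import os
import random
import tempfile
import time

BLOCK = 4096          # matches fio --bs=4k
FILE_SIZE = 4 << 20   # 4 MiB scratch file (the fio run used --size=50g on the device)


def random_sync_writes(path, mode, count=256, seed=0):
    """Issue `count` random 4 KiB writes and make each one durable.

    mode="fsync"  -> plain open, os.fsync() after every write (like fio --fsync=1)
    mode="o_sync" -> open with O_SYNC, no explicit flush (like fio --sync=sync)
    Returns the elapsed wall-clock time in seconds.
    """
    flags = os.O_WRONLY | os.O_CREAT
    if mode == "o_sync":
        flags |= os.O_SYNC
    elif mode != "fsync":
        raise ValueError(f"unknown mode: {mode}")

    rng = random.Random(seed)
    buf = os.urandom(BLOCK)
    fd = os.open(path, flags, 0o600)
    try:
        os.ftruncate(fd, FILE_SIZE)  # preallocate so offsets stay inside the file
        start = time.perf_counter()
        for _ in range(count):
            offset = rng.randrange(FILE_SIZE // BLOCK) * BLOCK
            os.pwrite(fd, buf, offset)
            if mode == "fsync":
                os.fsync(fd)  # separate flush syscall after each write
        return time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "scratch.bin")
        for mode in ("fsync", "o_sync"):
            print(f"{mode:7s} {random_sync_writes(path, mode):.3f} s")
```

Timings on a page-cached scratch file will not reproduce the device-level numbers in "fio-benchmark.png"; fio against the device remains the tool for real measurements.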
      

      Then I used blktrace to investigate further:

      --bs=4k --fsync=1 --ioengine=libaio --iodepth=32
       
      ==================== Device Overhead ====================
       
             DEV |       Q2G       G2I       Q2M       I2D       D2C <------ time the I/O is “active” in the driver and on the device
      ---------- | --------- --------- --------- --------- ---------
       (259, 12) |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
      ---------- | --------- --------- --------- --------- ---------
         Overall |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
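To help read the blktrace breakdown: btt names each phase by the events that bound it (Q = queued, G = get request, I = inserted, M = merged, D = dispatched to the driver, C = completed), and, as we read btt's documentation, the Device Overhead section gives each phase as a percentage of total queue-to-completion (Q2C) time. The small Python sketch below parses the table above and reports the dominant phase; `parse_device_overhead` and `PHASES` are hypothetical helpers, not part of blktrace.

```python
# The btt "Device Overhead" table from above (trailing annotation dropped).
BTT_OUTPUT = """\
       DEV |       Q2G       G2I       Q2M       I2D       D2C
---------- | --------- --------- --------- --------- ---------
 (259, 12) |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
---------- | --------- --------- --------- --------- ---------
   Overall |   0.0158%   0.0000%   0.0007%   0.0000%  92.6055%
"""

# What each btt phase measures.
PHASES = {
    "Q2G": "queued -> request allocated",
    "G2I": "request allocated -> inserted into the request queue",
    "Q2M": "queued -> merged with an existing request",
    "I2D": "inserted -> dispatched to the driver",
    "D2C": "dispatched -> completed (driver + device time)",
}


def parse_device_overhead(text):
    """Return {row label: {phase: percent}} for a btt Device Overhead table."""
    # Keep the header and data rows; drop the dashed separator lines.
    rows = [ln for ln in text.splitlines()
            if "|" in ln and not ln.lstrip().startswith("-")]
    header, *data = rows
    phases = header.split("|")[1].split()
    table = {}
    for row in data:
        label, values = row.split("|")
        table[label.strip()] = {
            phase: float(value.rstrip("%"))
            for phase, value in zip(phases, values.split())
        }
    return table


if __name__ == "__main__":
    overall = parse_device_overhead(BTT_OUTPUT)["Overall"]
    phase = max(overall, key=overall.get)
    print(f"dominant phase: {phase} {overall[phase]}% ({PHASES[phase]})")
```

For this trace D2C dominates, which matches the annotation in the table: almost all of the time is spent in the driver and on the device, not queued in the block layer.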
      

      Using libaio with fsync dramatically degrades random write performance. This is a well-known problem: fsync can make libaio fall back to synchronous I/O. However, figure "fio-benchmark.png" confirms that using O_SYNC works around my problem, yet innodb_flush_method surprisingly does not support O_SYNC (even though it does support O_DSYNC). Could another option be added to this parameter in a future version? Thanks!
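For context on the O_SYNC request: per the Linux open(2) semantics, O_SYNC makes each write return only after the data and all associated file metadata are durable (as if each write were followed by fsync), while O_DSYNC only guarantees the data plus the metadata needed to read it back (as if followed by fdatasync). The Python sketch below (Linux/Unix) puts the two open-flag variants next to their explicit-flush equivalents. The names `DURABILITY_MODES` and `durable_write` are hypothetical, and this is not how InnoDB implements innodb_flush_method.

```python
import os
import tempfile

# mode -> (extra open(2) flags, explicit flush to call after each write).
# The O_* entries rely on the kernel to make each write durable before
# pwrite() returns; the others flush explicitly.
DURABILITY_MODES = {
    "fsync":     (0,          os.fsync),      # data + all metadata, explicit
    "fdatasync": (0,          os.fdatasync),  # data + metadata needed to read it
    "O_SYNC":    (os.O_SYNC,  None),          # like write() followed by fsync()
    "O_DSYNC":   (os.O_DSYNC, None),          # like write() followed by fdatasync()
}


def durable_write(path, data, offset, mode):
    """Write `data` at `offset` so it is durable before returning."""
    extra_flags, flush = DURABILITY_MODES[mode]
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra_flags, 0o600)
    try:
        written = os.pwrite(fd, data, offset)
        if flush is not None:
            flush(fd)
        return written
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "demo.bin")
        for i, mode in enumerate(DURABILITY_MODES):
            n = durable_write(path, b"x" * 4096, i * 4096, mode)
            print(f"{mode:9s} wrote {n} bytes")
```

The O_SYNC and O_DSYNC variants save one syscall per write compared with an explicit flush; whether that helps on a given device is what the fio comparison measures, not something this sketch can show.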

      Another request, prompted by the observation that "devices get extremely fast, interrupt-driven work is no longer as efficient as polling for completions — a common theme that underlies the architecture of performance-oriented I/O systems": could a move to io_uring be planned for a future version? I observed the NVMe driver queues and found that with libaio they are almost empty, and are sometimes even used serially.


            People

              Axel Schwenke (axel)
              Tim He
              Votes: 0
              Watchers: 7
