MDEV-39297

innodb_flush_method=O_DIRECT_NO_FSYNC performance regression remains

Details

    • Related to performance

    Description

      MDEV-33545 introduced innodb_doublewrite=fast, which aims for performance similar to that of the deprecated setting innodb_flush_method=O_DIRECT_NO_FSYNC, which maps to it. As we can read in the commit message and https://github.com/MariaDB/server/pull/3091, there are some differences:

      The value innodb_doublewrite=fast differs from the previous combination of innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always invoking os_file_flush() on the doublewrite buffer itself in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer when there are multiple doublewrite batches between checkpoints. Typically, once per second, buf_flush_page_cleaner() would write out up to innodb_io_capacity pages and advance the log checkpoint. Also typically, innodb_io_capacity>128, which is the size of the doublewrite buffer in pages. Should os_file_flush_func() not be invoked between doublewrite batches, writes could be reordered in an unsafe way.

      As mdcallag recently pointed out in MDEV-33545 as well as in his blog post, a performance gap remains.

      There are limitations to what can be safely done here.

      If the InnoDB write-ahead log and the system tablespace are located on the same storage device (both are normally stored in the datadir root) and if we are running on a suitable file system such as ext4, then we might omit the fdatasync() on the doublewrite buffer if there has been another fdatasync() to the same file system since the last write of a checkpoint. This is because fdatasync() on ext4 should make all pending writes to that file system durable, including writes to files other than the one it is invoked on.
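
      The "same storage device" precondition could be tested with POSIX stat(), as in the minimal sketch below. The helper name same_filesystem() is hypothetical and not an InnoDB function. It compares only the device IDs of two paths; it does not check that the file system is ext4 or otherwise "suitable", which would need a separate test.

```cpp
#include <sys/stat.h>

// Hypothetical helper (not part of InnoDB): returns true only if both
// paths can be examined and report the same device ID (st_dev).
static bool same_filesystem(const char *path_a, const char *path_b)
{
  struct stat a, b;
  if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
    return false;  // unknown: be conservative and do not skip any flush
  return a.st_dev == b.st_dev;
}
```

      With default file names, this might be called on the redo log and the system tablespace, for example same_filesystem("ib_logfile0", "ibdata1") from within the datadir. Returning false on any error keeps the existing fdatasync() behaviour whenever the answer is uncertain.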

      After the fix of MDEV-38968, log checkpoints are only invoked by buf_flush_page_cleaner(). However, page writes can still be initiated by multiple threads, and we can't remove or replace buf_flush_list_space() easily, because the throttling mechanism of innodb_encryption_threads depends on it. Likewise, log_write_up_to() can be invoked from pretty much any thread.

      A simple change could be to introduce an atomic flag that would be cleared upon completion of any os_file_flush() to the file system where the doublewrite buffer resides. This would allow us to avoid redundant calls in the doublewrite code path:

      diff --git a/storage/innobase/buf/buf0dblwr.cc b/storage/innobase/buf/buf0dblwr.cc
      index fb4a7bc5d99..6818ad8190c 100644
      --- a/storage/innobase/buf/buf0dblwr.cc
      +++ b/storage/innobase/buf/buf0dblwr.cc
      @@ -721,7 +721,8 @@ void buf_dblwr_t::flush_buffered_writes_completed(const IORequest &request)
         log_checkpoint(). Writes to the system tablespace should be rare,
         except when executing DDL or using the non-default settings
         innodb_file_per_table=OFF or innodb_undo_tablespaces=0. */
      -  os_file_flush(request.node->handle);
      +  if (flush_needed.test_and_set())
      +    os_file_flush(request.node->handle);
       
         /* The writes have been flushed to disk now and in recovery we will
         find them in the doublewrite buffer blocks. Next, write the data pages. */
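
      The intended flag semantics can be illustrated with a toy model, shown below. It is only a sketch under stated assumptions: simulated_file_flush(), doublewrite_batch_completed(), flush_calls and demo() are hypothetical stand-ins, and the model covers just the two rules from the text (any completed flush clears the flag; the doublewrite path flushes only if test_and_set() finds the flag already set). Where else the flag would have to be set in the real server is not addressed here.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical flag from the sketch: set = the doublewrite buffer needs a flush.
static std::atomic_flag flush_needed = ATOMIC_FLAG_INIT;
static unsigned flush_calls = 0;  // number of simulated fdatasync() calls

// Stand-in for os_file_flush() on the file system that holds the
// doublewrite buffer. Per the proposal, completing it clears the flag.
static void simulated_file_flush()
{
  ++flush_calls;  // the real function would invoke fdatasync() here
  flush_needed.clear(std::memory_order_release);
}

// Stand-in for the patched check in flush_buffered_writes_completed().
// Returns true if this doublewrite batch had to flush.
static bool doublewrite_batch_completed()
{
  // test_and_set() returns the previous value and leaves the flag set.
  if (!flush_needed.test_and_set(std::memory_order_acq_rel))
    return false;  // a flush completed since the previous batch: skip
  simulated_file_flush();
  return true;
}

// Walks through one possible interleaving and returns the flush count.
static unsigned demo()
{
  flush_calls = 0;
  flush_needed.test_and_set();                  // start conservatively
  bool first = doublewrite_batch_completed();   // flag was set: flush #1
  simulated_file_flush();                       // e.g. a log flush: flush #2
  bool second = doublewrite_batch_completed();  // flag was clear: skipped
  bool third = doublewrite_batch_completed();   // nothing in between: flush #3
  assert(first && !second && third);
  return flush_calls;
}
```

      In this model, a batch that directly follows a completed flush is skipped, and the skipped batch re-arms the flag, so the next batch flushes unless another flush completes first.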
      

            People

              Assignee: Unassigned
              Reporter: Marko Mäkelä (marko)