[MDEV-39297] innodb_flush_method=O_DIRECT_NO_FSYNC performance regression remains - Jira

Details

Type: Bug
Status: Confirmed
Priority: Major
Resolution: Unresolved
Affects Version/s: 11.4, 11.8, 12.3
Fix Version/s: 11.4, 11.8, 12.3
Component/s: Storage Engine - InnoDB
Labels:
- performance

Bug Category:
Related to performance

Description

~~MDEV-33545~~ introduced innodb_doublewrite=fast, which aims for performance similar to that of the deprecated setting innodb_flush_method=O_DIRECT_NO_FSYNC that maps to it. As the commit message and https://github.com/MariaDB/server/pull/3091 explain, there are some differences:

The value innodb_doublewrite=fast differs from the previous combination of innodb_doublewrite=ON and innodb_flush_method=O_DIRECT_NO_FSYNC by always invoking os_file_flush() on the doublewrite buffer itself in buf_dblwr_t::flush_buffered_writes_completed(). This should be safer when there are multiple doublewrite batches between checkpoints. Typically, once per second, buf_flush_page_cleaner() would write out up to innodb_io_capacity pages and advance the log checkpoint. Also typically, innodb_io_capacity>128, which is the size of the doublewrite buffer in pages. Should os_file_flush_func() not be invoked between doublewrite batches, writes could be reordered in an unsafe way.

As mdcallag recently pointed out in ~~MDEV-33545~~ as well as in his blog post, a performance gap remains.

There are limitations to what can be safely done here.

If the InnoDB write-ahead log and the system tablespace are located on the same storage device (both are normally stored in the datadir root) and if we are running on a suitable file system such as ext4, then we might omit the fdatasync() on the doublewrite buffer if there has been another fdatasync() on the same file system since the last write of a checkpoint. This is because fdatasync() on ext4 should make all pending writes to that file system durable, including writes to files other than the one it is invoked on.
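The "same file system" precondition above could be checked roughly as follows. This is only a sketch: same_file_system() is a hypothetical helper, not an InnoDB function, and comparing st_dev from POSIX stat() is an assumption about how the check might be made. It says nothing about whether the file system actually gives the ext4-like fdatasync() behaviour described above.

```cpp
#include <sys/stat.h>
#include <cassert>

// Hypothetical helper (not part of InnoDB): report whether two paths
// reside on the same mounted file system, judged by their st_dev values.
static bool same_file_system(const char *a, const char *b)
{
  assert(a && b);
  struct stat sa, sb;
  // Be conservative: if either path cannot be examined, answer "no",
  // so that a caller would keep issuing its own fdatasync().
  if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
    return false;
  return sa.st_dev == sb.st_dev;
}
```

A caller would pass the paths of the redo log and of the system tablespace, and it would consider skipping the doublewrite flush only if the result is true.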

After the fix of ~~MDEV-38968~~, log checkpoints are only invoked by buf_flush_page_cleaner(). However, page writes can still be initiated by multiple threads, and we can't remove or replace buf_flush_list_space() easily, because the throttling mechanism of innodb_encryption_threads depends on it. Likewise, log_write_up_to() can be invoked from pretty much any thread.

A simple change could be to introduce an atomic flag that would be cleared upon completion of any os_file_flush() to the file system where the doublewrite buffer resides. This would allow us to avoid redundant calls in the doublewrite code path:

diff --git a/storage/innobase/buf/buf0dblwr.cc b/storage/innobase/buf/buf0dblwr.cc
index fb4a7bc5d99..6818ad8190c 100644
--- a/storage/innobase/buf/buf0dblwr.cc
+++ b/storage/innobase/buf/buf0dblwr.cc
@@ -721,7 +721,8 @@ void buf_dblwr_t::flush_buffered_writes_completed(const IORequest &request)
   log_checkpoint(). Writes to the system tablespace should be rare,
   except when executing DDL or using the non-default settings
   innodb_file_per_table=OFF or innodb_undo_tablespaces=0. */
-  os_file_flush(request.node->handle);
+  if (flush_needed.test_and_set())
+    os_file_flush(request.node->handle);
   /* The writes have been flushed to disk now and in recovery we will
   find them in the doublewrite buffer blocks. Next, write the data pages. */
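The intended effect of the flag can be illustrated with a small single-threaded model. Everything here is hypothetical: dblwr_flush_model, file_flush() and doublewrite_batch_completed() are illustrative names rather than InnoDB symbols. The model assumes that the flag starts out set and that the batch's own flush leaves it set, neither of which the issue spells out.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical model of the proposed flag; not InnoDB code.
struct dblwr_flush_model
{
  // Set: no other flush of the file system has completed since the
  // previous doublewrite batch, so the next batch must flush itself.
  std::atomic_flag flush_needed = ATOMIC_FLAG_INIT;
  unsigned flush_calls = 0; // simulated fdatasync() calls

  // Assumption: start with the flag set, so that the first batch flushes.
  dblwr_flush_model() { flush_needed.test_and_set(); }

  // Stand-in for an os_file_flush() issued elsewhere (for example, on
  // the redo log) against the same file system.
  void file_flush()
  {
    flush_calls++;
    flush_needed.clear();
  }

  // Stand-in for the patched step in flush_buffered_writes_completed().
  // Returns true if this batch had to issue its own flush.
  bool doublewrite_batch_completed()
  {
    if (!flush_needed.test_and_set())
      return false; // another flush completed since the previous batch
    flush_calls++;  // own flush; assumption: the flag stays set afterwards
    return true;
  }
};
```

In this model, a batch skips its flush only when file_flush() ran since the previous batch. Two batches with no flush in between each pay for their own. The model does not capture whether the intervening flush completed after the doublewrite pages were written, which a real change would presumably have to ensure.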

Issue Links

is caused by

MDEV-30136 Map innodb_flush_method to new settable Booleans innodb_{log,data}_file_{buffering,write_through}

Closed

relates to

MDEV-33545 Perf regression from removing innodb_flush_method=O_DIRECT_NO_FSYNC

Closed

People

Assignee: Unassigned

Reporter: Marko Mäkelä

Votes: 2

Watchers: 3

Dates

Created: 2026-04-10 05:03

Updated: 2026-04-10 05:09
