The following functions are calling os_aio_wait_until_no_pending_writes():
- buf_dblwr_t::flush_buffered_writes() (only when the doublewrite buffer is disabled)
- log_flush() (log_flush_task is to be removed in
MDEV-25948)
- fil_close_tablespace() (if an error occurs during ALTER TABLE…IMPORT TABLESPACE): We do wait for the reference count to reach 0, and the reference count is decremented at the end of fil_aio_callback(). There should be no fdatasync() for files that are in the process of being imported.
- row_quiesce_table_start() (FLUSH TABLE…FOR EXPORT): We will already wait until buf_page_write_complete() has been invoked on any page that was dirty.
Based on my analysis above, it does seem that the entire function os_aio_wait_until_no_pending_writes() can be safely removed.
One thing that must remain is the fdatasync() or fsync() or equivalent call on the data files, in case an asynchronous write is not durable by default. (I think it should be, for native asynchronous I/O, but writes via simulated aio are likely not durable, unless O_DIRECT or similar is being used.)
The fil_flush_file_spaces() call before log checkpoint (as well as the single-tablespace equivalent in row_quiesce_table_start() a.k.a. FLUSH TABLES…FOR EXPORT}) seems to be necessary for correctness.
Before the log checkpoint can be safely advanced (and the start of the log up to the checkpoint LSN discarded), we must wait for all page writes up to the checkpoint LSN to have been truly completed.
Thinking aloud: It might be useful to move the log checkpoint to a separate asynchronous task, fired by buf_flush_page_cleaner() at the end of each batch. In that way, the wait for pending writes to complete would not cause delays to the page flushing itself. But, in my benchmarks on Linux ext4 using libaio or io_uring, fil_flush_file_spaces() never showed up in performance profiles. I suspect that fdatasync() is basically a no-op. So, I would not complicate things by trying to optimize this for operating systems for which we lack an asynchronous I/O interface.
When it comes to calls to fil_flush_file_spaces():
- The calls in buf_dblwr are actually overkill, and we would only want some kind of write barrier, to prevent reordering of writes by the storage device (or by the operating system when we use simulated AIO). Until operating systems start to provide a better system call, we cannot change this.
- The call in log_flush() should be unnecessary (but typically no-op on Linux or Microsoft Windows, as noted above).
- Each caller of log_checkpoint_low() already invokes fil_flush_file_spaces(), as they should.
- fil_node_open_file() (part of fil_system.LRU) must flush files before they can be closed. It is the only way to reliably catch write errors; remember Postgres’ fsync issue.
- fil_system_t::close_all() (executed at shutdown) probably does not need the call, because a final log checkpoint should have been executed, but it does not hurt.
- fil_write_flushed_lsn() (executed at shutdown, and should really be removed
MDEV-14425 so that we would refuse to start up without a redo log) requires this call.
The following functions are calling os_aio_wait_until_no_pending_writes():
MDEV-25948)Based on my analysis above, it does seem that the entire function os_aio_wait_until_no_pending_writes() can be safely removed.
One thing that must remain is the fdatasync() or fsync() or equivalent call on the data files, in case an asynchronous write is not durable by default. (I think it should be, for native asynchronous I/O, but writes via simulated aio are likely not durable, unless O_DIRECT or similar is being used.)
The fil_flush_file_spaces() call before log checkpoint (as well as the single-tablespace equivalent in row_quiesce_table_start() a.k.a. FLUSH TABLES…FOR EXPORT}) seems to be necessary for correctness.
Before the log checkpoint can be safely advanced (and the start of the log up to the checkpoint LSN discarded), we must wait for all page writes up to the checkpoint LSN to have been truly completed.
Thinking aloud: It might be useful to move the log checkpoint to a separate asynchronous task, fired by buf_flush_page_cleaner() at the end of each batch. In that way, the wait for pending writes to complete would not cause delays to the page flushing itself. But, in my benchmarks on Linux ext4 using libaio or io_uring, fil_flush_file_spaces() never showed up in performance profiles. I suspect that fdatasync() is basically a no-op. So, I would not complicate things by trying to optimize this for operating systems for which we lack an asynchronous I/O interface.
When it comes to calls to fil_flush_file_spaces():
MDEV-14425so that we would refuse to start up without a redo log) requires this call.