MariaDB Server
MDEV-25954

Remove superfluous os_aio_wait_until_no_pending_writes() from Innodb

Details

    Description

      There are several calls to os_aio_wait_until_no_pending_writes() in the codebase, and not all of them are needed. The call should only be retained where it is needed for correctness, since it can add nothing but stalls. In the past, it may have been used to work around certain bugs, such as MDEV-24313 / MDEV-25953.

      Attachments

        Issue Links

          Activity

            The following functions are calling os_aio_wait_until_no_pending_writes():

            • buf_dblwr_t::flush_buffered_writes() (only when the doublewrite buffer is disabled)
            • log_flush() (log_flush_task is to be removed in MDEV-25948)
            • fil_close_tablespace() (if an error occurs during ALTER TABLE…IMPORT TABLESPACE): We do wait for the reference count to reach 0, and the reference count is decremented at the end of fil_aio_callback(). There should be no fdatasync() for files that are in the process of being imported.
            • row_quiesce_table_start() (FLUSH TABLE…FOR EXPORT): We will already wait until buf_page_write_complete() has been invoked on any page that was dirty.

            Based on my analysis above, it does seem that the entire function os_aio_wait_until_no_pending_writes() can be safely removed.

            One thing that must remain is the fdatasync() or fsync() or equivalent call on the data files, in case an asynchronous write is not durable by default. (I think it should be, for native asynchronous I/O, but writes via simulated aio are likely not durable, unless O_DIRECT or similar is being used.)

            The fil_flush_file_spaces() call before a log checkpoint (as well as the single-tablespace equivalent in row_quiesce_table_start(), a.k.a. FLUSH TABLES…FOR EXPORT) seems to be necessary for correctness.

            Before the log checkpoint can be safely advanced (and the start of the log up to the checkpoint LSN discarded), we must wait for all page writes up to the checkpoint LSN to have been truly completed.

            Thinking aloud: It might be useful to move the log checkpoint to a separate asynchronous task, fired by buf_flush_page_cleaner() at the end of each batch. In that way, the wait for pending writes to complete would not cause delays to the page flushing itself. But, in my benchmarks on Linux ext4 using libaio or io_uring, fil_flush_file_spaces() never showed up in performance profiles. I suspect that fdatasync() is basically a no-op. So, I would not complicate things by trying to optimize this for operating systems for which we lack an asynchronous I/O interface.

            When it comes to calls to fil_flush_file_spaces():

            • The calls in buf_dblwr are actually overkill, and we would only want some kind of write barrier, to prevent reordering of writes by the storage device (or by the operating system when we use simulated AIO). Until operating systems start to provide a better system call, we cannot change this.
            • The call in log_flush() should be unnecessary (but typically no-op on Linux or Microsoft Windows, as noted above).
            • Each caller of log_checkpoint_low() already invokes fil_flush_file_spaces(), as they should.
            • fil_node_open_file() (part of fil_system.LRU) must flush files before they can be closed. It is the only way to reliably catch write errors; remember Postgres’ fsync issue.
            • fil_system_t::close_all() (executed at shutdown) probably does not need the call, because a final log checkpoint should have been executed, but it does not hurt.
            • fil_write_flushed_lsn() (executed at shutdown; it should really be removed in MDEV-14425 so that we would refuse to start up without a redo log) requires this call.
            marko Marko Mäkelä added a comment -

            It turns out that FLUSH TABLES…FOR EXPORT must wait for the writes to complete, because the request to clean the pages would merely ensure that asynchronous writes will have been initiated. Any other page flushing should be sped up by removing this unnecessary waiting.

            Please review (and also run benchmarks).

            marko Marko Mäkelä added a comment -
            sayap Yap Sok Ann added a comment -

            I thought the waiting was still necessary in buf_dblwr_t::flush_buffered_writes(), so that it serves as a write barrier between the io_submit() system calls and the fsync() system calls, i.e. the flow described in https://stackoverflow.com/questions/66073136/what-will-be-if-i-use-libaio-fsync/66098696#66098696

            If there is still some in-flight I/O from buf_flush_lists() and fsync() is called in the middle of it without waiting, the oldest LSN retrieved by log_checkpoint() may potentially not have reached the disk yet.

            sayap Yap Sok Ann added a comment - - edited

            On further thought, the waiting in buf_dblwr_t::flush_buffered_writes() does seem unnecessary. However, in log_checkpoint(), shouldn't we get the oldest LSN first, and only then do the fsync()? As it is, between the calls to fil_flush_file_spaces() and get_oldest_modification(), some dirty pages could have been freed without having been fsync'ed to disk yet?

            EDIT: It looks like the logic of log_checkpoint() has been like this since the very beginning (commit c533308a). I don't quite understand how it is safe.


            sayap, I tried to ask something similar in https://stackoverflow.com/questions/62216860/does-fsync-flushfilebuffers-wait-for-outstanding-asynchronous-ios-to-finish/62220924 but did not get an answer that would satisfy me. The question https://stackoverflow.com/questions/66073136/what-will-be-if-i-use-libaio-fsync/66098696#66098696 is more specific.

            In the initial public release of InnoDB code, there was no asynchronous I/O code for Linux. Only "simulated asynchronous I/O" was being used. We still use that outside Linux and Microsoft Windows.

            I can only offer some anecdotal evidence: When monitoring the performance of MariaDB 10.5 or 10.6 with perf record, I never noticed fdatasync() consuming any noticeable amount of time when using asynchronous I/O (io_submit() or io_uring()) in combination with O_DIRECT. If I remember correctly, that was the case even when I tested on a rotational SATA 3.0 hard disk. During the development of MDEV-23855, we did observe that synchronous pwrite() would seriously contend with fsync() or fdatasync(). That is why we made the doublewrite buffer use asynchronous writes.

            I would be very careful before extending the use of asynchronous I/O APIs. We had to dumb down our io_uring() interface in MariaDB 10.6, to avoid -ENOSYS errors on older kernels. We have no idea which kernels our code might run on. It might even be Windows or FreeBSD emulating Linux. Therefore, we probably should avoid changing this within any GA release series.

            All that said, code contributions are welcome.

            marko Marko Mäkelä added a comment -

            We also made the doublewrite buffer asynchronous because it increases concurrency. There are 2 slots in the doublewrite buffer; while one slot is being filled, the other one is being written to disk. That is, stalls occur only when one slot is full and needs to be written to disk while the other is still busy being written (and cannot be used).

            wlad Vladislav Vaintroub added a comment -

            Before MDEV-23855 there were two 64-page slots in the doublewrite buffer, which were written by a synchronous write in alternating fashion. With MDEV-23855, we have two 128-page memory buffers, and typically (if the two 64-page slots are adjacent in the system tablespace file, which we cannot guarantee if someone upgraded a very old data directory) we will perform an asynchronous write of those 128 pages (or 2 writes of 64+64 pages) while the 128-page memory buffer can be filled with a new batch.

            A further optimization was made in MDEV-25948: we may initiate the doublewrite batch already before the log has been written up to that LSN. Only after the doublewrite batch completes, before writing each individual page to the final location, we will ensure that the log has been durably written at least up to the FIL_PAGE_LSN.

            marko Marko Mäkelä added a comment -

            People

              marko Marko Mäkelä
              wlad Vladislav Vaintroub
              Votes: 0
              Watchers: 3

