[MDEV-25954] Remove superfluous os_aio_wait_until_no_pending_writes() from Innodb Created: 2021-06-17 Updated: 2021-06-28 Resolved: 2021-06-23 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.5, 10.6 |
| Fix Version/s: | 10.5.12, 10.6.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Vladislav Vaintroub | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||
| Description |
|
There are several os_aio_wait_until_no_pending_writes() in the codebase, not all of them needed. It should be only left where it is needed for correctness, since this call can add nothing but stalls. In the past, it could have been used to workaround certain bugs, like |
| Comments |
| Comment by Marko Mäkelä [ 2021-06-18 ] |
|
The following functions are calling os_aio_wait_until_no_pending_writes():
Based on my analysis above, it does seem that the entire function os_aio_wait_until_no_pending_writes() can be safely removed. One thing that must remain is the fdatasync() or fsync() or equivalent call on the data files, in case an asynchronous write is not durable by default. (I think it should be, for native asynchronous I/O, but writes via simulated aio are likely not durable, unless O_DIRECT or similar is being used.) The fil_flush_file_spaces() call before log checkpoint (as well as the single-tablespace equivalent in row_quiesce_table_start() a.k.a. FLUSH TABLES…FOR EXPORT}) seems to be necessary for correctness. Before the log checkpoint can be safely advanced (and the start of the log up to the checkpoint LSN discarded), we must wait for all page writes up to the checkpoint LSN to have been truly completed. Thinking aloud: It might be useful to move the log checkpoint to a separate asynchronous task, fired by buf_flush_page_cleaner() at the end of each batch. In that way, the wait for pending writes to complete would not cause delays to the page flushing itself. But, in my benchmarks on Linux ext4 using libaio or io_uring, fil_flush_file_spaces() never showed up in performance profiles. I suspect that fdatasync() is basically a no-op. So, I would not complicate things by trying to optimize this for operating systems for which we lack an asynchronous I/O interface. When it comes to calls to fil_flush_file_spaces():
|
| Comment by Marko Mäkelä [ 2021-06-18 ] |
|
It turns out that FLUSH TABLES…FOR EXPORT must wait for the writes to complete, because the request to clean the pages would merely ensure that asynchronous writes will have been initiated. Any other page flushing should be sped up by removing this unnecessary waiting. Please review (and also run benchmarks). |
| Comment by Yap Sok Ann [ 2021-06-22 ] |
|
I thought the waiting is still necessary in buf_dblwr_t::flush_buffered_writes(), so that it serves as write barrier in between the io_submit() system calls and the fsync() system calls? i.e. the flow described in https://stackoverflow.com/questions/66073136/what-will-be-if-i-use-libaio-fsync/66098696#66098696 If there are still some in-flight I/O from buf_flush_lists() and then fynsc() is called in the middle of it without waiting, potentially the oldest LSN retrieved by log_checkpoint() may not have reached the disk yet. |
| Comment by Yap Sok Ann [ 2021-06-23 ] |
|
On further thought, the waiting in buf_dblwr_t::flush_buffered_writes() does seem unnecessary. However, in log_checkpoint(), shouldn't we get the oldest LSN first, then only do the fsync()? As it is, in between the calls to fil_flush_file_spaces() and get_oldest_modification(), some dirty pages could have been freed without being fsync'ed to the disk yet? EDIT: It looks like the logic of log_checkpoint() has been like this since the very beginning (commit c533308a). I don't quite understand how it is safe. |
| Comment by Marko Mäkelä [ 2021-06-24 ] |
|
sayap, I tried to ask something similar in https://stackoverflow.com/questions/62216860/does-fsync-flushfilebuffers-wait-for-outstanding-asynchronous-ios-to-finish/62220924 but did not get an answer that would satisfy me. The question https://stackoverflow.com/questions/66073136/what-will-be-if-i-use-libaio-fsync/66098696#66098696 is more specific. In the initial public release of InnoDB code, there was no asynchronous I/O code for Linux. Only "simulated asynchronous I/O" was being used. We still use that outside Linux and Microsoft Windows. I can only offer some anecdotal evidence: When monitoring the performance of MariaDB 10.5 or 10.6 with perf record, I never noticed fdatasync() consuming any noticeable amount of time when using asynchronous I/O (io_submit() or io_uring()) in combination with O_DIRECT. If I remember correctly, that was the case even when I tested on a rotational SATA 3.0 hard disk. During the development of I would be very careful before extending the use of asynchronous I/O APIs. We had to dumb down our io_uring() interface in MariaDB 10.6, to avoid -ENOSYS errors on older kernels. We have no idea which kernels our code might run on. It might even be Windows or FreeBSD emulating Linux. Therefore, we probably should avoid changing this within any GA release series. All that said, code contributions are welcome. |
| Comment by Vladislav Vaintroub [ 2021-06-24 ] |
|
We also made doublewrite buffer asynchronous, because it increases concurrency. There are 2 slots in doublewrite, and while one slot is being filled, another one is being written to disk. That is, the stalls occur only when the one slot is full, and needs to be written to disk, and the one that is being written, is still busy (can't be used) |
| Comment by Marko Mäkelä [ 2021-06-24 ] |
|
Before A further optimization was made in |