[MDEV-11378] AliSQL: [Perf] Issue#23 MERGE INNODB AIO REQUEST Created: 2016-11-29  Updated: 2023-12-22

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 11.5

Type: Task Priority: Major
Reporter: Sergey Vojtovich Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 3
Labels: linux, performance

Issue Links:
Relates
relates to MDEV-26547 Restoring InnoDB buffer pool dump is ... Closed
relates to MDEV-30986 Slow full index scan in 10.6 vs 10.5 ... Closed
relates to MDEV-32067 InnoDB linear read ahead had better b... Open
relates to MDEV-16526 Overhaul the InnoDB page flushing Closed
relates to MDEV-31095 Create separate tpool thread for asyn... Closed
Epic Link: AliSQL patches

 Description   

Description:
------------
The InnoDB engine supports both native AIO and simulated AIO on Linux.
Native AIO uses io_submit(), provided by libaio, to submit I/O requests.
However, when read-ahead is triggered, InnoDB submits the AIO requests one at a time through io_submit(),
which is somewhat inefficient.
 
Solution:
---------
We buffer the AIO requests and then submit them all with a single io_submit() call.
For example, during linear read-ahead we buffer the I/O requests for the next 64 pages
and then submit all of them at once.

https://github.com/alibaba/AliSQL/commit/4c9d1c72b9db5f7d2267906e0fa6d66948f5dc6c
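
A minimal sketch of the buffering idea described above, not the code from the AliSQL commit. The names PageRead and ReadBatch are hypothetical, and the submit callback is a stand-in for a single io_submit() call that passes an array of requests:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical description of one page read; not an InnoDB type.
struct PageRead {
  std::uint32_t page_no;
};

// Collects requests and hands them to the submit callback in one call.
// In a real build the callback would wrap one io_submit() invocation;
// here it is only a stand-in so that the batching logic can be shown.
class ReadBatch {
 public:
  using Submit = std::function<void(const std::vector<PageRead>&)>;

  ReadBatch(std::size_t capacity, Submit submit)
      : capacity_(capacity), submit_(std::move(submit)) {
    assert(capacity_ > 0);
    pending_.reserve(capacity_);
  }

  // Buffer one request; flush automatically when the batch is full.
  void add(PageRead r) {
    pending_.push_back(r);
    if (pending_.size() == capacity_) flush();
  }

  // Submit everything buffered so far with one callback invocation.
  void flush() {
    if (pending_.empty()) return;
    submit_(pending_);
    pending_.clear();
  }

 private:
  std::size_t capacity_;
  Submit submit_;
  std::vector<PageRead> pending_;
};
```

With a capacity of 64, adding the 64 requests of one linear read-ahead results in exactly one submit callback instead of 64.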



 Comments   
Comment by Marko Mäkelä [ 2017-11-17 ]

I think that wlad should review this for Windows and work with someone to evaluate the possible performance benefit on Linux.

Comment by Marko Mäkelä [ 2019-04-30 ]

Sorry for letting this slip. The change does not apply cleanly to MariaDB 10.2 or later. I will port it, so that we can evaluate it.

Comment by Marko Mäkelä [ 2020-12-20 ]

There has not been any progress on this yet, other than that MDEV-23855 is employing asynchronous writes for the doublewrite buffer (usually, all 128 pages written in a single request). In buf_dblwr_t::flush_buffered_writes_completed() it should be possible to submit a single IOCB_CMD_PWRITEV operation instead of multiple single-page IOCB_CMD_PWRITE. It should be even simpler to combine the writes of multiple adjacent pages into a single write request.

It might make sense to always use buf_dblwr.add_to_batch() even when the doublewrite buffer is disabled, to have buf_flush_page() first fill a scatter-gather buffer, which would then be optimized (by combining adjacent writes) in buf_dblwr_t::flush_buffered_writes(). If the doublewrite buffer is disabled, the function would directly submit the final page write requests.
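
To illustrate the scatter-gather idea, here is a rough sketch that writes several pages that are adjacent in the file but scattered in memory with one call. It uses the synchronous pwritev() as a stand-in for an asynchronous IOCB_CMD_PWRITEV request. The helper names build_iovecs and write_adjacent_pages are hypothetical and do not exist in InnoDB:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#include <sys/types.h>
#include <sys/uio.h>

// Describe each in-memory page frame as one iovec element.
std::vector<iovec> build_iovecs(const std::vector<const void*>& pages,
                                std::size_t page_size) {
  assert(page_size > 0);
  std::vector<iovec> iov(pages.size());
  for (std::size_t i = 0; i < pages.size(); ++i) {
    // iovec is shared by reads and writes, hence the non-const pointer.
    iov[i].iov_base = const_cast<void*>(pages[i]);
    iov[i].iov_len = page_size;
  }
  return iov;
}

// Write all pages with a single pwritev() starting at first_offset.
// Returns whatever pwritev() returns (-1 on error); a real caller
// would also have to handle short writes and the IOV_MAX limit.
ssize_t write_adjacent_pages(int fd, const std::vector<const void*>& pages,
                             std::size_t page_size, off_t first_offset) {
  assert(!pages.empty());
  std::vector<iovec> iov = build_iovecs(pages, page_size);
  return pwritev(fd, iov.data(), static_cast<int>(iov.size()), first_offset);
}
```

The pages must be consecutive in the file for this to work, which is why the combining step would have to sort the buffered writes and detect adjacent runs first.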

Comment by Marko Mäkelä [ 2021-02-26 ]

As far as I can tell, this is only combining background read-ahead requests. Outside read-ahead, all reads are synchronous, and only page writes are asynchronous.

I would like to see a benchmark that demonstrates the need for this.

In MDEV-24883 we are adding support for the liburing interface, a modern replacement of libaio. That library is supposed to reduce overhead. Maybe it will reduce the need for this? Combining requests would complicate our code and might introduce hard-to-reproduce bugs, because read-ahead may be hard to trigger in tests.

Also, MDEV-24854 changed innodb_flush_method=O_DIRECT to be the default. My understanding is that io_submit() may perform a significant amount of work unless the file is in O_DIRECT mode. I wonder if that was the root cause that this patch attempted to fix.

Comment by Marko Mäkelä [ 2021-09-15 ]

I wonder whether combining requests would make any sense at all with modern storage devices, which should have deep work queues and could combine requests at the low level by themselves. I do not know for sure, but I could believe that even on an HDD a native command queue could implement the ‘elevator algorithm’ for optimizing the head movements.

One reason against combining read requests would seem to be that if we completed the reads of multiple pages at once, then we would be validating page checksums within only one execution thread. If we received read completion callbacks for each individual page, then multiple checksums could be calculated in parallel and we could utilize the I/O capacity better. It still was nowhere near the maximum capacity of a fast NVMe when I tested MDEV-26547.

Comment by Michael Widenius [ 2021-09-15 ]

Modern storage devices do bigger internal reads, but only 'around' the requested page, not forward from the current page.
For example, on an SSD with 128K internal reads, if you read a page starting at 64K, it will read data from 0-128K.
Newer SSDs based on persistent memory will read only exactly what you ask for. However, if you can read out what you are likely to use, that will be faster than many independent reads, even on these kinds of devices.
The ONLY way to know is to run benchmarks on a set of devices:
Modern hard disk, modern SSD and on persistent memory devices.
(Hard disks will be used on the cloud for the foreseeable future just because they are MUCH cheaper)

Another thing is that doing one kernel request instead of 64 is still much better!
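
The 'around the requested page' arithmetic from the SSD example above can be sketched as follows. The function name device_read_range is hypothetical, and the model assumes that the device always fetches one whole aligned block:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Half-open byte range [first, second) that a device with the given
// internal read size would fetch for a request starting at `offset`,
// assuming that it reads exactly one whole aligned block.
std::pair<std::uint64_t, std::uint64_t> device_read_range(
    std::uint64_t offset, std::uint64_t block_size) {
  assert(block_size > 0);
  const std::uint64_t start = offset - offset % block_size;
  return {start, start + block_size};
}
```

With 128K blocks, a page at offset 64K maps to the range 0-128K. A 16K page at offset 112K maps to the same range, so under this model nothing that follows that page is fetched by the device on its own.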

Comment by Marko Mäkelä [ 2023-04-18 ]

I agree that it could make sense to merge read I/O requests at least when initializing the buffer pool according to the ib_buffer_pool file. Each read request could comprise multiple adjacent pages (say, 64 pages or 1 megabyte per request). Multi-threaded processing would still be possible.
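
A rough sketch of how page numbers could be grouped into such multi-page requests. For simplicity it assumes a single tablespace, whereas a real buffer pool dump also carries a tablespace identifier per page. ReadRequest and coalesce_reads are hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One read request covering n_pages consecutive pages.
struct ReadRequest {
  std::uint32_t first_page;
  std::uint32_t n_pages;
};

// Sort and deduplicate the page numbers, then merge runs of adjacent
// pages into requests of at most max_pages pages each.
std::vector<ReadRequest> coalesce_reads(std::vector<std::uint32_t> pages,
                                        std::uint32_t max_pages) {
  assert(max_pages > 0);
  std::sort(pages.begin(), pages.end());
  pages.erase(std::unique(pages.begin(), pages.end()), pages.end());

  std::vector<ReadRequest> out;
  for (std::uint32_t p : pages) {
    if (!out.empty() && out.back().n_pages < max_pages &&
        out.back().first_page + out.back().n_pages == p) {
      ++out.back().n_pages;  // p directly follows the current run
    } else {
      out.push_back(ReadRequest{p, 1});  // start a new run
    }
  }
  return out;
}
```

With max_pages = 64 and 16K pages, a full request covers 64 × 16K = 1 megabyte. Each resulting request is independent, so the completed requests could still be handed to different threads.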

Comment by Vladislav Vaintroub [ 2023-05-29 ]

I guess this all needs some experimentation to prove whether the increased complexity here is justified. I'm not entirely convinced that submitting and processing 64x16K asynchronous IO requests in sorted order would be much slower than submitting a 1x1MB request and then processing 64x16K pages, be it on NVMe, SSD or hard disk.

Comment by Marko Mäkelä [ 2023-08-11 ]

Recent experience in MDEV-30986 suggests that a read-ahead of multiple adjacent pages in a single request could be well worth the added complexity.

When it comes to page writes in buf_flush_page_cleaner(), possibly we could check if buf_dblwr_t::flush_buffered_writes_completed() could submit a single scatter-gather write request when a write of up to 128 pages has completed. Similarly, when the doublewrite buffer is disabled or not needed (MDEV-19738), we might try to include multiple pages in a single write request.

Generated at Thu Feb 08 07:49:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.