[MDEV-28665] aio_uring::thread_routine terminates prematurely, causing InnoDB to hang Created: 2022-05-25 Updated: 2022-07-28 Resolved: 2022-05-25
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6 |
| Fix Version/s: | 10.6.9, 10.7.5, 10.8.4, 10.9.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jan Lindström (Inactive) | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
| Description |
| Comments |
| Comment by Marko Mäkelä [ 2022-05-25 ] |
The server error log says:
If the timeout is 10 minutes, the hang should have started somewhere between the two timestamped messages. In the mtr log we can see that the holder of the exclusive dict_sys.latch, Thread 16 (Thread 0x7f54c07dc700 (LWP 28624)), is waiting for a block to become available in the buffer pool:
All other InnoDB threads are idle. No thread can signal the buf_pool.done_free that is being waited for. But, it is also interesting that the aio_uring::thread_routine() is absent, that is, no thread is collecting I/O completion events by invoking io_uring_cqe_get_data(). It turns out that this thread can terminate without any notice, in two ways (the two break statements below):
There is no mechanism to directly catch this; we would only notice that InnoDB fails to advance the log checkpoint, or fails to evict dirty pages from the buffer pool. Both events could cause hangs, eventually. Unfortunately, mtr often silently and forcibly kills the server, without giving a proper shutdown a chance to fail (hang). I think that the thread must only be allowed to exit if the AIO subsystem has been closed. In any other case, we should report an error and retry. Especially the EINTR handling looks suspicious to me; any system call should be expected to return EINTR if a signal (even a non-fatal one) was sent to the process.
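The policy proposed above (retry on EINTR, report any other error, and exit only once the AIO subsystem has been closed) could be sketched roughly as follows. This is not the MariaDB code: `wait_result`, `completion_loop` and the scripted input are hypothetical stand-ins for the real io_uring wait call, and the return code is assumed to be 0 on success or a negative errno value.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <deque>

// Hypothetical stand-in for one result of a blocking "wait for completion" call.
// rc is assumed to be 0 on success or a negative errno value on failure.
struct wait_result {
  int rc;
  bool shutdown;  // true only for the completion posted while closing the AIO subsystem
};

// Drains a scripted sequence of results and returns the number of
// ordinary completions that were handled before the loop ended.
int completion_loop(std::deque<wait_result> script)
{
  int handled = 0;
  while (!script.empty()) {
    wait_result r = script.front();
    script.pop_front();
    assert(r.rc <= 0);
    if (r.rc == -EINTR)
      continue;  // a signal interrupted the wait: simply retry
    if (r.rc < 0) {
      // Report any other error and retry instead of exiting silently.
      std::fprintf(stderr, "wait failed: %s; retrying\n", std::strerror(-r.rc));
      continue;
    }
    if (r.shutdown)
      break;     // the only legitimate reason to leave the loop
    ++handled;   // real code would invoke the I/O completion callback here
  }
  return handled;
}
```

With a script that mixes successes, an `-EINTR`, another error and finally a shutdown entry, only the successful completions before the shutdown entry are counted. Whether a non-EINTR error should be retried or should abort the server is a design decision that this sketch does not settle.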
| Comment by Marko Mäkelä [ 2022-05-25 ] |
The following fix is not causing any trouble for me: no hangs during mtr, and also no memory leaks reported by AddressSanitizer (in case the thread would busy-loop) while another thread is terminating the process:
This is simply following the common pattern that any system call could return EINTR in case any signal was delivered to the process. In aio_linux::getevent_thread_routine() we were already handling EINTR from io_getevents(2) in that way:
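As a generic illustration of that retry pattern (not the aio_linux code itself), a small wrapper can repeat a call for as long as it fails with EINTR. The names `retry_on_eintr` and `flaky_call` are hypothetical, and the sketch assumes the POSIX convention of returning -1 and setting errno; interfaces that report the error as a negative return value would need the comparison adjusted.

```cpp
#include <cassert>
#include <cerrno>

// Hypothetical helper: calls fn() again for as long as it fails with EINTR.
// fn is assumed to return -1 and set errno on failure (POSIX convention).
template <typename Fn>
auto retry_on_eintr(Fn fn) -> decltype(fn())
{
  decltype(fn()) ret;
  do {
    ret = fn();
  } while (ret == -1 && errno == EINTR);
  return ret;
}

// Stand-in for a system call that is interrupted a few times before it succeeds.
struct flaky_call {
  int interruptions_left;
  long operator()()
  {
    if (interruptions_left > 0) {
      --interruptions_left;
      errno = EINTR;
      return -1;
    }
    return 42;  // arbitrary success value
  }
};
```

For example, `retry_on_eintr(flaky_call{2})` invokes the callable three times and returns its success value, while a call that fails with any other errno is returned to the caller unchanged after the first attempt.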
| Comment by Marko Mäkelä [ 2022-05-25 ] |
wlad pointed out that io_uring_cqe_get_data() would only return a null pointer if aio_uring::~aio_uring() submits a request whose user data it has set to a null pointer by io_uring_sqe_set_data(). So, that is the shutdown mechanism. Other submitted requests will always include a non-null pointer.
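The null-pointer shutdown convention described here can be illustrated without io_uring. In the sketch below, `completion_queue` and `run_collector` are hypothetical names, and a mutex-protected queue of `void*` stands in for the completion queue and its user data; it only shows the idea that ordinary requests carry non-null data and a single null entry tells the collector thread to exit.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical stand-in for a completion queue whose entries carry a user data pointer.
class completion_queue {
  std::mutex m;
  std::condition_variable cv;
  std::queue<void*> q;
public:
  void post(void* user_data)
  {
    {
      std::lock_guard<std::mutex> lk(m);
      q.push(user_data);
    }
    cv.notify_one();
  }
  void* wait()
  {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [this] { return !q.empty(); });
    void* d = q.front();
    q.pop();
    return d;
  }
};

// Runs a collector thread, posts `requests` ordinary completions followed by
// the null sentinel, and returns how many completions the thread handled.
int run_collector(int requests)
{
  completion_queue cq;
  int handled = 0;
  std::thread collector([&] {
    while (void* data = cq.wait())  // null user data is the shutdown request
      ++*static_cast<int*>(data);   // stand-in for the completion callback
  });
  for (int i = 0; i < requests; i++)
    cq.post(&handled);              // ordinary requests always carry non-null data
  cq.post(nullptr);                 // what a destructor would send on shutdown
  collector.join();
  return handled;
}
```

Because the sentinel is queued behind the ordinary entries, the collector handles every earlier completion before it exits, which is the property the shutdown path relies on.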
| Comment by Marko Mäkelä [ 2022-06-02 ] |
For the record, I had observed a hang when running
and inside GDB, sending SIGINT and then
It turns out that the thread that was executing aio_uring::thread_routine would be the very first thread to terminate in response to that signal. The server would hang, because nothing would invoke any page write completion callback routines after the thread terminated. Upon sending another SIGINT to GDB, we would find a hung buf_flush_buffer_pool().