[MDEV-28665] aio_uring::thread_routine terminates prematurely, causing InnoDB to hang Created: 2022-05-25  Updated: 2022-07-28  Resolved: 2022-05-25

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.6
Fix Version/s: 10.6.9, 10.7.5, 10.8.4, 10.9.2

Type: Bug Priority: Critical
Reporter: Jan Lindström (Inactive) Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by MDEV-28725 Sequence of Galera tests fails on Deb... Closed
is duplicated by MDEV-29141 InnoDB hangs some time after performi... Closed

 Description   


 Comments   
Comment by Marko Mäkelä [ 2022-05-25 ]

The server error log says:

10.6 57e66dc7e601fb32067b095b92facb9b0bc494de

2022-05-24 19:26:00 0 [Note] InnoDB: Using liburing
2022-05-24 19:27:09 2 [Warning] WSREP: Ignoring error 'Invalid performance_schema usage' on query. Default database: 'test'. Query: 'TRUNCATE TABLE performance_schema.threads', Error_code: 1683
CURRENT_TEST: galera.galera_truncate_temporary
CURRENT_TEST: galera.galera_var_OSU_method
CURRENT_TEST: galera.galera_var_OSU_method2
CURRENT_TEST: galera.galera_var_auto_inc_control_off
2022-05-24 19:37:30 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/

If the timeout is 10 minutes, the hang should have started somewhere between the two timestamped messages.

In the mtr log we can see that the holder of exclusive dict_sys.latch, Thread 16 (Thread 0x7f54c07dc700 (LWP 28624)), is waiting for a block to become available in the buffer pool:

10.6 57e66dc7e601fb32067b095b92facb9b0bc494de

#2  __pthread_cond_wait (cond=cond@entry=0x55f60ba58e88 <buf_pool+17352>, mutex=mutex@entry=0x55f60ba54ac0 <buf_pool>) at pthread_cond_wait.c:638
No locals.
#3  0x000055f60b133837 in buf_LRU_get_free_block (have_mutex=have_mutex@entry=false) at ./storage/innobase/buf/buf0lru.cc:447

All other InnoDB threads are idle, so no thread can signal the buf_pool.done_free condition variable that is being waited on. It is also notable that aio_uring::thread_routine() is absent, that is, no thread is collecting I/O completion events by invoking io_uring_cqe_get_data().

It turns out that this thread can terminate without any notice, in two ways (the two break statements below):

  static void thread_routine(aio_uring *aio)
  {
    for (;;)
    {
      io_uring_cqe *cqe;
      if (int ret= io_uring_wait_cqe(&aio->uring_, &cqe))
      {
        if (ret == -EINTR) // this may occur during shutdown
          break;
        my_printf_error(ER_UNKNOWN_ERROR,
                        "io_uring_wait_cqe() returned %d\n",
                        ME_ERROR_LOG | ME_FATAL, ret);
        abort();
      }
 
      auto *iocb= static_cast<tpool::aiocb*>(io_uring_cqe_get_data(cqe));
      if (!iocb)
        break;

There is no mechanism to catch this directly; we would only notice that InnoDB fails to advance the log checkpoint, or fails to evict dirty pages from the buffer pool. Either event could eventually cause a hang. Unfortunately, mtr often silently and forcibly kills the server, without giving a proper shutdown a chance to fail (hang).

I think that the thread must only be allowed to exit once the AIO subsystem has been closed; in any other case, we should report an error and retry. The EINTR handling in particular looks suspicious to me: any system call should be expected to return EINTR if a signal (even a non-fatal one) was delivered to the process.
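
The retry-on-EINTR idiom is easy to state in isolation. A minimal generic sketch (illustration only, with a made-up wrapper around read(2), not MariaDB code):

  #include <cerrno>
  #include <unistd.h>

  // Illustration only: reissue a blocking system call whenever a signal
  // interrupts it, instead of treating EINTR as a reason to give up.
  static ssize_t read_no_eintr(int fd, void *buf, size_t count)
  {
    for (;;)
    {
      ssize_t n= read(fd, buf, count);
      if (n >= 0 || errno != EINTR)
        return n; // success, or a genuine error for the caller to handle
      // EINTR: a (possibly harmless) signal arrived; simply retry
    }
  }

Note that read(2) reports EINTR via errno, while liburing returns the negated errno value from io_uring_wait_cqe() itself, hence the -EINTR comparison in the loop above.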

Comment by Marko Mäkelä [ 2022-05-25 ]

The following fix has not caused any trouble for me: no hangs during mtr, and no memory leaks reported by AddressSanitizer (which could occur if the thread kept busy-looping while another thread is terminating the process):

diff --git a/tpool/aio_liburing.cc b/tpool/aio_liburing.cc
index a4a899d08d4..1201d4da49d 100644
--- a/tpool/aio_liburing.cc
+++ b/tpool/aio_liburing.cc
@@ -137,8 +137,8 @@ class aio_uring final : public tpool::aio
       io_uring_cqe *cqe;
       if (int ret= io_uring_wait_cqe(&aio->uring_, &cqe))
       {
-        if (ret == -EINTR) // this may occur during shutdown
-          break;
+        if (ret == -EINTR)
+          continue;
         my_printf_error(ER_UNKNOWN_ERROR,
                         "io_uring_wait_cqe() returned %d\n",
                         ME_ERROR_LOG | ME_FATAL, ret);

This simply follows the common pattern that any system call may return EINTR if a signal was delivered to the process. In aio_linux::getevent_thread_routine() we were already handling EINTR from io_getevents(2) in that way:

    for (;;)
    {
      switch (int ret= my_getevents(aio->m_io_ctx, 1, MAX_EVENTS, events)) {
      case -EINTR:
        continue;
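
Putting the original loop and the patch together, the corrected aio_uring wait loop reads roughly as follows (abridged from the code quoted above, with added comments; the request processing after the sentinel check is unchanged):

    for (;;)
    {
      io_uring_cqe *cqe;
      if (int ret= io_uring_wait_cqe(&aio->uring_, &cqe))
      {
        if (ret == -EINTR)
          continue; // interrupted by a signal: retry the wait
        my_printf_error(ER_UNKNOWN_ERROR,
                        "io_uring_wait_cqe() returned %d\n",
                        ME_ERROR_LOG | ME_FATAL, ret);
        abort();
      }

      auto *iocb= static_cast<tpool::aiocb*>(io_uring_cqe_get_data(cqe));
      if (!iocb)
        break; // null user data: the shutdown request (see the next comment)
      // ... process the completed request as before ...
    }

With this shape, the loop's only normal exit is the null-data shutdown request; every other failure either retries or aborts loudly.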

Comment by Marko Mäkelä [ 2022-05-25 ]

wlad pointed out that io_uring_cqe_get_data() will only return a null pointer if aio_uring::~aio_uring() submits a request whose user data was set to a null pointer via io_uring_sqe_set_data(). So that is the shutdown mechanism; all other submitted requests always carry a non-null pointer.
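
A minimal sketch of that handshake (my reconstruction from the description above, not the actual MariaDB code; the function name is made up):

  #include <liburing.h>

  // Wake the completion thread and tell it to exit: submit a no-op
  // request whose user data is nullptr, so that the thread sees
  // io_uring_cqe_get_data() == nullptr and breaks out of its loop.
  static void wake_completion_thread(io_uring *ring)
  {
    io_uring_sqe *sqe= io_uring_get_sqe(ring); // assumes a free SQE slot
    io_uring_prep_nop(sqe);
    io_uring_sqe_set_data(sqe, nullptr);       // the null "stop" sentinel
    io_uring_submit(ring);
  }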

Comment by Marko Mäkelä [ 2022-06-02 ]

For the record, I had observed a hang when running

./mtr --manual-gdb stress.ddl_innodb

and inside GDB, sending SIGINT and then

signal SIGQUIT

It turns out that the thread executing aio_uring::thread_routine would be the very first thread to terminate in response to that signal. The server would then hang, because nothing would invoke the page write completion callbacks after that thread had terminated. Upon sending another SIGINT to GDB, we would find a hung buf_flush_buffer_pool().
