[MDEV-25760] Assertion failure on io_uring_cqe_get_data() returning -EAGAIN Created: 2021-05-23 Updated: 2021-06-14 Resolved: 2021-06-14 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server |
| Affects Version/s: | 10.6.0, 10.6.1 |
| Fix Version/s: | 10.6.2 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Alexander Krizhanovsky (Inactive) | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
This is almost 100% reproducable crash on MariaDB server 10.6 as of 7e1ec1550ceff29a983bf799622d97b73b79ce43 compiled with -DWITH_URING=yes. I run sysbench-tpcc (https://github.com/Percona-Lab/sysbench-tpcc) prepare on 40-core machine as
against the similar 40-core machine with the mariadb server. After several minutes of the workload the server crashes. Backtrace:
Following patch
produces line
in the error log. |
| Comments |
| Comment by Marko Mäkelä [ 2021-05-24 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Error code 11 ought to be EAGAIN. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-05-24 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
krizhanovsky, which Linux kernel version and build are you using? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-24 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I use kernel 5.4.105-1.el7.elrepo.x86_64 from the CentOS 7 packages. The distro is outdated, so I build liburing on my own from the git sources (master branch, the last commit 8bfdf56caaca4dc58e230f07d33f875b7d0648c8). I built MariaDB server from source code using branch 10.6 as of commit 7e1ec1550ceff29a983bf799622d97b73b79ce43 . Build options
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-05-25 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
krizhanovsky, does this only happen for reads, or also for page writes? Our only asynchronous reads are for read-ahead. I see that linear read-ahead is enabled by default. Please try if the following settings would prevent the crashes:
Page writes should always be asynchronous, ever since If disabling the read-ahead cures this, I would like to declare that this is the way how to work around the kernel bug. Maybe we should even consider disabling linear read-ahead on SSD, just like like in | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-05-26 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I've made some instrumentation and found that there are only write operations leading to the failure. Switching off read ahead doesn't affect reproducability of the issues. However, I overstated that the issue is 100% reproducable - sometimes I need to make several runs to hit the assertion. I see cb->m_len = 2097152 just before the assertion in io_callback(). Unfortunately, this is a challenge to make BPF work in the outdated environment to trace the kernel - I spend some time today to make bpftrace run, but without success. If the tracing will help in debugging the problem, then I'll try to run it in a chroot'ed distro environment, e.g. from debootstrap. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-01 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
2097152 is 128*innodb_page_size (16KiB by default). That should only be initiated by buf_dblwr_t::flush_buffered_writes() and collected by buf_dblwr_t::flush_buffered_writes_completed(). Before | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-06-05 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
A relatively easy workaround for Linux bugs could be to retry with synchronous IO instead once EAGAIN was collected.
krizhanovsky, maybe you can check if that patch helps? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-06-10 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I tested the Vlad's patch and it works. This is neither a kernel nor liburing bug. I had a look into Linux 5.4.105 (the one I use in the tests) and the latest 5.4.125 as well as the current Linus tree - in all the kernels the corresponding code is the same. While liburing changed the code involved in the problem a little bit, the main logic remains the same. The problem is caused by the __io_uring_peek_cqe() called from io_uring_wait_eqe():
We firstly read the RB queue head, next we read the cqe->res and process an error if it's non-zero. The function in our case returns 0, so in this code we observer cqe->res = 0, but later, when we check it in io_uring_cqe_get_data() it's -EAGAIN. The kernel function io_cqring_fill_event() firstly moves the tail pointer and next writes the cqe->res:
It's easy to see that the place is racy: we can read old value of cqe->res in liburing and after that read the new value. Actually, this isn't a bug and liburing/examples/io_uring-cp.c implements following:
So I think the right patch should look like (the servers are still down after the recent outage, so I didn't test the patch yet):
This is very close to what Vlad proposed, but in io_uring way. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-06-11 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
krizhanovsky, thank you for your analysis. I have not looked at liburing or the kernel code, but I fully trust your expertise. Please file a pull request, and I will merge it as soon as possible. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-06-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Pull request is https://github.com/MariaDB/server/pull/1860 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Alexander Krizhanovsky (Inactive) [ 2021-06-14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The patch handles the very rare case of the race in when -EAGAIN error code is written to cqe->res between checking the value in __io_uring_peek_cqe() and reading it again via io_uring_cqe_get_data(). In normal cases io_uring_wait_eqe() returns an error code aio_uring::thread_routine() aborts. If there are too many I/O requests and there is not space in the io_uring ring buffers, then the patch leads to the same error path as any other errors, i.e. aborting the server. If the underlying filesystem/block layers constantly return -EAGAIN and we constantly get into the race condition, then the code will be spinning until the race condition is not satisfied and io_uring_wait_eqe() returns with -EAGAIN. |