[MDEV-29610] uring tmpfs InnoDB: IO Error: 125 during async write of X bytes, for file Y, returned 0 - Jira

Daniel Black added a comment - 2022-09-23 05:10 - edited

new bb kernel version on aarch64-debian-11 builder: 4.18.0-394.el8.aarch64 (that has no failures of the type of this MDEV. No Debug builds are being run which limits, but not eliminates the range of encryption.innodb test

kvm-deb-bullseye-aarch64 5.10.0-17-arm64 (5.10.136)

debian has 5.10.0-18 / 5.10.140-1
fs: Add missing umask strip in vfs_tmpfile
blk-mq: fix io hung due to missing commit_rqs

Can't tell if relent.

bb.org: e.g: https://ci.mariadb.org/28579/logs/aarch64-debian-11/mysqld.1.err.6

2022-09-22 22:26:45 0 [Warning] mariadbd: io_uring_queue_init() failed with ENOMEM: try larger memory locked limit, ulimit -l, or https://mariadb.com/kb/en/systemd/#configuring-limitmemlock under systemd

2022-09-22 22:26:45 0 [Warning] InnoDB: liburing disabled: falling back to innodb_use_native_aio=OFF

Daniel Black added a comment - 2022-09-23 05:10 - edited new bb kernel version on aarch64-debian-11 builder: 4.18.0-394.el8.aarch64 (that has no failures of the type of this MDEV. No Debug builds are being run which limits, but not eliminates the range of encryption.innodb test kvm-deb-bullseye-aarch64 5.10.0-17-arm64 (5.10.136) debian has 5.10.0-18 / 5.10.140-1 fs: Add missing umask strip in vfs_tmpfile blk-mq: fix io hung due to missing commit_rqs Can't tell if relent. bb.org: e.g: https://ci.mariadb.org/28579/logs/aarch64-debian-11/mysqld.1.err.6 2022-09-22 22:26:45 0 [Warning] mariadbd: io_uring_queue_init() failed with ENOMEM: try larger memory locked limit, ulimit -l, or https://mariadb.com/kb/en/systemd/#configuring-limitmemlock under systemd 2022-09-22 22:26:45 0 [Warning] InnoDB: liburing disabled: falling back to innodb_use_native_aio=OFF

Daniel Black added a comment - 2022-09-23 07:05

amd64-sid failure was a Debian kernel (seems bullseye too) 5.10.0-13-amd64, the latest crashing fault, (~1 month before failure (2 weeks ago)) on bz-bbw2 .

Daniel Black added a comment - 2022-09-23 07:05 amd64-sid failure was a Debian kernel (seems bullseye too) 5.10.0-13-amd64, the latest crashing fault, (~1 month before failure (2 weeks ago)) on bz-bbw2 .

Daniel Black added a comment - 2022-09-23 21:59

With memlock limits raised on bb.org, more bb workers like ppc64le-ubuntu-2204, ppc64le-rhel-9 are showing same test assertions, common aspect (though very early). All have Debian 5.10 kernels

Daniel Black added a comment - 2022-09-23 21:59 With memlock limits raised on bb.org, more bb workers like ppc64le-ubuntu-2204 , ppc64le-rhel-9 are showing same test assertions, common aspect (though very early). All have Debian 5.10 kernels

Daniel Black added a comment - 2022-09-23 22:04

Newer kernel version do exist in bb - like hz-bbw5 (5.18.0-0.deb11 / Debian 5.18.16-1~bpo11+1 (2022-08-12))

Daniel Black added a comment - 2022-09-23 22:04 Newer kernel version do exist in bb - like hz-bbw5 (5.18.0-0.deb11 / Debian 5.18.16-1~bpo11+1 (2022-08-12))

Daniel Black added a comment - 2022-09-23 22:23

aarch64-fedora-36 also debian kernel 5.10

Daniel Black added a comment - 2022-09-23 22:23 aarch64-fedora-36 also debian kernel 5.10

Elena Stepanova added a comment - 2022-10-01 21:56 - edited

Got it on my local Debian 11 (amd64), coincidentally(?) next day after upgrade from 5.10.0-14 to 5.10.0-18. One-time occurrence so far, on MariaDB 10.11, uring, tmpfs.

Elena Stepanova added a comment - 2022-10-01 21:56 - edited Got it on my local Debian 11 (amd64), coincidentally(?) next day after upgrade from 5.10.0-14 to 5.10.0-18. One-time occurrence so far, on MariaDB 10.11, uring, tmpfs.

Marko Mäkelä added a comment - 2022-12-22 13:15

I am not sure if this is related, but I though to mention it. mleich asked me to check a core dump of a server that had been hung for 15 minutes, doing crash recovery. The culprit seemed to be that 2 asynchronous page reads via io_uring never completed. I suppose that this was run on /dev/shm, like the majority of his tests.

The kernel is much older, though:

marko@pluto:~$ cat /proc/version

Linux version 5.4.0-94-generic (buildd@lcy02-amd64-053) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022

If I remember correctly, this is with a manually built liburing, because Ubuntu 20.04 does not include a package of it.

Here is the interesting part of the stack traces:

#7  0x0000556a3937b1ad in safe_cond_wait (cond=cond@entry=0x556a3a913890 <recv_sys+176>, mp=mp@entry=0x556a3a9137e0 <recv_sys>, file=file@entry=0x556a39c81920 "/data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc", line=line@entry=3479) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/mysys/thr_mutex.c:494

#8  0x0000556a38a7a731 in recv_sys_t::apply (this=0x556a3a9137e0 <recv_sys>, last_batch=last_batch@entry=true) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc:3479

#9  0x0000556a38d9f021 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/srv/srv0start.cc:1565

#10 0x0000556a388c9025 in innodb_init (p=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/handler/ha_innodb.cc:4157

This is waiting for recv_sys.cond and for recv_sys.pages to become empty. There were two pending page read requests:

(gdb) p recv_sys.pages

$8 = std::map with 2 elements = {[{m_id = 167503724544}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd98d2558, tail = 0x7f2bd994d868}}, [{m_id = 167503724547}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd9984220, tail = 0x7f2bd9a75600}}}

These reads were for tablespace 39 (./test/t2.ibd), pages 0 and 3. One thread was waiting in io_uring_wait_cqe(), and the buf_flush_page_cleaner() thread was idle, as expected.

Marko Mäkelä added a comment - 2022-12-22 13:15 I am not sure if this is related, but I though to mention it. mleich asked me to check a core dump of a server that had been hung for 15 minutes, doing crash recovery. The culprit seemed to be that 2 asynchronous page reads via io_uring never completed. I suppose that this was run on /dev/shm , like the majority of his tests. The kernel is much older, though: marko@pluto:~$ cat /proc/version Linux version 5.4.0-94-generic (buildd@lcy02-amd64-053) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 If I remember correctly, this is with a manually built liburing , because Ubuntu 20.04 does not include a package of it. Here is the interesting part of the stack traces: #7 0x0000556a3937b1ad in safe_cond_wait (cond=cond@entry=0x556a3a913890 <recv_sys+176>, mp=mp@entry=0x556a3a9137e0 <recv_sys>, file=file@entry=0x556a39c81920 "/data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc", line=line@entry=3479) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/mysys/thr_mutex.c:494 #8 0x0000556a38a7a731 in recv_sys_t::apply (this=0x556a3a9137e0 <recv_sys>, last_batch=last_batch@entry=true) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc:3479 #9 0x0000556a38d9f021 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/srv/srv0start.cc:1565 #10 0x0000556a388c9025 in innodb_init (p=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/handler/ha_innodb.cc:4157 This is waiting for recv_sys.cond and for recv_sys.pages to become empty. There were two pending page read requests: (gdb) p recv_sys.pages $8 = std::map with 2 elements = {[{m_id = 167503724544}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd98d2558, tail = 0x7f2bd994d868}}, [{m_id = 167503724547}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd9984220, tail = 0x7f2bd9a75600}}} These reads were for tablespace 39 ( ./test/t2.ibd ), pages 0 and 3. One thread was waiting in io_uring_wait_cqe() , and the buf_flush_page_cleaner() thread was idle, as expected.

Daniel Black added a comment - 2023-03-09 22:01

Debian kernel bug.

Daniel Black added a comment - 2023-03-09 22:01 Debian kernel bug.

Faustin Lammler added a comment - 2023-03-15 10:20

danblack we have upgraded our 2 main ARM64 builder to the latest stable kernel 5.10.0-20-arm64 > 5.10.0-21-arm64.

Also as discussed with marko, we could probably also use backported kernels on that builders, branch 6.0 or 6.1.

What do you think?

Faustin Lammler added a comment - 2023-03-15 10:20 danblack we have upgraded our 2 main ARM64 builder to the latest stable kernel 5.10.0-20-arm64 > 5.10.0-21-arm64. Also as discussed with marko , we could probably also use backported kernels on that builders, branch 6.0 or 6.1. What do you think?

Daniel Black added a comment - 2023-03-15 15:16

Good option, may as well execute it.

Daniel Black added a comment - 2023-03-15 15:16 Good option, may as well execute it.

Marko Mäkelä added a comment - 2023-03-21 12:17

I just reproduced this on my system (AMD64, not ARMv8):

Linux version 6.1.0-6-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.15-1 (2023-03-05)

I was running 48 concurrent instances of the test innodb.undo_truncate,4k on a 10.6 RelWithDebInfo build.

innodb.undo_truncate '2,4k,innodb'       w9 [ 38 fail ]  Found warnings/errors in server log file!

        Test ended at 2023-03-21 14:02:02

line

2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0

2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0

Attempting backtrace. You can use the following information to find out

^ Found warnings in /dev/shm/10.6g/mysql-test/var/9/log/mysqld.1.err

Actually that fatal error was output before the server had been restarted, but mtr ignored those crashes and did not preserve any core dumps for them. The stack trace is due to a different error that was not listed in the summary above:

2023-03-21 14:02:01 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 6815744 outside the bounds of the file: .//undo001

10.6 535a38c4559160c7a46f0d9ece52495b83718d45
#7 0x000055f590502435 in fil_invalid_page_access_msg (is_read=<optimized out>, len=94513198740384, offset=6815744, name=<optimized out>) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2661
#8 fil_space_t::io (this=this@entry=0x55f591a333a0, type=@0x7ffc60d53080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::WRITE_ASYNC}, offset=<optimized out>, offset@entry=6815744,
len=len@entry=4096, buf=<optimized out>, bpage=bpage@entry=0x0) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2737
#9 0x000055f5904f50c4 in fil_space_t::flush_freed (this=this@entry=0x55f591a333a0, writable=writable@entry=true) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1036
#10 0x000055f5904f635b in buf_flush_space (id=1) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1187
#11 buf_do_flush_list_batch (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1449
#12 buf_flush_list_holding_mutex (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1547
#13 buf_flush_list_holding_mutex (max_n=<optimized out>, lsn=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1528
#14 0x000055f5904f867a in buf_flush_list (max_n=2000, lsn=lsn@entry=18446744073709551615) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1573
#15 0x000055f5904f890f in buf_flush_buffer_pool () at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:2465
#16 0x000055f5905aed2e in logs_empty_and_mark_files_at_shutdown () at /mariadb/10.6/storage/innobase/log/log0log.cc:1177

Marko Mäkelä added a comment - 2023-03-21 12:17 I just reproduced this on my system (AMD64, not ARMv8): Linux version 6.1.0-6-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.15-1 (2023-03-05) I was running 48 concurrent instances of the test innodb.undo_truncate,4k on a 10.6 RelWithDebInfo build. innodb.undo_truncate '2,4k,innodb' w9 [ 38 fail ] Found warnings/errors in server log file! Test ended at 2023-03-21 14:02:02 line 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 Attempting backtrace. You can use the following information to find out ^ Found warnings in /dev/shm/10.6g/mysql-test/var/9/log/mysqld.1.err Actually that fatal error was output before the server had been restarted, but mtr ignored those crashes and did not preserve any core dumps for them. The stack trace is due to a different error that was not listed in the summary above: 2023-03-21 14:02:01 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 6815744 outside the bounds of the file: .//undo001 10.6 535a38c4559160c7a46f0d9ece52495b83718d45 #7 0x000055f590502435 in fil_invalid_page_access_msg (is_read=<optimized out>, len=94513198740384, offset=6815744, name=<optimized out>) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2661 #8 fil_space_t::io (this=this@entry=0x55f591a333a0, type=@0x7ffc60d53080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::WRITE_ASYNC}, offset=<optimized out>, offset@entry=6815744, len=len@entry=4096, buf=<optimized out>, bpage=bpage@entry=0x0) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2737 #9 0x000055f5904f50c4 in fil_space_t::flush_freed (this=this@entry=0x55f591a333a0, writable=writable@entry=true) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1036 #10 0x000055f5904f635b in buf_flush_space (id=1) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1187 #11 buf_do_flush_list_batch (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1449 #12 buf_flush_list_holding_mutex (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1547 #13 buf_flush_list_holding_mutex (max_n=<optimized out>, lsn=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1528 #14 0x000055f5904f867a in buf_flush_list (max_n=2000, lsn=lsn@entry=18446744073709551615) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1573 #15 0x000055f5904f890f in buf_flush_buffer_pool () at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:2465 #16 0x000055f5905aed2e in logs_empty_and_mark_files_at_shutdown () at /mariadb/10.6/storage/innobase/log/log0log.cc:1177

Marko Mäkelä added a comment - 2023-03-21 12:23

Update: it looks like the fatal error 125 may have been output from another thread while some debug information was being produced for the first fatal error. There was no actual stack trace output for either thread:

2023-03-21 14:01:59 0 [Note] /dev/shm/10.6g/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown

2023-03-21 14:01:59 0 [Note] InnoDB: FTS optimize thread exiting.

2023-03-21 14:01:59 0 [Note] InnoDB: Starting shutdown...

2023-03-21 14:01:59 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/10.6g/mysql-test/var/9/mysqld.1/data/ib_buffer_pool

2023-03-21 14:01:59 0 [Note] InnoDB: Restricted to 1478 pages due to innodb_buf_pool_dump_pct=25

2023-03-21 14:01:59 0 [Note] InnoDB: Buffer pool(s) dump completed at 230321 14:01:59

2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0

230321 14:01:59 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.6.13-MariaDB-log source revision: 142587071b22254a87b1378312b1cebaa8f3f11e

key_buffer_size=1048576

read_buffer_size=131072

2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0

max_used_connections=3

max_threads=153

thread_count=0

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63667 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

The source code revision is not being reported as 535a38c4559160c7a46f0d9ece52495b83718d45 because there were some unrelated code changes applied.

Marko Mäkelä added a comment - 2023-03-21 12:23 Update: it looks like the fatal error 125 may have been output from another thread while some debug information was being produced for the first fatal error. There was no actual stack trace output for either thread: 2023-03-21 14:01:59 0 [Note] /dev/shm/10.6g/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown 2023-03-21 14:01:59 0 [Note] InnoDB: FTS optimize thread exiting. 2023-03-21 14:01:59 0 [Note] InnoDB: Starting shutdown... 2023-03-21 14:01:59 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/10.6g/mysql-test/var/9/mysqld.1/data/ib_buffer_pool 2023-03-21 14:01:59 0 [Note] InnoDB: Restricted to 1478 pages due to innodb_buf_pool_dump_pct=25 2023-03-21 14:01:59 0 [Note] InnoDB: Buffer pool(s) dump completed at 230321 14:01:59 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 230321 14:01:59 [ERROR] mysqld got signal 6 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware. To report this bug, see https://mariadb.com/kb/en/reporting-bugs We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail. Server version: 10.6.13-MariaDB-log source revision: 142587071b22254a87b1378312b1cebaa8f3f11e key_buffer_size=1048576 read_buffer_size=131072 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 max_used_connections=3 max_threads=153 thread_count=0 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63667 K bytes of memory Hope that's ok; if not, decrease some variables in the equation. Thread pointer: 0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... The source code revision is not being reported as 535a38c4559160c7a46f0d9ece52495b83718d45 because there were some unrelated code changes applied.

Marko Mäkelä added a comment - 2023-03-21 13:50

I do not know if it is related to this, but I also observed 2 shutdown hangs in a RelWithDebInfo build (on write_slots) when running several innodb.undo_truncate,4k in parallel:

10.6 535a38c4559160c7a46f0d9ece52495b83718d45
#9 io_slots::wait (this=0x5621f0b76a50) at /mariadb/10.6/storage/innobase/os/os0file.cc:117
#10 io_slots::~io_slots (this=0x5621f0b76a50, __in_chrg=<optimized out>) at /mariadb/10.6/storage/innobase/os/os0file.cc:132
#11 os_aio_free () at /mariadb/10.6/storage/innobase/os/os0file.cc:3660
#12 0x00005621ee609e2d in innodb_shutdown () at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1978

In one of the hangs that I checked, the buf_dblwr indicated no pending page writes. Log writes never are asynchronous (until ~~MDEV-26603~~ is done).

Marko Mäkelä added a comment - 2023-03-21 13:50 I do not know if it is related to this, but I also observed 2 shutdown hangs in a RelWithDebInfo build (on write_slots ) when running several innodb.undo_truncate,4k in parallel: 10.6 535a38c4559160c7a46f0d9ece52495b83718d45 #9 io_slots::wait (this=0x5621f0b76a50) at /mariadb/10.6/storage/innobase/os/os0file.cc:117 #10 io_slots::~io_slots (this=0x5621f0b76a50, __in_chrg=<optimized out>) at /mariadb/10.6/storage/innobase/os/os0file.cc:132 #11 os_aio_free () at /mariadb/10.6/storage/innobase/os/os0file.cc:3660 #12 0x00005621ee609e2d in innodb_shutdown () at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1978 In one of the hangs that I checked, the buf_dblwr indicated no pending page writes. Log writes never are asynchronous (until MDEV-26603 is done).

Daniel Black added a comment - 2023-03-23 00:56

Got email from Jens Axobe

>> I did see that one, and I think that's either specific to the old
>> 5.10 stable kernel (eg -21 won't have it), because it's cancelation
>> off thread exit.
>
> possible
>
>> Either that, or you're doing fork/exit with IO inflight.
>
> exit with IO inlight might be possible. fork much less likely.

It's exit, from the trace that I saw. Easily reproducible with just
adding a delay to the worker handling.

> I came in as https://jira.mariadb.org/browse/MDEV-29610 which I note
> Marko claims 6.1.15-1 and by the build date it might be the same era
(JIRA edit - my mistake - its a much later build)
> as -20, so since 6.1.16+ is out I assume its the same fix.
>
> thanks for the pointer

Yeah, if it's exit with IO pending, it'll be in any kernel. You should
not do that...

Daniel Black added a comment - 2023-03-23 00:56 Got email from Jens Axobe >> I did see that one, and I think that's either specific to the old >> 5.10 stable kernel (eg -21 won't have it), because it's cancelation >> off thread exit. > > possible > >> Either that, or you're doing fork/exit with IO inflight. > > exit with IO inlight might be possible. fork much less likely. It's exit, from the trace that I saw. Easily reproducible with just adding a delay to the worker handling. > I came in as https://jira.mariadb.org/browse/MDEV-29610 which I note > Marko claims 6.1.15-1 and by the build date it might be the same era (JIRA edit - my mistake - its a much later build) > as -20, so since 6.1.16+ is out I assume its the same fix. > > thanks for the pointer Yeah, if it's exit with IO pending, it'll be in any kernel. You should not do that...

Marko Mäkelä added a comment - 2023-05-19 11:13

danblack, any news on this? I see that 125 is ECANCELED in /usr/include/asm-generic/errno.h.

mleich got this on 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux on an ext4 file system that was loopback mounted in a file that resided in /dev/shm. (Such indirection makes sense when using the older libaio; it does not support files residing directly on /dev/shm.)

Marko Mäkelä added a comment - 2023-05-19 11:13 danblack , any news on this? I see that 125 is ECANCELED in /usr/include/asm-generic/errno.h . mleich got this on 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux on an ext4 file system that was loopback mounted in a file that resided in /dev/shm . (Such indirection makes sense when using the older libaio ; it does not support files residing directly on /dev/shm .)

Marko Mäkelä added a comment - 2023-05-19 11:16

Yes, "doing fork/exit with IO inflight" seems to apply here:

# 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] /data/Server_bin/bb-10.6-MDEV-31234_RelWithDebInfo/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown

# 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] InnoDB: FTS optimize thread exiting.

# 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async lru write of 4096 bytes, for file 9, returned 0

danblack, do you have a patch that makes it easier to hit this bug?

Marko Mäkelä added a comment - 2023-05-19 11:16 Yes, "doing fork/exit with IO inflight" seems to apply here: # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] /data/Server_bin/bb-10.6-MDEV-31234_RelWithDebInfo/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] InnoDB: FTS optimize thread exiting. # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async lru write of 4096 bytes, for file 9, returned 0 danblack , do you have a patch that makes it easier to hit this bug?

Marko Mäkelä added a comment - 2023-08-01 12:29

In ~~MDEV-27593~~ I changed the wording of the message and made it a non-fatal error. Actually, I accidentally made it Note instead of ERROR, so this should not cause any mtr test failures anymore:

diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc

index 906ce79e8fd..99f212b6eeb 100644

--- a/storage/innobase/os/os0file.cc

+++ b/storage/innobase/os/os0file.cc

@@ -3459,8 +3459,11 @@ static void write_io_callback(void *c)

   const IORequest &request= *static_cast<const IORequest*>

     (static_cast<const void*>(cb->m_userdata));

+  if (!ut_rnd_interval(100))

+    cb->m_err= 42;

   if (UNIV_UNLIKELY(cb->m_err != 0))

-    ib::info () << "IO Error: " << cb->m_err

+    ib::error() << "IO Error: " << cb->m_err

                 << "during write of "

                 << cb->m_len << " bytes, for file "

                 << request.node->name << "(" << cb->m_fh << "), returned "

10.6 72928e640e62559f4d4b82e9ff1102bb72bd5828 with patch
var/9/log/mysqld.1.err:2023-08-01 15:27:09 0 [Note] InnoDB: IO Error: 42during write of 16384 bytes, for file ./ibdata1(11), returned 16384

Marko Mäkelä added a comment - 2023-08-01 12:29 In MDEV-27593 I changed the wording of the message and made it a non-fatal error. Actually, I accidentally made it Note instead of ERROR , so this should not cause any mtr test failures anymore: diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc index 906ce79e8fd..99f212b6eeb 100644 --- a/storage/innobase/os/os0file.cc +++ b/storage/innobase/os/os0file.cc @@ -3459,8 +3459,11 @@ static void write_io_callback(void *c) const IORequest &request= *static_cast<const IORequest*> (static_cast<const void*>(cb->m_userdata)); + if (!ut_rnd_interval(100)) + cb->m_err= 42; + if (UNIV_UNLIKELY(cb->m_err != 0)) - ib::info () << "IO Error: " << cb->m_err + ib::error() << "IO Error: " << cb->m_err << "during write of " << cb->m_len << " bytes, for file " << request.node->name << "(" << cb->m_fh << "), returned " 10.6 72928e640e62559f4d4b82e9ff1102bb72bd5828 with patch var/9/log/mysqld.1.err:2023-08-01 15:27:09 0 [Note] InnoDB: IO Error: 42during write of 16384 bytes, for file ./ibdata1(11), returned 16384

Marko Mäkelä added a comment - 2023-10-27 05:09

I got some logs of failures that occurred after ~~MDEV-27593~~ was pushed into 10.6 (2003-08-01). I found only one failure that includes that fix:

bb-11.0-ycp-mdev-31996 d07012f4d84e57f1895f848b029fabd220b5d5be ppc64le-ubuntu-2204
encryption.innodb-checksum-algorithm '32k,ctr,innodb,strict_crc32' w8 [ fail ]
…
2023-10-06 3:37:06 4 [Note] InnoDB: Importing tablespace for table 'test/tp' that was exported from host 'c8473bd36d6b'
2023-10-06 3:37:06 4 [Note] InnoDB: Phase I - Update all pages
2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk
2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk - done!
2023-10-06 3:37:06 4 [Note] InnoDB: Phase III - Flush changes to disk
2023-10-06 3:37:06 4 [Note] InnoDB: Phase IV - Flush complete
2023-10-06 3:37:06 4 [Note] InnoDB: `test`.`tp` autoinc value set to 2
2023-10-06 3:37:07 0 [Note] InnoDB: IO Error: 125during write of 491520 bytes, for file ./ibdata1(11), returned 0
----------SERVER LOG END-------------
mysqltest failed but provided no output

The server apparently hung here. The reason could be unrelated to the io_uring error, such as ~~MDEV-32588~~.

Marko Mäkelä added a comment - 2023-10-27 05:09 I got some logs of failures that occurred after MDEV-27593 was pushed into 10.6 (2003-08-01). I found only one failure that includes that fix: bb-11.0-ycp-mdev-31996 d07012f4d84e57f1895f848b029fabd220b5d5be ppc64le-ubuntu-2204 encryption.innodb-checksum-algorithm '32k,ctr,innodb,strict_crc32' w8 [ fail ] … 2023-10-06 3:37:06 4 [Note] InnoDB: Importing tablespace for table 'test/tp' that was exported from host 'c8473bd36d6b' 2023-10-06 3:37:06 4 [Note] InnoDB: Phase I - Update all pages 2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk 2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk - done! 2023-10-06 3:37:06 4 [Note] InnoDB: Phase III - Flush changes to disk 2023-10-06 3:37:06 4 [Note] InnoDB: Phase IV - Flush complete 2023-10-06 3:37:06 4 [Note] InnoDB: `test`.`tp` autoinc value set to 2 2023-10-06 3:37:07 0 [Note] InnoDB: IO Error: 125during write of 491520 bytes, for file ./ibdata1(11), returned 0 ----------SERVER LOG END------------- mysqltest failed but provided no output The server apparently hung here. The reason could be unrelated to the io_uring error, such as MDEV-32588 .

Marko Mäkelä added a comment - 2023-12-04 06:48

A Linux kernel bug that was behind ~~MDEV-30728~~ is specific to the POWER ISA. Therefore, this issue (errors on io_uring write completion callback, observed on ARMv8 and AMD64) should be unaffected by that fix.

Marko Mäkelä added a comment - 2023-12-04 06:48 A Linux kernel bug that was behind MDEV-30728 is specific to the POWER ISA. Therefore, this issue (errors on io_uring write completion callback, observed on ARMv8 and AMD64) should be unaffected by that fix.

MariaDB Server

uring tmpfs InnoDB: IO Error: 125 during async write of X bytes, for file Y, returned 0

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration