Details
-
Bug
-
Status: Stalled (View Workflow)
-
Major
-
Resolution: Unresolved
-
10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
-
None
Description
Almost always on kvm-deb-bullseye-aarch64 (old bb) worker. tmpfs based directory.
Usually encryption.innodb* tests though mariabackup.huge_lsn too.
Notable non-aarch64 exception:
- amd64-sid - bb-10.6-andrei-MDEV-24660 433e66da4 - encryption.innochecksum - final commit bc12478a9a2ac77937c18f54d6c6563d5c1980f9 differs with removing mysql_bin_log.stop_background_thread().
Attachments
Issue Links
- relates to
-
MDBF-358 buildbot: raise memlock limits so uring is actually tested in containers
-
- Closed
-
-
MDEV-24883 add io_uring support for tpool
-
- Closed
-
-
MDEV-27593 Crashing on I/O error is unhelpful
-
- Closed
-
- links to
Activity
amd64-sid failure was a Debian kernel (seems bullseye too) 5.10.0-13-amd64, the latest crashing fault, (~1 month before failure (2 weeks ago)) on bz-bbw2 .
With memlock limits raised on bb.org, more bb workers like ppc64le-ubuntu-2204, ppc64le-rhel-9 are showing same test assertions, common aspect (though very early). All have Debian 5.10 kernels
Newer kernel version do exist in bb - like hz-bbw5 (5.18.0-0.deb11 / Debian 5.18.16-1~bpo11+1 (2022-08-12))
Got it on my local Debian 11 (amd64), coincidentally(?) next day after upgrade from 5.10.0-14 to 5.10.0-18. One-time occurrence so far, on MariaDB 10.11, uring, tmpfs.
I am not sure if this is related, but I though to mention it. mleich asked me to check a core dump of a server that had been hung for 15 minutes, doing crash recovery. The culprit seemed to be that 2 asynchronous page reads via io_uring never completed. I suppose that this was run on /dev/shm, like the majority of his tests.
The kernel is much older, though:
marko@pluto:~$ cat /proc/version
|
Linux version 5.4.0-94-generic (buildd@lcy02-amd64-053) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022
|
If I remember correctly, this is with a manually built liburing, because Ubuntu 20.04 does not include a package of it.
Here is the interesting part of the stack traces:
#7 0x0000556a3937b1ad in safe_cond_wait (cond=cond@entry=0x556a3a913890 <recv_sys+176>, mp=mp@entry=0x556a3a9137e0 <recv_sys>, file=file@entry=0x556a39c81920 "/data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc", line=line@entry=3479) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/mysys/thr_mutex.c:494
|
#8 0x0000556a38a7a731 in recv_sys_t::apply (this=0x556a3a9137e0 <recv_sys>, last_batch=last_batch@entry=true) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc:3479
|
#9 0x0000556a38d9f021 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/srv/srv0start.cc:1565
|
#10 0x0000556a388c9025 in innodb_init (p=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/handler/ha_innodb.cc:4157
|
This is waiting for recv_sys.cond and for recv_sys.pages to become empty. There were two pending page read requests:
(gdb) p recv_sys.pages
|
$8 = std::map with 2 elements = {[{m_id = 167503724544}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd98d2558, tail = 0x7f2bd994d868}}, [{m_id = 167503724547}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd9984220, tail = 0x7f2bd9a75600}}}
|
These reads were for tablespace 39 (./test/t2.ibd), pages 0 and 3. One thread was waiting in io_uring_wait_cqe(), and the buf_flush_page_cleaner() thread was idle, as expected.
danblack we have upgraded our 2 main ARM64 builder to the latest stable kernel 5.10.0-20-arm64 > 5.10.0-21-arm64.
Also as discussed with marko, we could probably also use backported kernels on that builders, branch 6.0 or 6.1.
What do you think?
I just reproduced this on my system (AMD64, not ARMv8):
Linux version 6.1.0-6-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.15-1 (2023-03-05)
|
I was running 48 concurrent instances of the test innodb.undo_truncate,4k on a 10.6 RelWithDebInfo build.
innodb.undo_truncate '2,4k,innodb' w9 [ 38 fail ] Found warnings/errors in server log file!
|
Test ended at 2023-03-21 14:02:02
|
line
|
2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
|
2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
|
Attempting backtrace. You can use the following information to find out
|
^ Found warnings in /dev/shm/10.6g/mysql-test/var/9/log/mysqld.1.err
|
Actually that fatal error was output before the server had been restarted, but mtr ignored those crashes and did not preserve any core dumps for them. The stack trace is due to a different error that was not listed in the summary above:
2023-03-21 14:02:01 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 6815744 outside the bounds of the file: .//undo001
|
10.6 535a38c4559160c7a46f0d9ece52495b83718d45 |
#7 0x000055f590502435 in fil_invalid_page_access_msg (is_read=<optimized out>, len=94513198740384, offset=6815744, name=<optimized out>) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2661
|
#8 fil_space_t::io (this=this@entry=0x55f591a333a0, type=@0x7ffc60d53080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::WRITE_ASYNC}, offset=<optimized out>, offset@entry=6815744,
|
len=len@entry=4096, buf=<optimized out>, bpage=bpage@entry=0x0) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2737
|
#9 0x000055f5904f50c4 in fil_space_t::flush_freed (this=this@entry=0x55f591a333a0, writable=writable@entry=true) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1036
|
#10 0x000055f5904f635b in buf_flush_space (id=1) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1187
|
#11 buf_do_flush_list_batch (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1449
|
#12 buf_flush_list_holding_mutex (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1547
|
#13 buf_flush_list_holding_mutex (max_n=<optimized out>, lsn=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1528
|
#14 0x000055f5904f867a in buf_flush_list (max_n=2000, lsn=lsn@entry=18446744073709551615) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1573
|
#15 0x000055f5904f890f in buf_flush_buffer_pool () at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:2465
|
#16 0x000055f5905aed2e in logs_empty_and_mark_files_at_shutdown () at /mariadb/10.6/storage/innobase/log/log0log.cc:1177
|
Update: it looks like the fatal error 125 may have been output from another thread while some debug information was being produced for the first fatal error. There was no actual stack trace output for either thread:
2023-03-21 14:01:59 0 [Note] /dev/shm/10.6g/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
|
2023-03-21 14:01:59 0 [Note] InnoDB: FTS optimize thread exiting.
|
2023-03-21 14:01:59 0 [Note] InnoDB: Starting shutdown...
|
2023-03-21 14:01:59 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/10.6g/mysql-test/var/9/mysqld.1/data/ib_buffer_pool
|
2023-03-21 14:01:59 0 [Note] InnoDB: Restricted to 1478 pages due to innodb_buf_pool_dump_pct=25
|
2023-03-21 14:01:59 0 [Note] InnoDB: Buffer pool(s) dump completed at 230321 14:01:59
|
2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
|
230321 14:01:59 [ERROR] mysqld got signal 6 ;
|
This could be because you hit a bug. It is also possible that this binary
|
or one of the libraries it was linked against is corrupt, improperly built,
|
or misconfigured. This error can also be caused by malfunctioning hardware.
|
|
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
|
|
We will try our best to scrape up some info that will hopefully help
|
diagnose the problem, but since we have already crashed,
|
something is definitely wrong and this may fail.
|
|
Server version: 10.6.13-MariaDB-log source revision: 142587071b22254a87b1378312b1cebaa8f3f11e
|
key_buffer_size=1048576
|
read_buffer_size=131072
|
2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
|
max_used_connections=3
|
max_threads=153
|
thread_count=0
|
It is possible that mysqld could use up to
|
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63667 K bytes of memory
|
Hope that's ok; if not, decrease some variables in the equation.
|
|
Thread pointer: 0x0
|
Attempting backtrace. You can use the following information to find out
|
where mysqld died. If you see no messages after this, something went
|
terribly wrong...
|
The source code revision is not being reported as 535a38c4559160c7a46f0d9ece52495b83718d45 because there were some unrelated code changes applied.
I do not know if it is related to this, but I also observed 2 shutdown hangs in a RelWithDebInfo build (on write_slots) when running several innodb.undo_truncate,4k in parallel:
10.6 535a38c4559160c7a46f0d9ece52495b83718d45 |
#9 io_slots::wait (this=0x5621f0b76a50) at /mariadb/10.6/storage/innobase/os/os0file.cc:117
|
#10 io_slots::~io_slots (this=0x5621f0b76a50, __in_chrg=<optimized out>) at /mariadb/10.6/storage/innobase/os/os0file.cc:132
|
#11 os_aio_free () at /mariadb/10.6/storage/innobase/os/os0file.cc:3660
|
#12 0x00005621ee609e2d in innodb_shutdown () at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1978
|
In one of the hangs that I checked, the buf_dblwr indicated no pending page writes. Log writes never are asynchronous (until MDEV-26603 is done).
Got email from Jens Axobe
>> I did see that one, and I think that's either specific to the old
>> 5.10 stable kernel (eg -21 won't have it), because it's cancelation
>> off thread exit.
>
> possible
>
>> Either that, or you're doing fork/exit with IO inflight.
>
> exit with IO inlight might be possible. fork much less likely.It's exit, from the trace that I saw. Easily reproducible with just
adding a delay to the worker handling.> I came in as https://jira.mariadb.org/browse/MDEV-29610 which I note
> Marko claims 6.1.15-1 and by the build date it might be the same era
(JIRA edit - my mistake - its a much later build)
> as -20, so since 6.1.16+ is out I assume its the same fix.
>
> thanks for the pointerYeah, if it's exit with IO pending, it'll be in any kernel. You should
not do that...
danblack, any news on this? I see that 125 is ECANCELED in /usr/include/asm-generic/errno.h.
mleich got this on 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux on an ext4 file system that was loopback mounted in a file that resided in /dev/shm. (Such indirection makes sense when using the older libaio; it does not support files residing directly on /dev/shm.)
Yes, "doing fork/exit with IO inflight" seems to apply here:
# 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] /data/Server_bin/bb-10.6-MDEV-31234_RelWithDebInfo/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
|
# 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] InnoDB: FTS optimize thread exiting.
|
# 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async lru write of 4096 bytes, for file 9, returned 0
|
danblack, do you have a patch that makes it easier to hit this bug?
In MDEV-27593 I changed the wording of the message and made it a non-fatal error. Actually, I accidentally made it Note instead of ERROR, so this should not cause any mtr test failures anymore:
diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
|
index 906ce79e8fd..99f212b6eeb 100644
|
--- a/storage/innobase/os/os0file.cc
|
+++ b/storage/innobase/os/os0file.cc
|
@@ -3459,8 +3459,11 @@ static void write_io_callback(void *c)
|
const IORequest &request= *static_cast<const IORequest*>
|
(static_cast<const void*>(cb->m_userdata));
|
|
+ if (!ut_rnd_interval(100))
|
+ cb->m_err= 42;
|
+
|
if (UNIV_UNLIKELY(cb->m_err != 0))
|
- ib::info () << "IO Error: " << cb->m_err
|
+ ib::error() << "IO Error: " << cb->m_err
|
<< "during write of "
|
<< cb->m_len << " bytes, for file "
|
<< request.node->name << "(" << cb->m_fh << "), returned " |
10.6 72928e640e62559f4d4b82e9ff1102bb72bd5828 with patch |
var/9/log/mysqld.1.err:2023-08-01 15:27:09 0 [Note] InnoDB: IO Error: 42during write of 16384 bytes, for file ./ibdata1(11), returned 16384
|
I got some logs of failures that occurred after MDEV-27593 was pushed into 10.6 (2003-08-01). I found only one failure that includes that fix:
bb-11.0-ycp-mdev-31996 d07012f4d84e57f1895f848b029fabd220b5d5be ppc64le-ubuntu-2204 |
encryption.innodb-checksum-algorithm '32k,ctr,innodb,strict_crc32' w8 [ fail ]
|
…
|
2023-10-06 3:37:06 4 [Note] InnoDB: Importing tablespace for table 'test/tp' that was exported from host 'c8473bd36d6b'
|
2023-10-06 3:37:06 4 [Note] InnoDB: Phase I - Update all pages
|
2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk
|
2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk - done!
|
2023-10-06 3:37:06 4 [Note] InnoDB: Phase III - Flush changes to disk
|
2023-10-06 3:37:06 4 [Note] InnoDB: Phase IV - Flush complete
|
2023-10-06 3:37:06 4 [Note] InnoDB: `test`.`tp` autoinc value set to 2
|
2023-10-06 3:37:07 0 [Note] InnoDB: IO Error: 125during write of 491520 bytes, for file ./ibdata1(11), returned 0
|
----------SERVER LOG END-------------
|
mysqltest failed but provided no output
|
The server apparently hung here. The reason could be unrelated to the io_uring error, such as MDEV-32588.
A Linux kernel bug that was behind MDEV-30728 is specific to the POWER ISA. Therefore, this issue (errors on io_uring write completion callback, observed on ARMv8 and AMD64) should be unaffected by that fix.
new bb kernel version on aarch64-debian-11 builder: 4.18.0-394.el8.aarch64 (that has no failures of the type of this MDEV. No Debug builds are being run which limits, but not eliminates the range of encryption.innodb test
kvm-deb-bullseye-aarch64 5.10.0-17-arm64 (5.10.136)
debian has 5.10.0-18 / 5.10.140-1
fs: Add missing umask strip in vfs_tmpfile
blk-mq: fix io hung due to missing commit_rqs
Can't tell if relent.
bb.org: e.g: https://ci.mariadb.org/28579/logs/aarch64-debian-11/mysqld.1.err.6
2022-09-22 22:26:45 0 [Warning] mariadbd: io_uring_queue_init() failed with ENOMEM: try larger memory locked limit, ulimit -l, or https://mariadb.com/kb/en/systemd/#configuring-limitmemlock under systemd
2022-09-22 22:26:45 0 [Warning] InnoDB: liburing disabled: falling back to innodb_use_native_aio=OFF