Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-29610

uring tmpfs InnoDB: IO Error: 125 during async write of X bytes, for file Y, returned 0

Details

    • Bug
    • Status: Stalled (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
    • 10.6, 10.11
    • None

    Description

      Almost always on kvm-deb-bullseye-aarch64 (old bb) worker. tmpfs based directory.

      Usually encryption.innodb* tests though mariabackup.huge_lsn too.

      Notable non-aarch64 exception:

      Attachments

        Issue Links

          Activity

            danblack Daniel Black created issue -
            danblack Daniel Black added a comment - - edited

            new bb kernel version on aarch64-debian-11 builder: 4.18.0-394.el8.aarch64 (that has no failures of the type of this MDEV. No Debug builds are being run which limits, but not eliminates the range of encryption.innodb test

            kvm-deb-bullseye-aarch64 5.10.0-17-arm64 (5.10.136)

            debian has 5.10.0-18 / 5.10.140-1
            fs: Add missing umask strip in vfs_tmpfile
            blk-mq: fix io hung due to missing commit_rqs

            Can't tell if relent.

            bb.org: e.g: https://ci.mariadb.org/28579/logs/aarch64-debian-11/mysqld.1.err.6

            2022-09-22 22:26:45 0 [Warning] mariadbd: io_uring_queue_init() failed with ENOMEM: try larger memory locked limit, ulimit -l, or https://mariadb.com/kb/en/systemd/#configuring-limitmemlock under systemd
            2022-09-22 22:26:45 0 [Warning] InnoDB: liburing disabled: falling back to innodb_use_native_aio=OFF
            

            danblack Daniel Black added a comment - - edited new bb kernel version on aarch64-debian-11 builder: 4.18.0-394.el8.aarch64 (that has no failures of the type of this MDEV. No Debug builds are being run which limits, but not eliminates the range of encryption.innodb test kvm-deb-bullseye-aarch64 5.10.0-17-arm64 (5.10.136) debian has 5.10.0-18 / 5.10.140-1 fs: Add missing umask strip in vfs_tmpfile blk-mq: fix io hung due to missing commit_rqs Can't tell if relent. bb.org: e.g: https://ci.mariadb.org/28579/logs/aarch64-debian-11/mysqld.1.err.6 2022-09-22 22:26:45 0 [Warning] mariadbd: io_uring_queue_init() failed with ENOMEM: try larger memory locked limit, ulimit -l, or https://mariadb.com/kb/en/systemd/#configuring-limitmemlock under systemd 2022-09-22 22:26:45 0 [Warning] InnoDB: liburing disabled: falling back to innodb_use_native_aio=OFF
            danblack Daniel Black added a comment -

            amd64-sid failure was a Debian kernel (seems bullseye too) 5.10.0-13-amd64, the latest crashing fault, (~1 month before failure (2 weeks ago)) on bz-bbw2 .

            danblack Daniel Black added a comment - amd64-sid failure was a Debian kernel (seems bullseye too) 5.10.0-13-amd64, the latest crashing fault, (~1 month before failure (2 weeks ago)) on bz-bbw2 .
            danblack Daniel Black made changes -
            Field Original Value New Value
            danblack Daniel Black added a comment -

            With memlock limits raised on bb.org, more bb workers like ppc64le-ubuntu-2204, ppc64le-rhel-9 are showing same test assertions, common aspect (though very early). All have Debian 5.10 kernels

            danblack Daniel Black added a comment - With memlock limits raised on bb.org, more bb workers like ppc64le-ubuntu-2204 , ppc64le-rhel-9 are showing same test assertions, common aspect (though very early). All have Debian 5.10 kernels
            danblack Daniel Black added a comment -

            Newer kernel version do exist in bb - like hz-bbw5 (5.18.0-0.deb11 / Debian 5.18.16-1~bpo11+1 (2022-08-12))

            danblack Daniel Black added a comment - Newer kernel version do exist in bb - like hz-bbw5 (5.18.0-0.deb11 / Debian 5.18.16-1~bpo11+1 (2022-08-12))
            danblack Daniel Black added a comment - aarch64-fedora-36 also debian kernel 5.10
            danblack Daniel Black made changes -
            elenst Elena Stepanova added a comment - - edited

            Got it on my local Debian 11 (amd64), coincidentally(?) next day after upgrade from 5.10.0-14 to 5.10.0-18. One-time occurrence so far, on MariaDB 10.11, uring, tmpfs.

            elenst Elena Stepanova added a comment - - edited Got it on my local Debian 11 (amd64), coincidentally(?) next day after upgrade from 5.10.0-14 to 5.10.0-18. One-time occurrence so far, on MariaDB 10.11, uring, tmpfs.

            I am not sure if this is related, but I though to mention it. mleich asked me to check a core dump of a server that had been hung for 15 minutes, doing crash recovery. The culprit seemed to be that 2 asynchronous page reads via io_uring never completed. I suppose that this was run on /dev/shm, like the majority of his tests.

            The kernel is much older, though:

            marko@pluto:~$ cat /proc/version
            Linux version 5.4.0-94-generic (buildd@lcy02-amd64-053) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022
            

            If I remember correctly, this is with a manually built liburing, because Ubuntu 20.04 does not include a package of it.

            Here is the interesting part of the stack traces:

            #7  0x0000556a3937b1ad in safe_cond_wait (cond=cond@entry=0x556a3a913890 <recv_sys+176>, mp=mp@entry=0x556a3a9137e0 <recv_sys>, file=file@entry=0x556a39c81920 "/data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc", line=line@entry=3479) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/mysys/thr_mutex.c:494
            #8  0x0000556a38a7a731 in recv_sys_t::apply (this=0x556a3a9137e0 <recv_sys>, last_batch=last_batch@entry=true) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc:3479
            #9  0x0000556a38d9f021 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/srv/srv0start.cc:1565
            #10 0x0000556a388c9025 in innodb_init (p=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/handler/ha_innodb.cc:4157
            

            This is waiting for recv_sys.cond and for recv_sys.pages to become empty. There were two pending page read requests:

            (gdb) p recv_sys.pages
            $8 = std::map with 2 elements = {[{m_id = 167503724544}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd98d2558, tail = 0x7f2bd994d868}}, [{m_id = 167503724547}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd9984220, tail = 0x7f2bd9a75600}}}
            

            These reads were for tablespace 39 (./test/t2.ibd), pages 0 and 3. One thread was waiting in io_uring_wait_cqe(), and the buf_flush_page_cleaner() thread was idle, as expected.

            marko Marko Mäkelä added a comment - I am not sure if this is related, but I though to mention it. mleich asked me to check a core dump of a server that had been hung for 15 minutes, doing crash recovery. The culprit seemed to be that 2 asynchronous page reads via io_uring never completed. I suppose that this was run on /dev/shm , like the majority of his tests. The kernel is much older, though: marko@pluto:~$ cat /proc/version Linux version 5.4.0-94-generic (buildd@lcy02-amd64-053) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 If I remember correctly, this is with a manually built liburing , because Ubuntu 20.04 does not include a package of it. Here is the interesting part of the stack traces: #7 0x0000556a3937b1ad in safe_cond_wait (cond=cond@entry=0x556a3a913890 <recv_sys+176>, mp=mp@entry=0x556a3a9137e0 <recv_sys>, file=file@entry=0x556a39c81920 "/data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc", line=line@entry=3479) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/mysys/thr_mutex.c:494 #8 0x0000556a38a7a731 in recv_sys_t::apply (this=0x556a3a9137e0 <recv_sys>, last_batch=last_batch@entry=true) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/log/log0recv.cc:3479 #9 0x0000556a38d9f021 in srv_start (create_new_db=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/srv/srv0start.cc:1565 #10 0x0000556a388c9025 in innodb_init (p=<optimized out>) at /data/Server/bb-10.11-MDEV-30216-MDEV-26790/storage/innobase/handler/ha_innodb.cc:4157 This is waiting for recv_sys.cond and for recv_sys.pages to become empty. There were two pending page read requests: (gdb) p recv_sys.pages $8 = std::map with 2 elements = {[{m_id = 167503724544}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd98d2558, tail = 0x7f2bd994d868}}, [{m_id = 167503724547}] = {state = page_recv_t::RECV_BEING_READ, last_offset = 1, log = {head = 0x7f2bd9984220, tail = 0x7f2bd9a75600}}} These reads were for tablespace 39 ( ./test/t2.ibd ), pages 0 and 3. One thread was waiting in io_uring_wait_cqe() , and the buf_flush_page_cleaner() thread was idle, as expected.
            danblack Daniel Black added a comment -

            Debian kernel bug.

            danblack Daniel Black added a comment - Debian kernel bug.
            danblack Daniel Black made changes -
            Component/s N/A [ 14411 ]
            Component/s Storage Engine - InnoDB [ 10129 ]
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Not a Bug [ 6 ]
            Status Open [ 1 ] Closed [ 6 ]

            danblack we have upgraded our 2 main ARM64 builder to the latest stable kernel 5.10.0-20-arm64 > 5.10.0-21-arm64.

            Also as discussed with marko, we could probably also use backported kernels on that builders, branch 6.0 or 6.1.

            What do you think?

            faust Faustin Lammler added a comment - danblack we have upgraded our 2 main ARM64 builder to the latest stable kernel 5.10.0-20-arm64 > 5.10.0-21-arm64. Also as discussed with marko , we could probably also use backported kernels on that builders, branch 6.0 or 6.1. What do you think?
            danblack Daniel Black added a comment -

            Good option, may as well execute it.

            danblack Daniel Black added a comment - Good option, may as well execute it.

            I just reproduced this on my system (AMD64, not ARMv8):

            Linux version 6.1.0-6-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.15-1 (2023-03-05)
            

            I was running 48 concurrent instances of the test innodb.undo_truncate,4k on a 10.6 RelWithDebInfo build.

            innodb.undo_truncate '2,4k,innodb'       w9 [ 38 fail ]  Found warnings/errors in server log file!
                    Test ended at 2023-03-21 14:02:02
            line
            2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
            2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
            Attempting backtrace. You can use the following information to find out
            ^ Found warnings in /dev/shm/10.6g/mysql-test/var/9/log/mysqld.1.err
            

            Actually that fatal error was output before the server had been restarted, but mtr ignored those crashes and did not preserve any core dumps for them. The stack trace is due to a different error that was not listed in the summary above:

            2023-03-21 14:02:01 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 6815744 outside the bounds of the file: .//undo001
            

            10.6 535a38c4559160c7a46f0d9ece52495b83718d45

            #7  0x000055f590502435 in fil_invalid_page_access_msg (is_read=<optimized out>, len=94513198740384, offset=6815744, name=<optimized out>) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2661
            #8  fil_space_t::io (this=this@entry=0x55f591a333a0, type=@0x7ffc60d53080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::WRITE_ASYNC}, offset=<optimized out>, offset@entry=6815744, 
                len=len@entry=4096, buf=<optimized out>, bpage=bpage@entry=0x0) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2737
            #9  0x000055f5904f50c4 in fil_space_t::flush_freed (this=this@entry=0x55f591a333a0, writable=writable@entry=true) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1036
            #10 0x000055f5904f635b in buf_flush_space (id=1) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1187
            #11 buf_do_flush_list_batch (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1449
            #12 buf_flush_list_holding_mutex (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1547
            #13 buf_flush_list_holding_mutex (max_n=<optimized out>, lsn=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1528
            #14 0x000055f5904f867a in buf_flush_list (max_n=2000, lsn=lsn@entry=18446744073709551615) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1573
            #15 0x000055f5904f890f in buf_flush_buffer_pool () at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:2465
            #16 0x000055f5905aed2e in logs_empty_and_mark_files_at_shutdown () at /mariadb/10.6/storage/innobase/log/log0log.cc:1177
            

            marko Marko Mäkelä added a comment - I just reproduced this on my system (AMD64, not ARMv8): Linux version 6.1.0-6-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.15-1 (2023-03-05) I was running 48 concurrent instances of the test innodb.undo_truncate,4k on a 10.6 RelWithDebInfo build. innodb.undo_truncate '2,4k,innodb' w9 [ 38 fail ] Found warnings/errors in server log file! Test ended at 2023-03-21 14:02:02 line 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 Attempting backtrace. You can use the following information to find out ^ Found warnings in /dev/shm/10.6g/mysql-test/var/9/log/mysqld.1.err Actually that fatal error was output before the server had been restarted, but mtr ignored those crashes and did not preserve any core dumps for them. The stack trace is due to a different error that was not listed in the summary above: 2023-03-21 14:02:01 0 [ERROR] [FATAL] InnoDB: Trying to write 4096 bytes at 6815744 outside the bounds of the file: .//undo001 10.6 535a38c4559160c7a46f0d9ece52495b83718d45 #7 0x000055f590502435 in fil_invalid_page_access_msg (is_read=<optimized out>, len=94513198740384, offset=6815744, name=<optimized out>) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2661 #8 fil_space_t::io (this=this@entry=0x55f591a333a0, type=@0x7ffc60d53080: {bpage = 0x0, slot = 0x0, node = 0x0, type = IORequest::WRITE_ASYNC}, offset=<optimized out>, offset@entry=6815744, len=len@entry=4096, buf=<optimized out>, bpage=bpage@entry=0x0) at /mariadb/10.6/storage/innobase/fil/fil0fil.cc:2737 #9 0x000055f5904f50c4 in fil_space_t::flush_freed (this=this@entry=0x55f591a333a0, writable=writable@entry=true) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1036 #10 0x000055f5904f635b in buf_flush_space (id=1) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1187 #11 buf_do_flush_list_batch (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1449 #12 buf_flush_list_holding_mutex (lsn=<optimized out>, max_n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1547 #13 buf_flush_list_holding_mutex (max_n=<optimized out>, lsn=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1528 #14 0x000055f5904f867a in buf_flush_list (max_n=2000, lsn=lsn@entry=18446744073709551615) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1573 #15 0x000055f5904f890f in buf_flush_buffer_pool () at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:2465 #16 0x000055f5905aed2e in logs_empty_and_mark_files_at_shutdown () at /mariadb/10.6/storage/innobase/log/log0log.cc:1177

            Update: it looks like the fatal error 125 may have been output from another thread while some debug information was being produced for the first fatal error. There was no actual stack trace output for either thread:

            2023-03-21 14:01:59 0 [Note] /dev/shm/10.6g/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2023-03-21 14:01:59 0 [Note] InnoDB: FTS optimize thread exiting.
            2023-03-21 14:01:59 0 [Note] InnoDB: Starting shutdown...
            2023-03-21 14:01:59 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/10.6g/mysql-test/var/9/mysqld.1/data/ib_buffer_pool
            2023-03-21 14:01:59 0 [Note] InnoDB: Restricted to 1478 pages due to innodb_buf_pool_dump_pct=25
            2023-03-21 14:01:59 0 [Note] InnoDB: Buffer pool(s) dump completed at 230321 14:01:59
            2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
            230321 14:01:59 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed, 
            something is definitely wrong and this may fail.
             
            Server version: 10.6.13-MariaDB-log source revision: 142587071b22254a87b1378312b1cebaa8f3f11e
            key_buffer_size=1048576
            read_buffer_size=131072
            2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0
            max_used_connections=3
            max_threads=153
            thread_count=0
            It is possible that mysqld could use up to 
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63667 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            

            The source code revision is not being reported as 535a38c4559160c7a46f0d9ece52495b83718d45 because there were some unrelated code changes applied.

            marko Marko Mäkelä added a comment - Update: it looks like the fatal error 125 may have been output from another thread while some debug information was being produced for the first fatal error. There was no actual stack trace output for either thread: 2023-03-21 14:01:59 0 [Note] /dev/shm/10.6g/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown 2023-03-21 14:01:59 0 [Note] InnoDB: FTS optimize thread exiting. 2023-03-21 14:01:59 0 [Note] InnoDB: Starting shutdown... 2023-03-21 14:01:59 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/10.6g/mysql-test/var/9/mysqld.1/data/ib_buffer_pool 2023-03-21 14:01:59 0 [Note] InnoDB: Restricted to 1478 pages due to innodb_buf_pool_dump_pct=25 2023-03-21 14:01:59 0 [Note] InnoDB: Buffer pool(s) dump completed at 230321 14:01:59 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 230321 14:01:59 [ERROR] mysqld got signal 6 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware.   To report this bug, see https://mariadb.com/kb/en/reporting-bugs   We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail.   Server version: 10.6.13-MariaDB-log source revision: 142587071b22254a87b1378312b1cebaa8f3f11e key_buffer_size=1048576 read_buffer_size=131072 2023-03-21 14:01:59 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async write of 4096 bytes, for file 76, returned 0 max_used_connections=3 max_threads=153 thread_count=0 It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63667 K bytes of memory Hope that's ok; if not, decrease some variables in the equation.   Thread pointer: 0x0 Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong... The source code revision is not being reported as 535a38c4559160c7a46f0d9ece52495b83718d45 because there were some unrelated code changes applied.

            I do not know if it is related to this, but I also observed 2 shutdown hangs in a RelWithDebInfo build (on write_slots) when running several innodb.undo_truncate,4k in parallel:

            10.6 535a38c4559160c7a46f0d9ece52495b83718d45

            #9  io_slots::wait (this=0x5621f0b76a50) at /mariadb/10.6/storage/innobase/os/os0file.cc:117
            #10 io_slots::~io_slots (this=0x5621f0b76a50, __in_chrg=<optimized out>) at /mariadb/10.6/storage/innobase/os/os0file.cc:132
            #11 os_aio_free () at /mariadb/10.6/storage/innobase/os/os0file.cc:3660
            #12 0x00005621ee609e2d in innodb_shutdown () at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1978
            

            In one of the hangs that I checked, the buf_dblwr indicated no pending page writes. Log writes never are asynchronous (until MDEV-26603 is done).

            marko Marko Mäkelä added a comment - I do not know if it is related to this, but I also observed 2 shutdown hangs in a RelWithDebInfo build (on write_slots ) when running several innodb.undo_truncate,4k in parallel: 10.6 535a38c4559160c7a46f0d9ece52495b83718d45 #9 io_slots::wait (this=0x5621f0b76a50) at /mariadb/10.6/storage/innobase/os/os0file.cc:117 #10 io_slots::~io_slots (this=0x5621f0b76a50, __in_chrg=<optimized out>) at /mariadb/10.6/storage/innobase/os/os0file.cc:132 #11 os_aio_free () at /mariadb/10.6/storage/innobase/os/os0file.cc:3660 #12 0x00005621ee609e2d in innodb_shutdown () at /mariadb/10.6/storage/innobase/srv/srv0start.cc:1978 In one of the hangs that I checked, the buf_dblwr indicated no pending page writes. Log writes never are asynchronous (until MDEV-26603 is done).
            danblack Daniel Black added a comment -

            Got email from Jens Axobe

            >> I did see that one, and I think that's either specific to the old
            >> 5.10 stable kernel (eg -21 won't have it), because it's cancelation
            >> off thread exit.
            >
            > possible
            >
            >> Either that, or you're doing fork/exit with IO inflight.
            >
            > exit with IO inlight might be possible. fork much less likely.

            It's exit, from the trace that I saw. Easily reproducible with just
            adding a delay to the worker handling.

            > I came in as https://jira.mariadb.org/browse/MDEV-29610 which I note
            > Marko claims 6.1.15-1 and by the build date it might be the same era
            (JIRA edit - my mistake - its a much later build)
            > as -20, so since 6.1.16+ is out I assume its the same fix.
            >
            > thanks for the pointer

            Yeah, if it's exit with IO pending, it'll be in any kernel. You should
            not do that...

            danblack Daniel Black added a comment - Got email from Jens Axobe >> I did see that one, and I think that's either specific to the old >> 5.10 stable kernel (eg -21 won't have it), because it's cancelation >> off thread exit. > > possible > >> Either that, or you're doing fork/exit with IO inflight. > > exit with IO inlight might be possible. fork much less likely. It's exit, from the trace that I saw. Easily reproducible with just adding a delay to the worker handling. > I came in as https://jira.mariadb.org/browse/MDEV-29610 which I note > Marko claims 6.1.15-1 and by the build date it might be the same era (JIRA edit - my mistake - its a much later build) > as -20, so since 6.1.16+ is out I assume its the same fix. > > thanks for the pointer Yeah, if it's exit with IO pending, it'll be in any kernel. You should not do that...

            danblack, any news on this? I see that 125 is ECANCELED in /usr/include/asm-generic/errno.h.

            mleich got this on 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux on an ext4 file system that was loopback mounted in a file that resided in /dev/shm. (Such indirection makes sense when using the older libaio; it does not support files residing directly on /dev/shm.)

            marko Marko Mäkelä added a comment - danblack , any news on this? I see that 125 is ECANCELED in /usr/include/asm-generic/errno.h . mleich got this on 5.15.0-52-generic #58-Ubuntu SMP Thu Oct 13 08:03:55 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux on an ext4 file system that was loopback mounted in a file that resided in /dev/shm . (Such indirection makes sense when using the older libaio ; it does not support files residing directly on /dev/shm .)

            Yes, "doing fork/exit with IO inflight" seems to apply here:

            # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] /data/Server_bin/bb-10.6-MDEV-31234_RelWithDebInfo/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
            # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] InnoDB: FTS optimize thread exiting.
            # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async lru write of 4096 bytes, for file 9, returned 0
            

            danblack, do you have a patch that makes it easier to hit this bug?

            marko Marko Mäkelä added a comment - Yes, "doing fork/exit with IO inflight" seems to apply here: # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] /data/Server_bin/bb-10.6-MDEV-31234_RelWithDebInfo/bin/mysqld (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [Note] InnoDB: FTS optimize thread exiting. # 2023-05-17T11:03:59 [216405] | 2023-05-17 11:03:55 0 [ERROR] [FATAL] InnoDB: IO Error: 125 during async lru write of 4096 bytes, for file 9, returned 0 danblack , do you have a patch that makes it easier to hit this bug?
            marko Marko Mäkelä made changes -
            Assignee Daniel Black [ danblack ]
            Resolution Not a Bug [ 6 ]
            Status Closed [ 6 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Issue Type Task [ 3 ] Bug [ 1 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Component/s Storage Engine - InnoDB [ 10129 ]
            Component/s N/A [ 14411 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.0 [ 28320 ]
            Fix Version/s 11.1 [ 28549 ]
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s N/A [ 14700 ]
            Affects Version/s 10.6 [ 24028 ]
            Affects Version/s 10.7 [ 24805 ]
            Affects Version/s 10.8 [ 26121 ]
            Affects Version/s 10.9 [ 26905 ]
            Affects Version/s 10.10 [ 27530 ]
            Affects Version/s 10.11 [ 27614 ]
            Affects Version/s 11.0 [ 28320 ]
            Affects Version/s 11.1 [ 28549 ]
            Affects Version/s 11.2 [ 28603 ]
            marko Marko Mäkelä made changes -
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]

            In MDEV-27593 I changed the wording of the message and made it a non-fatal error. Actually, I accidentally made it Note instead of ERROR, so this should not cause any mtr test failures anymore:

            diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc
            index 906ce79e8fd..99f212b6eeb 100644
            --- a/storage/innobase/os/os0file.cc
            +++ b/storage/innobase/os/os0file.cc
            @@ -3459,8 +3459,11 @@ static void write_io_callback(void *c)
               const IORequest &request= *static_cast<const IORequest*>
                 (static_cast<const void*>(cb->m_userdata));
             
            +  if (!ut_rnd_interval(100))
            +    cb->m_err= 42;
            +
               if (UNIV_UNLIKELY(cb->m_err != 0))
            -    ib::info () << "IO Error: " << cb->m_err
            +    ib::error() << "IO Error: " << cb->m_err
                             << "during write of "
                             << cb->m_len << " bytes, for file "
                             << request.node->name << "(" << cb->m_fh << "), returned "
            

            10.6 72928e640e62559f4d4b82e9ff1102bb72bd5828 with patch

            var/9/log/mysqld.1.err:2023-08-01 15:27:09 0 [Note] InnoDB: IO Error: 42during write of 16384 bytes, for file ./ibdata1(11), returned 16384
            

            marko Marko Mäkelä added a comment - In MDEV-27593 I changed the wording of the message and made it a non-fatal error. Actually, I accidentally made it Note instead of ERROR , so this should not cause any mtr test failures anymore: diff --git a/storage/innobase/os/os0file.cc b/storage/innobase/os/os0file.cc index 906ce79e8fd..99f212b6eeb 100644 --- a/storage/innobase/os/os0file.cc +++ b/storage/innobase/os/os0file.cc @@ -3459,8 +3459,11 @@ static void write_io_callback(void *c) const IORequest &request= *static_cast<const IORequest*> (static_cast<const void*>(cb->m_userdata)); + if (!ut_rnd_interval(100)) + cb->m_err= 42; + if (UNIV_UNLIKELY(cb->m_err != 0)) - ib::info () << "IO Error: " << cb->m_err + ib::error() << "IO Error: " << cb->m_err << "during write of " << cb->m_len << " bytes, for file " << request.node->name << "(" << cb->m_fh << "), returned " 10.6 72928e640e62559f4d4b82e9ff1102bb72bd5828 with patch var/9/log/mysqld.1.err:2023-08-01 15:27:09 0 [Note] InnoDB: IO Error: 42during write of 16384 bytes, for file ./ibdata1(11), returned 16384
            danblack Daniel Black made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            danblack Daniel Black made changes -
            Assignee Daniel Black [ danblack ]

            I got some logs of failures that occurred after MDEV-27593 was pushed into 10.6 (2003-08-01). I found only one failure that includes that fix:

            bb-11.0-ycp-mdev-31996 d07012f4d84e57f1895f848b029fabd220b5d5be ppc64le-ubuntu-2204

            encryption.innodb-checksum-algorithm '32k,ctr,innodb,strict_crc32' w8 [ fail ]
            …
            2023-10-06  3:37:06 4 [Note] InnoDB: Importing tablespace for table 'test/tp' that was exported from host 'c8473bd36d6b'
            2023-10-06  3:37:06 4 [Note] InnoDB: Phase I - Update all pages
            2023-10-06  3:37:06 4 [Note] InnoDB: Sync to disk
            2023-10-06  3:37:06 4 [Note] InnoDB: Sync to disk - done!
            2023-10-06  3:37:06 4 [Note] InnoDB: Phase III - Flush changes to disk
            2023-10-06  3:37:06 4 [Note] InnoDB: Phase IV - Flush complete
            2023-10-06  3:37:06 4 [Note] InnoDB: `test`.`tp` autoinc value set to 2
            2023-10-06  3:37:07 0 [Note] InnoDB: IO Error: 125during write of 491520 bytes, for file ./ibdata1(11), returned 0
            ----------SERVER LOG END-------------
            mysqltest failed but provided no output
            

            The server apparently hung here. The reason could be unrelated to the io_uring error, such as MDEV-32588.

            marko Marko Mäkelä added a comment - I got some logs of failures that occurred after MDEV-27593 was pushed into 10.6 (2003-08-01). I found only one failure that includes that fix: bb-11.0-ycp-mdev-31996 d07012f4d84e57f1895f848b029fabd220b5d5be ppc64le-ubuntu-2204 encryption.innodb-checksum-algorithm '32k,ctr,innodb,strict_crc32' w8 [ fail ] … 2023-10-06 3:37:06 4 [Note] InnoDB: Importing tablespace for table 'test/tp' that was exported from host 'c8473bd36d6b' 2023-10-06 3:37:06 4 [Note] InnoDB: Phase I - Update all pages 2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk 2023-10-06 3:37:06 4 [Note] InnoDB: Sync to disk - done! 2023-10-06 3:37:06 4 [Note] InnoDB: Phase III - Flush changes to disk 2023-10-06 3:37:06 4 [Note] InnoDB: Phase IV - Flush complete 2023-10-06 3:37:06 4 [Note] InnoDB: `test`.`tp` autoinc value set to 2 2023-10-06 3:37:07 0 [Note] InnoDB: IO Error: 125during write of 491520 bytes, for file ./ibdata1(11), returned 0 ----------SERVER LOG END------------- mysqltest failed but provided no output The server apparently hung here. The reason could be unrelated to the io_uring error, such as MDEV-32588 .
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.9 [ 26905 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.10 [ 27530 ]

            A Linux kernel bug that was behind MDEV-30728 is specific to the POWER ISA. Therefore, this issue (errors on io_uring write completion callback, observed on ARMv8 and AMD64) should be unaffected by that fix.

            marko Marko Mäkelä added a comment - A Linux kernel bug that was behind MDEV-30728 is specific to the POWER ISA. Therefore, this issue (errors on io_uring write completion callback, observed on ARMv8 and AMD64) should be unaffected by that fix.
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.0 [ 28320 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.1 [ 28549 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.2(EOL) [ 28603 ]

            People

              Unassigned Unassigned
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.