Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33189

Server crash after reading outside of bounds on ibdata1 , file corrupted, no auto-recovery

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Incomplete
    • 10.11.6, 11.2(EOL), 11.4
    • N/A
    • None
    • Debian GNU/Linux 11 (bullseye)
      Dell PowerEdge R750
      XFS filesystem

    Description

      We experienced a one-time server crash in production, so far not reproducible.

      We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem.

      The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

      The server itself crashed with:

      InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
      InnoDB: File './ibdata1' is corrupted
      

      and two assertion failures in trx0undo.cc and buf0lru.cc. All subsequent restart attempts failed so we switched the application over to the replica database.

      We did not attempt any forced recovery. The assertion failures:

      InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
      InnoDB: Failing assertion: rollback
      231221 14:24:48 [ERROR] mysqld got signal 6 ;
      

      The backtrace only gave one line before having the next assertion failure.

      stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
      InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
      InnoDB: Failing assertion: !block->page.in_file()
      

      See attachment db-syslog.2023-12-21.txt for all the relevant syslog entries.

      We have preserved the corrupt 716 MiB ibdata1 (750780416 B) file for further inspection, should the need arise.

      Attachments

        Issue Links

          Activity

            wschemmel Wolfgang Schemmel created issue -
            wschemmel Wolfgang Schemmel made changes -
            Field Original Value New Value
            Description We experienced a one-time server crash in production, so far not reproducible.

            We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem.

            The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

            The server itself crashed with:

            {noformat}
            InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
            InnoDB: File './ibdata1' is corrupted
            {noformat}

            and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database.

            We did not attempt any forced recovery. The assertion failures:

            {noformat}
            InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
            InnoDB: Failing assertion: rollback
            231221 14:24:48 [ERROR] mysqld got signal 6 ;
            {noformat}

            The backtrace only gave one line before having the next assertion failure.

            {noformat}
            stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
            InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
            InnoDB: Failing assertion: !block->page.in_file()
            {noformat}

            See attachment for all the relevant syslog entries.
            We experienced a one-time server crash in production, so far not reproducible.

            We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem.

            The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

            The server itself crashed with:

            {noformat}
            InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
            InnoDB: File './ibdata1' is corrupted
            {noformat}

            and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database.

            We did not attempt any forced recovery. The assertion failures:

            {noformat}
            InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
            InnoDB: Failing assertion: rollback
            231221 14:24:48 [ERROR] mysqld got signal 6 ;
            {noformat}

            The backtrace only gave one line before having the next assertion failure.

            {noformat}
            stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
            InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
            InnoDB: Failing assertion: !block->page.in_file()
            {noformat}

            See attachment [^db-syslog.2023-12-21.txt] for all the relevant syslog entries.
            wschemmel Wolfgang Schemmel made changes -
            Description We experienced a one-time server crash in production, so far not reproducible.

            We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem.

            The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

            The server itself crashed with:

            {noformat}
            InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
            InnoDB: File './ibdata1' is corrupted
            {noformat}

            and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database.

            We did not attempt any forced recovery. The assertion failures:

            {noformat}
            InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
            InnoDB: Failing assertion: rollback
            231221 14:24:48 [ERROR] mysqld got signal 6 ;
            {noformat}

            The backtrace only gave one line before having the next assertion failure.

            {noformat}
            stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
            InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
            InnoDB: Failing assertion: !block->page.in_file()
            {noformat}

            See attachment [^db-syslog.2023-12-21.txt] for all the relevant syslog entries.
            We experienced a one-time server crash in production, so far not reproducible.

            We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem.

            The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

            The server itself crashed with:

            {noformat}
            InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
            InnoDB: File './ibdata1' is corrupted
            {noformat}

            and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database.

            We did not attempt any forced recovery. The assertion failures:

            {noformat}
            InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
            InnoDB: Failing assertion: rollback
            231221 14:24:48 [ERROR] mysqld got signal 6 ;
            {noformat}

            The backtrace only gave one line before having the next assertion failure.

            {noformat}
            stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
            InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
            InnoDB: Failing assertion: !block->page.in_file()
            {noformat}

            See attachment [^db-syslog.2023-12-21.txt] for all the relevant syslog entries.

            We have preserved the corrupt 716 MiB {{ibdata1}} (750780416 B) file for further inspection, should the need arise.

            The byte offset 70368744161280 corresponds to the page number 0xffffffff when using the default innodb_page_size=16k.

            Now that MDEV-13542 has been fixed, it should be relatively easy to fix the crash on this corrupted data, if you can provide access to the corrupted data directory. Hopefully you can reproduce this with a test workload that does not include any confidential data.

            Without having access to the data directory, it is hard to say what could cause this corruption. Ever since MDEV-24854 (making innodb_flush_method=O_DIRECT the default) was implemented in MariaDB Server 10.6, we get reports of corruption on some file systems. I do not remember if there were any reports of corruption on XFS yet. The Linux kernel version may be relevant. You could try if innodb_flush_method=fsync works better. If this build uses io_uring, then you might want to test innodb_use_native_aio=0 as well, to check if there could be some bug in io_uring that affects XFS.

            Most of our internal testing takes place on Linux ext4 or tmpfs.

            marko Marko Mäkelä added a comment - The byte offset 70368744161280 corresponds to the page number 0xffffffff when using the default innodb_page_size=16k . Now that MDEV-13542 has been fixed, it should be relatively easy to fix the crash on this corrupted data, if you can provide access to the corrupted data directory. Hopefully you can reproduce this with a test workload that does not include any confidential data. Without having access to the data directory, it is hard to say what could cause this corruption. Ever since MDEV-24854 (making innodb_flush_method=O_DIRECT the default) was implemented in MariaDB Server 10.6, we get reports of corruption on some file systems. I do not remember if there were any reports of corruption on XFS yet. The Linux kernel version may be relevant. You could try if innodb_flush_method=fsync works better. If this build uses io_uring , then you might want to test innodb_use_native_aio=0 as well, to check if there could be some bug in io_uring that affects XFS. Most of our internal testing takes place on Linux ext4 or tmpfs.
            marko Marko Mäkelä made changes -
            Assignee Marko Mäkelä [ marko ]
            Status Open [ 1 ] Needs Feedback [ 10501 ]

            I happened to reproduce this by running a CMAKE_BUILD_TYPE=RelWithDebInfo build:

            st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22

            CURRENT_TEST: stress.ddl_innodb
            mysqltest: In included file "./suite/stress/include/ddl5.inc": 
            included from /mariadb/11.1/mysql-test/suite/stress/t/ddl_innodb.test at line 42:
            At line 71: query '$drop_table' failed: <Unknown> (2013): Lost connection to server during query
            

            In the server error log I can see the following:

            st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22

            CURRENT_TEST: stress.ddl_innodb
            2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
            2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted
            2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
            2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
            2024-01-10 14:17:44 0x7f7c7c2a56c0  InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263
            InnoDB: Failing assertion: undo_page
            

            Here, an assertion is failing during a DROP TABLE t1 operation that was rolled back, supposedly due to some corruption:

            st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22

            #7  0x0000559040129713 in ut_dbg_assertion_failed (expr=expr@entry=0x55903f35b378 "undo_page", file=file@entry=0x55903f3dced8 "/mariadb/11.1/storage/innobase/trx/trx0purge.cc", line=line@entry=263) at /mariadb/11.1/storage/innobase/ut/ut0dbg.cc:60
            #8  0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263
            #9  0x0000559040123d51 in trx_t::write_serialisation_history (this=this@entry=0x7f7c84988180, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1161
            #10 0x0000559040122880 in trx_t::commit_low (this=this@entry=0x7f7c84988180, mtr=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1570
            #11 0x000055904012019e in trx_t::commit_persist (this=this@entry=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1598
            #12 0x00005590401201e8 in trx_t::commit (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1607
            #13 0x000055904011a469 in trx_t::rollback_finish (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:65
            #14 trx_t::rollback_low (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:132
            #15 0x0000559040118db4 in trx_t::rollback (savept=0x0, this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:178
            #16 trx_t::rollback (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:165
            #17 0x000055903ff7898e in ha_innobase::delete_table (this=<optimized out>, name=<optimized out>) at /mariadb/11.1/storage/innobase/handler/ha_innodb.cc:13621
            #18 0x000055903fa0d6bf in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /mariadb/11.1/sql/handler.cc:575
            #19 0x000055903fa12b6e in ha_delete_table (thd=thd@entry=0x7f7c4c000c68, hton=hton@entry=0x5590416ed448, path=path@entry=0x7f7c7c2a3860 "./test/t1", db=db@entry=0x7f7c7c2a3650, alias=alias@entry=0x7f7c7c2a3660, generate_warning=generate_warning@entry=true) at /mariadb/11.1/sql/handler.cc:3208
            

            Now that MDEV-30100 has been fixed, I would not expect a rollback of a transaction to get here. But, I see that trx->undo_no is 1, that is, one undo log record of the transaction was not removed during the rollback. I will try to get some more information from this crash. My options are limited, because due to MDEV-10814, the buffer pool is not included in the core dump.

            marko Marko Mäkelä added a comment - I happened to reproduce this by running a CMAKE_BUILD_TYPE=RelWithDebInfo build: st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22 CURRENT_TEST: stress.ddl_innodb mysqltest: In included file "./suite/stress/include/ddl5.inc": included from /mariadb/11.1/mysql-test/suite/stress/t/ddl_innodb.test at line 42: At line 71: query '$drop_table' failed: <Unknown> (2013): Lost connection to server during query In the server error log I can see the following: st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22 CURRENT_TEST: stress.ddl_innodb 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 0x7f7c7c2a56c0 InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263 InnoDB: Failing assertion: undo_page Here, an assertion is failing during a DROP TABLE t1 operation that was rolled back, supposedly due to some corruption: st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22 #7 0x0000559040129713 in ut_dbg_assertion_failed (expr=expr@entry=0x55903f35b378 "undo_page", file=file@entry=0x55903f3dced8 "/mariadb/11.1/storage/innobase/trx/trx0purge.cc", line=line@entry=263) at /mariadb/11.1/storage/innobase/ut/ut0dbg.cc:60 #8 0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263 #9 0x0000559040123d51 in trx_t::write_serialisation_history (this=this@entry=0x7f7c84988180, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1161 #10 0x0000559040122880 in trx_t::commit_low (this=this@entry=0x7f7c84988180, mtr=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1570 #11 0x000055904012019e in trx_t::commit_persist (this=this@entry=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1598 #12 0x00005590401201e8 in trx_t::commit (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1607 #13 0x000055904011a469 in trx_t::rollback_finish (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:65 #14 trx_t::rollback_low (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:132 #15 0x0000559040118db4 in trx_t::rollback (savept=0x0, this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:178 #16 trx_t::rollback (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:165 #17 0x000055903ff7898e in ha_innobase::delete_table (this=<optimized out>, name=<optimized out>) at /mariadb/11.1/storage/innobase/handler/ha_innodb.cc:13621 #18 0x000055903fa0d6bf in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /mariadb/11.1/sql/handler.cc:575 #19 0x000055903fa12b6e in ha_delete_table (thd=thd@entry=0x7f7c4c000c68, hton=hton@entry=0x5590416ed448, path=path@entry=0x7f7c7c2a3860 "./test/t1", db=db@entry=0x7f7c7c2a3650, alias=alias@entry=0x7f7c7c2a3660, generate_warning=generate_warning@entry=true) at /mariadb/11.1/sql/handler.cc:3208 Now that MDEV-30100 has been fixed, I would not expect a rollback of a transaction to get here. But, I see that trx->undo_no is 1, that is, one undo log record of the transaction was not removed during the rollback. I will try to get some more information from this crash. My options are limited, because due to MDEV-10814 , the buffer pool is not included in the core dump.
            marko Marko Mäkelä made changes -
            Status Needs Feedback [ 10501 ] Open [ 1 ]

            Here is some more information:

            #8  0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263
            263	  ut_a(undo_page);
            (gdb) p *undo
            $5 = {id = 0, state = 1, trx_id = 448556, xid = {formatID = -1, gtrid_length = 0, bqual_length = 0, data = '\000' <repeats 127 times>}, dict_operation = true, rseg = 0x5590403db9c0 <trx_sys+22016>, 
              hdr_page_no = 4294967295, last_page_no = 4294967295, hdr_offset = 86, size = 1, top_page_no = 4294967295, top_offset = 272, top_undo_no = 0, guess_block = 0x7f7c84085aa0, undo_list = {prev = 0x0, next = 0x0}}
            (gdb) p trx.id
            $6 = 448556
            (gdb) p *undo.rseg
            $7 = {space = 0x5590417c3b90, latch = {writer = {lock = std::atomic<unsigned int> = { 2147483649 }}, readers = std::atomic<unsigned int> = { 2147483648 }}, page_no = 16, history_size = 0, needs_purge = 448557, 
              ref = std::atomic<unsigned int> = { 2 }, curr_size = 2, undo_list = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list}, undo_cached = {count = 0, start = 0x0, end = 0x0, 
                node = &trx_undo_t::undo_list}, last_page_no = 4294967295, last_commit_and_offset = 24206847997564973}
            

            Based on this, I checked the rollback segment header page 16 (undo->rseg->space->id==3, pointing to the file undo003):

            od -Ax -t x1 -j 0x40000 -N 0x4000 var/log/stress.ddl_innodb-innodb/mysqld.1/data/undo003
            

            040000 00 00 00 00 00 00 00 10 ff ff ff ff ff ff ff ff
            040010 00 00 00 00 0f 59 6a 41 00 06 00 00 00 00 00 00
            040020 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00
            040030 00 00 ff ff ff ff 00 00 ff ff ff ff 00 00 00 00
            040040 00 03 00 00 00 02 09 f2 ff ff ff ff ff ff ff ff
            040050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
            *
            041040 ff ff ff ff ff ff ff ff 00 00 00 00 00 06 d5 4d
            041050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
            *
            043ff0 00 00 00 00 00 00 00 00 0f 59 6a 41 34 3d 8c 83
            044000
            

            We can see that most page numbers really are FIL_NULL or 0xffffffff in this page.

            Because this test is not killing or restarting the server, I think that we must look at the preceding tests that did, and possibly other tests that could have corrupted the data directory even earlier:

            CURRENT_TEST: main.xa
            CURRENT_TEST: main.xtradb_mrr
            CURRENT_TEST: merge.merge_innodb
            CURRENT_TEST: perfschema.dml_file_instances
            CURRENT_TEST: perfschema.innodb_events_transactions_history_long
            CURRENT_TEST: perfschema.innodb_table_io
            CURRENT_TEST: perfschema.misc
            CURRENT_TEST: perfschema.misc_session_status
            CURRENT_TEST: perfschema.rollback_table_io
            CURRENT_TEST: perfschema.table_name
            CURRENT_TEST: perfschema.threads_innodb
            CURRENT_TEST: plugins.compression_load
            CURRENT_TEST: plugins.processlist
            CURRENT_TEST: sql_sequence.kill
            CURRENT_TEST: sql_sequence.mysqldump
            CURRENT_TEST: sql_sequence.read_only
            2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
            2024-01-10 14:16:20 0 [Note] InnoDB: FTS optimize thread exiting.
            2024-01-10 14:16:20 0 [Note] InnoDB: Starting shutdown...
            2024-01-10 14:16:20 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360
            2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd: Shutdown complete
            2024-01-10 14:16:21 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1360
            2024-01-10 14:16:22 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360
            2024-01-10 14:16:23 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1361
            2024-01-10 14:16:23 0 [Note] /dev/shm/11.2/sql/mariadbd: ready for connections.
            Version: '11.2.3-MariaDB-log'  socket: '/dev/shm/11.2/mysql-test/var/tmp/51/mysqld.1.sock'  port: 16940  Source distribution
            CURRENT_TEST: sql_sequence.rebuild
            CURRENT_TEST: stress.ddl_innodb
            2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
            2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted
            2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
            2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
            2024-01-10 14:17:44 0x7f7c7c2a56c0  InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263
            InnoDB: Failing assertion: undo_page
            

            The full list of tests that were possibly run on this data directory was as follows.

            binlog_encryption.rpl_stm_relay_ign_space
            encryption.tempfiles
            encryption.innodb-spatial-index
            encryption.innodb-compressed-blob
            rpl.rpl_partition_innodb
            rpl.rpl_partition_memory
            rpl.rpl_partition_myisam
            rpl.rpl_stm_lcase_tblnames
            rpl.rpl_multi_update2
            rpl.rpl_gtid_master_promote
            rpl.rpl_start_alter_chain_basic
            main.ipv4_as_ipv6
            binlog.binlog_mysqlbinlog-cp932
            binlog.flashback-largebinlog
            binlog.binlog_mysqlbinlog_row
            binlog.binlog_mysqlbinlog_row_myisam
            binlog.binlog_row_ctype_ucs
            binlog.binlog_row_insert_select
            binlog.binlog_row_mysqlbinlog_options
            binlog.binlog_stm_ctype_ucs
            binlog.binlog_stm_insert_select
            binlog.binlog_stm_row
            binlog.binlog_tmp_table
            binlog.binlog_tmp_table_row
            main.ctype_cp932_binlog_row
            main.mysqlbinlog_row_compressed
            mariabackup.binlog
            innodb.restart
            innodb_zip.innochecksum_3
            innodb.encryption_threads_shutdown
            innodb.innodb-mdev-7513
            sys_vars.innodb_write_io_threads_basic
            innodb.innodb_autoinc_lock_mode_zero
            main.statistics_json
            main.type_blob
            main.type_num_innodb
            main.type_temporal_innodb
            main.type_time_hires
            main.type_timestamp_hires
            main.union_innodb
            main.unique
            main.unsafe_binlog_innodb
            main.update
            main.update_innodb
            main.xa
            main.xtradb_mrr
            merge.merge_innodb
            perfschema.dml_file_instances
            perfschema.innodb_events_transactions_history_long
            perfschema.innodb_table_io
            perfschema.misc
            perfschema.misc_session_status
            perfschema.rollback_table_io
            perfschema.table_name
            perfschema.threads_innodb
            plugins.compression_load
            plugins.processlist
            sql_sequence.kill
            sql_sequence.mysqldump
            sql_sequence.read_only
            sql_sequence.rebuild
            stress.ddl_innodb
            stress.ddl_innodb
            binlog_encryption.rpl_stm_relay_ign_space
            encryption.tempfiles
            encryption.innodb-spatial-index
            encryption.innodb-compressed-blob
            rpl.rpl_partition_innodb
            rpl.rpl_partition_memory
            rpl.rpl_partition_myisam
            rpl.rpl_stm_lcase_tblnames
            rpl.rpl_multi_update2
            rpl.rpl_gtid_master_promote
            rpl.rpl_start_alter_chain_basic
            main.ipv4_as_ipv6
            binlog.binlog_mysqlbinlog-cp932
            binlog.flashback-largebinlog
            binlog.binlog_mysqlbinlog_row
            binlog.binlog_mysqlbinlog_row_myisam
            binlog.binlog_row_ctype_ucs
            binlog.binlog_row_insert_select
            binlog.binlog_row_mysqlbinlog_options
            binlog.binlog_stm_ctype_ucs
            binlog.binlog_stm_insert_select
            binlog.binlog_stm_row
            binlog.binlog_tmp_table
            binlog.binlog_tmp_table_row
            main.ctype_cp932_binlog_row
            main.mysqlbinlog_row_compressed
            mariabackup.binlog
            innodb.restart
            innodb_zip.innochecksum_3
            innodb.encryption_threads_shutdown
            innodb.innodb-mdev-7513
            sys_vars.innodb_write_io_threads_basic
            innodb.innodb_autoinc_lock_mode_zero
            main.statistics_json
            main.type_blob
            main.type_num_innodb
            main.type_temporal_innodb
            main.type_time_hires
            main.type_timestamp_hires
            main.union_innodb
            main.unique
            main.unsafe_binlog_innodb
            main.update
            main.update_innodb
            main.xa
            main.xtradb_mrr
            merge.merge_innodb
            perfschema.dml_file_instances
            perfschema.innodb_events_transactions_history_long
            perfschema.innodb_table_io
            perfschema.misc
            perfschema.misc_session_status
            perfschema.rollback_table_io
            perfschema.table_name
            perfschema.threads_innodb
            plugins.compression_load
            plugins.processlist
            sql_sequence.kill
            sql_sequence.mysqldump
            sql_sequence.read_only
            sql_sequence.rebuild
            stress.ddl_innodb
            binlog_encryption.rpl_stm_relay_ign_space
            rpl.rpl_partition_innodb
            rpl.rpl_partition_memory
            rpl.rpl_partition_myisam
            rpl.rpl_stm_lcase_tblnames
            rpl.rpl_multi_update2
            rpl.rpl_gtid_master_promote
            rpl.rpl_start_alter_chain_basic
            rpl.rpl_gtid_master_promote
            rpl.rpl_start_alter_chain_basic
            rpl.rpl_gtid_master_promote
            rpl.rpl_start_alter_chain_basic
            rpl.rpl_gtid_master_promote
            binlog_encryption.rpl_stm_relay_ign_space
            encryption.tempfiles
            encryption.innodb-spatial-index
            encryption.innodb-compressed-blob
            rpl.rpl_partition_innodb
            rpl.rpl_partition_memory
            rpl.rpl_partition_myisam
            rpl.rpl_stm_lcase_tblnames
            rpl.rpl_multi_update2
            rpl.rpl_gtid_master_promote
            rpl.rpl_start_alter_chain_basic
            main.ipv4_as_ipv6
            binlog.binlog_mysqlbinlog-cp932
            binlog.flashback-largebinlog
            binlog.binlog_mysqlbinlog_row
            binlog.binlog_mysqlbinlog_row_myisam
            binlog.binlog_row_ctype_ucs
            binlog.binlog_row_insert_select
            binlog.binlog_row_mysqlbinlog_options
            binlog.binlog_stm_ctype_ucs
            binlog.binlog_stm_insert_select
            binlog.binlog_stm_row
            binlog.binlog_tmp_table
            binlog.binlog_tmp_table_row
            main.ctype_cp932_binlog_row
            main.mysqlbinlog_row_compressed
            mariabackup.binlog
            innodb.restart
            innodb_zip.innochecksum_3
            innodb.encryption_threads_shutdown
            innodb.innodb-mdev-7513
            sys_vars.innodb_write_io_threads_basic
            innodb.innodb_autoinc_lock_mode_zero
            main.statistics_json
            main.type_blob
            main.type_num_innodb
            main.type_temporal_innodb
            main.type_time_hires
            main.type_timestamp_hires
            main.union_innodb
            main.unique
            main.unsafe_binlog_innodb
            main.update
            main.update_innodb
            main.xa
            main.xtradb_mrr
            merge.merge_innodb
            perfschema.dml_file_instances
            perfschema.innodb_events_transactions_history_long
            perfschema.innodb_table_io
            perfschema.misc
            perfschema.misc_session_status
            perfschema.rollback_table_io
            perfschema.table_name
            perfschema.threads_innodb
            plugins.compression_load
            plugins.processlist
            sql_sequence.kill
            sql_sequence.mysqldump
            sql_sequence.read_only
            sql_sequence.rebuild
            stress.ddl_innodb
            

            The database might have been rebootstrapped in between. This just gives some idea how we might reproduce this.

            marko Marko Mäkelä added a comment - Here is some more information: #8 0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263 263 ut_a(undo_page); (gdb) p *undo $5 = {id = 0, state = 1, trx_id = 448556, xid = {formatID = -1, gtrid_length = 0, bqual_length = 0, data = '\000' <repeats 127 times>}, dict_operation = true, rseg = 0x5590403db9c0 <trx_sys+22016>, hdr_page_no = 4294967295, last_page_no = 4294967295, hdr_offset = 86, size = 1, top_page_no = 4294967295, top_offset = 272, top_undo_no = 0, guess_block = 0x7f7c84085aa0, undo_list = {prev = 0x0, next = 0x0}} (gdb) p trx.id $6 = 448556 (gdb) p *undo.rseg $7 = {space = 0x5590417c3b90, latch = {writer = {lock = std::atomic<unsigned int> = { 2147483649 }}, readers = std::atomic<unsigned int> = { 2147483648 }}, page_no = 16, history_size = 0, needs_purge = 448557, ref = std::atomic<unsigned int> = { 2 }, curr_size = 2, undo_list = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list}, undo_cached = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list}, last_page_no = 4294967295, last_commit_and_offset = 24206847997564973} Based on this, I checked the rollback segment header page 16 ( undo->rseg->space->id==3 , pointing to the file undo003 ): od -Ax -t x1 -j 0x40000 -N 0x4000 var/log/stress.ddl_innodb-innodb/mysqld.1/data/undo003 040000 00 00 00 00 00 00 00 10 ff ff ff ff ff ff ff ff 040010 00 00 00 00 0f 59 6a 41 00 06 00 00 00 00 00 00 040020 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 040030 00 00 ff ff ff ff 00 00 ff ff ff ff 00 00 00 00 040040 00 03 00 00 00 02 09 f2 ff ff ff ff ff ff ff ff 040050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 041040 ff ff ff ff ff ff ff ff 00 00 00 00 00 06 d5 4d 041050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 043ff0 00 00 00 00 00 00 00 00 0f 59 6a 41 34 3d 8c 83 044000 We can see that most page numbers really are FIL_NULL or 0xffffffff in this page. Because this test is not killing or restarting the server, I think that we must look at the preceding tests that did, and possibly other tests that could have corrupted the data directory even earlier: CURRENT_TEST: main.xa CURRENT_TEST: main.xtradb_mrr CURRENT_TEST: merge.merge_innodb CURRENT_TEST: perfschema.dml_file_instances CURRENT_TEST: perfschema.innodb_events_transactions_history_long CURRENT_TEST: perfschema.innodb_table_io CURRENT_TEST: perfschema.misc CURRENT_TEST: perfschema.misc_session_status CURRENT_TEST: perfschema.rollback_table_io CURRENT_TEST: perfschema.table_name CURRENT_TEST: perfschema.threads_innodb CURRENT_TEST: plugins.compression_load CURRENT_TEST: plugins.processlist CURRENT_TEST: sql_sequence.kill CURRENT_TEST: sql_sequence.mysqldump CURRENT_TEST: sql_sequence.read_only 2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown 2024-01-10 14:16:20 0 [Note] InnoDB: FTS optimize thread exiting. 2024-01-10 14:16:20 0 [Note] InnoDB: Starting shutdown... … 2024-01-10 14:16:20 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360 2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd: Shutdown complete … 2024-01-10 14:16:21 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1360 … 2024-01-10 14:16:22 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360 … 2024-01-10 14:16:23 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1361 … 2024-01-10 14:16:23 0 [Note] /dev/shm/11.2/sql/mariadbd: ready for connections. Version: '11.2.3-MariaDB-log' socket: '/dev/shm/11.2/mysql-test/var/tmp/51/mysqld.1.sock' port: 16940 Source distribution CURRENT_TEST: sql_sequence.rebuild CURRENT_TEST: stress.ddl_innodb 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 0x7f7c7c2a56c0 InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263 InnoDB: Failing assertion: undo_page The full list of tests that were possibly run on this data directory was as follows. binlog_encryption.rpl_stm_relay_ign_space encryption.tempfiles encryption.innodb-spatial-index encryption.innodb-compressed-blob rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic main.ipv4_as_ipv6 binlog.binlog_mysqlbinlog-cp932 binlog.flashback-largebinlog binlog.binlog_mysqlbinlog_row binlog.binlog_mysqlbinlog_row_myisam binlog.binlog_row_ctype_ucs binlog.binlog_row_insert_select binlog.binlog_row_mysqlbinlog_options binlog.binlog_stm_ctype_ucs binlog.binlog_stm_insert_select binlog.binlog_stm_row binlog.binlog_tmp_table binlog.binlog_tmp_table_row main.ctype_cp932_binlog_row main.mysqlbinlog_row_compressed mariabackup.binlog innodb.restart innodb_zip.innochecksum_3 innodb.encryption_threads_shutdown innodb.innodb-mdev-7513 sys_vars.innodb_write_io_threads_basic innodb.innodb_autoinc_lock_mode_zero main.statistics_json main.type_blob main.type_num_innodb main.type_temporal_innodb main.type_time_hires main.type_timestamp_hires main.union_innodb main.unique main.unsafe_binlog_innodb main.update main.update_innodb main.xa main.xtradb_mrr merge.merge_innodb perfschema.dml_file_instances perfschema.innodb_events_transactions_history_long perfschema.innodb_table_io perfschema.misc perfschema.misc_session_status perfschema.rollback_table_io perfschema.table_name perfschema.threads_innodb plugins.compression_load plugins.processlist sql_sequence.kill sql_sequence.mysqldump sql_sequence.read_only sql_sequence.rebuild stress.ddl_innodb stress.ddl_innodb binlog_encryption.rpl_stm_relay_ign_space encryption.tempfiles encryption.innodb-spatial-index encryption.innodb-compressed-blob rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic main.ipv4_as_ipv6 binlog.binlog_mysqlbinlog-cp932 binlog.flashback-largebinlog binlog.binlog_mysqlbinlog_row binlog.binlog_mysqlbinlog_row_myisam binlog.binlog_row_ctype_ucs binlog.binlog_row_insert_select binlog.binlog_row_mysqlbinlog_options binlog.binlog_stm_ctype_ucs binlog.binlog_stm_insert_select binlog.binlog_stm_row binlog.binlog_tmp_table binlog.binlog_tmp_table_row main.ctype_cp932_binlog_row main.mysqlbinlog_row_compressed mariabackup.binlog innodb.restart innodb_zip.innochecksum_3 innodb.encryption_threads_shutdown innodb.innodb-mdev-7513 sys_vars.innodb_write_io_threads_basic innodb.innodb_autoinc_lock_mode_zero main.statistics_json main.type_blob main.type_num_innodb main.type_temporal_innodb main.type_time_hires main.type_timestamp_hires main.union_innodb main.unique main.unsafe_binlog_innodb main.update main.update_innodb main.xa main.xtradb_mrr merge.merge_innodb perfschema.dml_file_instances perfschema.innodb_events_transactions_history_long perfschema.innodb_table_io perfschema.misc perfschema.misc_session_status perfschema.rollback_table_io perfschema.table_name perfschema.threads_innodb plugins.compression_load plugins.processlist sql_sequence.kill sql_sequence.mysqldump sql_sequence.read_only sql_sequence.rebuild stress.ddl_innodb binlog_encryption.rpl_stm_relay_ign_space rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic rpl.rpl_gtid_master_promote binlog_encryption.rpl_stm_relay_ign_space encryption.tempfiles encryption.innodb-spatial-index encryption.innodb-compressed-blob rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic main.ipv4_as_ipv6 binlog.binlog_mysqlbinlog-cp932 binlog.flashback-largebinlog binlog.binlog_mysqlbinlog_row binlog.binlog_mysqlbinlog_row_myisam binlog.binlog_row_ctype_ucs binlog.binlog_row_insert_select binlog.binlog_row_mysqlbinlog_options binlog.binlog_stm_ctype_ucs binlog.binlog_stm_insert_select binlog.binlog_stm_row binlog.binlog_tmp_table binlog.binlog_tmp_table_row main.ctype_cp932_binlog_row main.mysqlbinlog_row_compressed mariabackup.binlog innodb.restart innodb_zip.innochecksum_3 innodb.encryption_threads_shutdown innodb.innodb-mdev-7513 sys_vars.innodb_write_io_threads_basic innodb.innodb_autoinc_lock_mode_zero main.statistics_json main.type_blob main.type_num_innodb main.type_temporal_innodb main.type_time_hires main.type_timestamp_hires main.union_innodb main.unique main.unsafe_binlog_innodb main.update main.update_innodb main.xa main.xtradb_mrr merge.merge_innodb perfschema.dml_file_instances perfschema.innodb_events_transactions_history_long perfschema.innodb_table_io perfschema.misc perfschema.misc_session_status perfschema.rollback_table_io perfschema.table_name perfschema.threads_innodb plugins.compression_load plugins.processlist sql_sequence.kill sql_sequence.mysqldump sql_sequence.read_only sql_sequence.rebuild stress.ddl_innodb The database might have been rebootstrapped in between. This just gives some idea how we might reproduce this.
            axel Axel Schwenke made changes -
            Attachment crash-11.4_g5.err [ 72799 ]
            axel Axel Schwenke made changes -
            Attachment crash-11.4_my.cnf [ 72800 ]
            axel Axel Schwenke added a comment -

            I have seen this type of crash for an OLTP benchmark of 11.4, commit be6d48fd538. I attach the errorlog and the my.cnf for the incident: crash-11.4_g5.err, crash-11.4_my.cnf

            That benchmark was originally done for MDEV-31273 (if it matters).

            axel Axel Schwenke added a comment - I have seen this type of crash for an OLTP benchmark of 11.4, commit be6d48fd538 . I attach the errorlog and the my.cnf for the incident: crash-11.4_g5.err , crash-11.4_my.cnf That benchmark was originally done for MDEV-31273 (if it matters).

            axel, thank you. I assume that your benchmark does not involve any crash recovery. Nevertheless, mleich mentioned that he was so far unable to reproduce this by running some 200 instances of the test stress.ddl_innodb concurrently. I see that crash-11.4_my.cnf is specifying the MDEV-29986 default value innodb_undo_tablespaces=3. The start LSN in crash-11.4_g5.err corresponds to starting up right after a server bootstrap. That is, we can exclude any involvement of the MDEV-19229 or MDEV-14795 logic here.

            marko Marko Mäkelä added a comment - axel , thank you. I assume that your benchmark does not involve any crash recovery . Nevertheless, mleich mentioned that he was so far unable to reproduce this by running some 200 instances of the test stress.ddl_innodb concurrently. I see that crash-11.4_my.cnf is specifying the MDEV-29986 default value innodb_undo_tablespaces=3 . The start LSN in crash-11.4_g5.err corresponds to starting up right after a server bootstrap. That is, we can exclude any involvement of the MDEV-19229 or MDEV-14795 logic here.
            marko Marko Mäkelä added a comment - - edited

            By luck, I hit a debug assertion failure on a 10.6 based build:

            mariadbd: /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1787: ulint buf_flush_LRU(ulint, bool): Assertion `((&(&buf_pool.mutex)->m_mutex)->count > 0 && pthread_equal(pthread_self(), (&(&buf_pool.mutex)->m_mutex)->thread))' failed.
            

            The fix should be simple:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index f07ecedf535..df976f55901 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -2511,6 +2511,8 @@ static void buf_flush_page_cleaner()
                 }
                 else if (buf_flush_async_lsn <= oldest_lsn)
                   goto check_oldest_and_set_idle;
            +    else
            +      mysql_mutex_lock(&buf_pool.mutex);
             
                 n= n >= n_flushed ? n - n_flushed : 0;
                 goto LRU_flush;
            

            This bug had been introduced in MDEV-26827. Apparently, this branch before the goto LRU_flush is very rarely executed in our tests with debug instrumented builds. It remains to be seen if the corruption would be fixed by this.

            marko Marko Mäkelä added a comment - - edited By luck, I hit a debug assertion failure on a 10.6 based build: mariadbd: /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1787: ulint buf_flush_LRU(ulint, bool): Assertion `((&(&buf_pool.mutex)->m_mutex)->count > 0 && pthread_equal(pthread_self(), (&(&buf_pool.mutex)->m_mutex)->thread))' failed. The fix should be simple: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index f07ecedf535..df976f55901 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -2511,6 +2511,8 @@ static void buf_flush_page_cleaner() } else if (buf_flush_async_lsn <= oldest_lsn) goto check_oldest_and_set_idle; + else + mysql_mutex_lock(&buf_pool.mutex); n= n >= n_flushed ? n - n_flushed : 0; goto LRU_flush; This bug had been introduced in MDEV-26827 . Apparently, this branch before the goto LRU_flush is very rarely executed in our tests with debug instrumented builds. It remains to be seen if the corruption would be fixed by this.

            To reach the code in my above patch, I think that we need innodb_adaptive_flushing=ON (which is the default) and a rather large innodb_log_file_size (not so much flushing pressure due to the checkpoint age getting close to the log capacity). I think that for testing purposes, it would be good to issue a message to the error log to prove that the code is being covered during the workload:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index f07ecedf535..8d6f2e96b5e 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -2511,6 +2511,11 @@ static void buf_flush_page_cleaner()
                 }
                 else if (buf_flush_async_lsn <= oldest_lsn)
                   goto check_oldest_and_set_idle;
            +    else
            +    {
            +      ib::info() << "MDEV-33189";
            +      mysql_mutex_lock(&buf_pool.mutex);
            +    }
             
                 n= n >= n_flushed ? n - n_flushed : 0;
                 goto LRU_flush;
            

            Also, users who are affected by this could try if setting innodb_adaptive_flushing=OFF makes the corruption go away.

            marko Marko Mäkelä added a comment - To reach the code in my above patch, I think that we need innodb_adaptive_flushing=ON (which is the default) and a rather large innodb_log_file_size (not so much flushing pressure due to the checkpoint age getting close to the log capacity). I think that for testing purposes, it would be good to issue a message to the error log to prove that the code is being covered during the workload: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index f07ecedf535..8d6f2e96b5e 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -2511,6 +2511,11 @@ static void buf_flush_page_cleaner() } else if (buf_flush_async_lsn <= oldest_lsn) goto check_oldest_and_set_idle; + else + { + ib::info() << "MDEV-33189" ; + mysql_mutex_lock(&buf_pool.mutex); + } n= n >= n_flushed ? n - n_flushed : 0; goto LRU_flush; Also, users who are affected by this could try if setting innodb_adaptive_flushing=OFF makes the corruption go away.
            marko Marko Mäkelä added a comment - - edited

            I retested a CMAKE_BUILD_TYPE=RelWithDebInfo build of the same 11.2 commit e4cb1e3295f7e6f0e5287d97884d6149a2390d22 with the following patch:

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index ebcf430d532..311ccdff802 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -2608,6 +2608,11 @@ static void buf_flush_page_cleaner()
                 }
                 else if (buf_flush_async_lsn <= oldest_lsn)
                   goto check_oldest_and_set_idle;
            +    else
            +    {
            +      abort();
            +      mysql_mutex_lock(&buf_pool.mutex);
            +    }
             
                 n= n >= n_flushed ? n - n_flushed : 0;
                 goto LRU_flush;
            diff --git a/storage/innobase/include/ut0new.h b/storage/innobase/include/ut0new.h
            index f4183e4c61a..85c2f662760 100644
            --- a/storage/innobase/include/ut0new.h
            +++ b/storage/innobase/include/ut0new.h
            @@ -234,7 +234,7 @@ struct ut_new_pfx_t {
             #endif
             };
             
            -#if defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
            +#if 0 && defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
             static inline void ut_dontdump(void *ptr, size_t m_size, bool dontdump)
             {
             	ut_a(ptr != NULL);
            

            The first hunk would cause the server to crash if the code line was executed. The second hunk would ensure that any core dumps will include a copy of the buffer pool.

            On the first run of the default test suites, there was no crash. I then started 60 concurrent runs of the test stress.ddl_innodb. The code was unreachable with the first run with the mtr default innodb_log_file_size=10m. Second attempt: innodb_log_file_size=100m. Third attempt: additionally set innodb_buffer_pool_size=5m (instead of the mtr default innodb_buffer_pool_size=8m). Fourth attempt: additionally set innodb_io_capacity=10000. Fifth attempt: changed innodb_buffer_pool_size=100m. All attempts to cover this piece of code were unsuccessful.

            I think that we will need a longer-running, DML-heavy test to cover this piece of code. The stress.ddl_innodb only runs for 160 seconds (less than 3 minutes).

            The test atomic.rename_table, which is where I observed the debug assertion failure that pointed to this mutex, is not usable in this build, because depends on debug injection.

            marko Marko Mäkelä added a comment - - edited I retested a CMAKE_BUILD_TYPE=RelWithDebInfo build of the same 11.2 commit e4cb1e3295f7e6f0e5287d97884d6149a2390d22 with the following patch: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index ebcf430d532..311ccdff802 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -2608,6 +2608,11 @@ static void buf_flush_page_cleaner() } else if (buf_flush_async_lsn <= oldest_lsn) goto check_oldest_and_set_idle; + else + { + abort(); + mysql_mutex_lock(&buf_pool.mutex); + } n= n >= n_flushed ? n - n_flushed : 0; goto LRU_flush; diff --git a/storage/innobase/include/ut0new.h b/storage/innobase/include/ut0new.h index f4183e4c61a..85c2f662760 100644 --- a/storage/innobase/include/ut0new.h +++ b/storage/innobase/include/ut0new.h @@ -234,7 +234,7 @@ struct ut_new_pfx_t { #endif }; -#if defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP) +#if 0 && defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP) static inline void ut_dontdump(void *ptr, size_t m_size, bool dontdump) { ut_a(ptr != NULL); The first hunk would cause the server to crash if the code line was executed. The second hunk would ensure that any core dumps will include a copy of the buffer pool. On the first run of the default test suites, there was no crash. I then started 60 concurrent runs of the test stress.ddl_innodb . The code was unreachable with the first run with the mtr default innodb_log_file_size=10m . Second attempt: innodb_log_file_size=100m . Third attempt: additionally set innodb_buffer_pool_size=5m (instead of the mtr default innodb_buffer_pool_size=8m ). Fourth attempt: additionally set innodb_io_capacity=10000 . Fifth attempt: changed innodb_buffer_pool_size=100m . All attempts to cover this piece of code were unsuccessful. I think that we will need a longer-running, DML-heavy test to cover this piece of code. The stress.ddl_innodb only runs for 160 seconds (less than 3 minutes). The test atomic.rename_table , which is where I observed the debug assertion failure that pointed to this mutex, is not usable in this build, because depends on debug injection.
            marko Marko Mäkelä made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.0 [ 28320 ]
            Fix Version/s 11.1 [ 28549 ]
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.4 [ 29301 ]
            Affects Version/s 11.2 [ 28603 ]
            Affects Version/s 11.4 [ 29301 ]
            Priority Major [ 3 ] Blocker [ 1 ]
            marko Marko Mäkelä made changes -

            I filed MDEV-33275 for the observed bug that might explain this. We do not have enough data to say if it really fixes this.

            marko Marko Mäkelä added a comment - I filed MDEV-33275 for the observed bug that might explain this. We do not have enough data to say if it really fixes this.
            marko Marko Mäkelä made changes -
            Priority Blocker [ 1 ] Critical [ 2 ]

            On our CI systems, the test stress.ddl_innodb occasionally fails. The first failure that loosely matches the symptoms here occurred 3 months after MDEV-26827 had been pushed:

            bb-11.2-MDEV-5816 ffb445d2b9337478b8bd750b06dac7336983503d

            stress.ddl_innodb 'innodb'               w9 [ fail ]
                    Test ended at 2023-06-14 04:49:36
             
            CURRENT_TEST: stress.ddl_innodb
            mysqltest: In included file "./suite/stress/include/ddl4.inc": 
            included from /home/buildbot/aarch64-fedora-37/build/mysql-test/suite/stress/t/ddl_innodb.test at line 41:
            At line 331: query 'EXECUTE create_table1' failed: <Unknown> (2013): Lost connection to server during query
            Server log from this test:
            ----------SERVER LOG START-----------
            2023-06-14 04:49:35 0xffff9c218060  InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-37/build/storage/innobase/trx/trx0purge.cc line 354
            InnoDB: Failing assertion: flst_add_first(rseg_header, TRX_RSEG + TRX_RSEG_HISTORY, undo_page, uint16_t(page_offset(undo_header) + TRX_UNDO_HISTORY_NODE), mtr) == DB_SUCCESS
            

            marko Marko Mäkelä added a comment - On our CI systems, the test stress.ddl_innodb occasionally fails. The first failure that loosely matches the symptoms here occurred 3 months after MDEV-26827 had been pushed: bb-11.2-MDEV-5816 ffb445d2b9337478b8bd750b06dac7336983503d stress.ddl_innodb 'innodb' w9 [ fail ] Test ended at 2023-06-14 04:49:36   CURRENT_TEST: stress.ddl_innodb mysqltest: In included file "./suite/stress/include/ddl4.inc": included from /home/buildbot/aarch64-fedora-37/build/mysql-test/suite/stress/t/ddl_innodb.test at line 41: At line 331: query 'EXECUTE create_table1' failed: <Unknown> (2013): Lost connection to server during query … Server log from this test: ----------SERVER LOG START----------- 2023-06-14 04:49:35 0xffff9c218060 InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-37/build/storage/innobase/trx/trx0purge.cc line 354 InnoDB: Failing assertion: flst_add_first(rseg_header, TRX_RSEG + TRX_RSEG_HISTORY, undo_page, uint16_t(page_offset(undo_header) + TRX_UNDO_HISTORY_NODE), mtr) == DB_SUCCESS

            The fix of MDEV-33275 was included in 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3. Is this bug reproducible with those server versions?

            There also is the open bug MDEV-33363, which could potentially explain this.

            marko Marko Mäkelä added a comment - The fix of MDEV-33275 was included in 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3. Is this bug reproducible with those server versions? There also is the open bug MDEV-33363 , which could potentially explain this.
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] Needs Feedback [ 10501 ]
            serg Sergei Golubchik made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 11.0 [ 28320 ]
            Fix Version/s 11.1 [ 28549 ]
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.4 [ 29301 ]
            Resolution Incomplete [ 4 ]
            Status Needs Feedback [ 10501 ] Closed [ 6 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -

            Unfortunately, we still have some corruption going on. See the linked related tickets.

            marko Marko Mäkelä added a comment - Unfortunately, we still have some corruption going on. See the linked related tickets.
            Gavriil Papadakis Gavriil Papadakis made changes -

            People

              marko Marko Mäkelä
              wschemmel Wolfgang Schemmel
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.