[MDEV-33189] Server crash after reading outside of bounds on ibdata1 , file corrupted, no auto-recovery - Jira

Wolfgang Schemmel created issue - 2024-01-05 13:35

Wolfgang Schemmel made changes - 2024-01-05 13:36

Field	Original Value	New Value
Description	We experienced a one-time server crash in production, so far not reproducible. We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem. The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years. The server itself crashed with: {noformat} InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1 InnoDB: File './ibdata1' is corrupted {noformat} and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database. We did not attempt any forced recovery. The assertion failures: {noformat} InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416 InnoDB: Failing assertion: rollback 231221 14:24:48 [ERROR] mysqld got signal 6 ; {noformat} The backtrace only gave one line before having the next assertion failure. {noformat} stack_bottom = 0x7f614d088cd8 thread_stack 0x49000 InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285 InnoDB: Failing assertion: !block->page.in_file() {noformat} See attachment for all the relevant syslog entries.	We experienced a one-time server crash in production, so far not reproducible. We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem. The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years. The server itself crashed with: {noformat} InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1 InnoDB: File './ibdata1' is corrupted {noformat} and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database. We did not attempt any forced recovery. The assertion failures: {noformat} InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416 InnoDB: Failing assertion: rollback 231221 14:24:48 [ERROR] mysqld got signal 6 ; {noformat} The backtrace only gave one line before having the next assertion failure. {noformat} stack_bottom = 0x7f614d088cd8 thread_stack 0x49000 InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285 InnoDB: Failing assertion: !block->page.in_file() {noformat} See attachment [^db-syslog.2023-12-21.txt] for all the relevant syslog entries.

Wolfgang Schemmel made changes - 2024-01-05 13:38

Description

We experienced a one-time server crash in production, so far not reproducible.

We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem.

The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

The server itself crashed with:

{noformat}
InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
InnoDB: File './ibdata1' is corrupted
{noformat}

and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database.

We did not attempt any forced recovery. The assertion failures:

{noformat}
InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
InnoDB: Failing assertion: rollback
231221 14:24:48 [ERROR] mysqld got signal 6 ;
{noformat}

The backtrace only gave one line before having the next assertion failure.

{noformat}
stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
InnoDB: Failing assertion: !block->page.in_file()
{noformat}

See attachment [^db-syslog.2023-12-21.txt] for all the relevant syslog entries.

We experienced a one-time server crash in production, so far not reproducible.

We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11) installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye) as the database primary for a read- and write-heavy application. It runs on a bare-metal server Dell PowerEdge R750 with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM on a software RAID-1 NVMe with an XFS filesystem.

The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

The server itself crashed with:

{noformat}
InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
InnoDB: File './ibdata1' is corrupted
{noformat}

and two assertion failures in {{trx0undo.cc}} and {{buf0lru.cc}}. All subsequent restart attempts failed so we switched the application over to the replica database.

We did not attempt any forced recovery. The assertion failures:

{noformat}
InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
InnoDB: Failing assertion: rollback
231221 14:24:48 [ERROR] mysqld got signal 6 ;
{noformat}

The backtrace only gave one line before having the next assertion failure.

{noformat}
stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
InnoDB: Failing assertion: !block->page.in_file()
{noformat}

See attachment [^db-syslog.2023-12-21.txt] for all the relevant syslog entries.

We have preserved the corrupt 716 MiB {{ibdata1}} (750780416 B) file for further inspection, should the need arise.

Marko Mäkelä added a comment - 2024-01-05 13:56

The byte offset 70368744161280 corresponds to the page number 0xffffffff when using the default innodb_page_size=16k.

Now that ~~MDEV-13542~~ has been fixed, it should be relatively easy to fix the crash on this corrupted data, if you can provide access to the corrupted data directory. Hopefully you can reproduce this with a test workload that does not include any confidential data.

Without having access to the data directory, it is hard to say what could cause this corruption. Ever since ~~MDEV-24854~~ (making innodb_flush_method=O_DIRECT the default) was implemented in MariaDB Server 10.6, we get reports of corruption on some file systems. I do not remember if there were any reports of corruption on XFS yet. The Linux kernel version may be relevant. You could try if innodb_flush_method=fsync works better. If this build uses io_uring, then you might want to test innodb_use_native_aio=0 as well, to check if there could be some bug in io_uring that affects XFS.

Most of our internal testing takes place on Linux ext4 or tmpfs.

Marko Mäkelä added a comment - 2024-01-05 13:56 The byte offset 70368744161280 corresponds to the page number 0xffffffff when using the default innodb_page_size=16k . Now that MDEV-13542 has been fixed, it should be relatively easy to fix the crash on this corrupted data, if you can provide access to the corrupted data directory. Hopefully you can reproduce this with a test workload that does not include any confidential data. Without having access to the data directory, it is hard to say what could cause this corruption. Ever since MDEV-24854 (making innodb_flush_method=O_DIRECT the default) was implemented in MariaDB Server 10.6, we get reports of corruption on some file systems. I do not remember if there were any reports of corruption on XFS yet. The Linux kernel version may be relevant. You could try if innodb_flush_method=fsync works better. If this build uses io_uring , then you might want to test innodb_use_native_aio=0 as well, to check if there could be some bug in io_uring that affects XFS. Most of our internal testing takes place on Linux ext4 or tmpfs.

Marko Mäkelä made changes - 2024-01-05 13:56

Assignee		Marko Mäkelä [ marko ]
Status	Open [ 1 ]	Needs Feedback [ 10501 ]

Marko Mäkelä added a comment - 2024-01-10 12:39

I happened to reproduce this by running a CMAKE_BUILD_TYPE=RelWithDebInfo build:

st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22
CURRENT_TEST: stress.ddl_innodb
mysqltest: In included file "./suite/stress/include/ddl5.inc":
included from /mariadb/11.1/mysql-test/suite/stress/t/ddl_innodb.test at line 42:
At line 71: query '$drop_table' failed: <Unknown> (2013): Lost connection to server during query

In the server error log I can see the following:

st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22
CURRENT_TEST: stress.ddl_innodb
2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted
2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003
2024-01-10 14:17:44 0x7f7c7c2a56c0 InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263
InnoDB: Failing assertion: undo_page

Here, an assertion is failing during a DROP TABLE t1 operation that was rolled back, supposedly due to some corruption:

st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22
#7 0x0000559040129713 in ut_dbg_assertion_failed (expr=expr@entry=0x55903f35b378 "undo_page", file=file@entry=0x55903f3dced8 "/mariadb/11.1/storage/innobase/trx/trx0purge.cc", line=line@entry=263) at /mariadb/11.1/storage/innobase/ut/ut0dbg.cc:60
#8 0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263
#9 0x0000559040123d51 in trx_t::write_serialisation_history (this=this@entry=0x7f7c84988180, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1161
#10 0x0000559040122880 in trx_t::commit_low (this=this@entry=0x7f7c84988180, mtr=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1570
#11 0x000055904012019e in trx_t::commit_persist (this=this@entry=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1598
#12 0x00005590401201e8 in trx_t::commit (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1607
#13 0x000055904011a469 in trx_t::rollback_finish (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:65
#14 trx_t::rollback_low (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:132
#15 0x0000559040118db4 in trx_t::rollback (savept=0x0, this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:178
#16 trx_t::rollback (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:165
#17 0x000055903ff7898e in ha_innobase::delete_table (this=<optimized out>, name=<optimized out>) at /mariadb/11.1/storage/innobase/handler/ha_innodb.cc:13621
#18 0x000055903fa0d6bf in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /mariadb/11.1/sql/handler.cc:575
#19 0x000055903fa12b6e in ha_delete_table (thd=thd@entry=0x7f7c4c000c68, hton=hton@entry=0x5590416ed448, path=path@entry=0x7f7c7c2a3860 "./test/t1", db=db@entry=0x7f7c7c2a3650, alias=alias@entry=0x7f7c7c2a3660, generate_warning=generate_warning@entry=true) at /mariadb/11.1/sql/handler.cc:3208

Now that ~~MDEV-30100~~ has been fixed, I would not expect a rollback of a transaction to get here. But, I see that trx->undo_no is 1, that is, one undo log record of the transaction was not removed during the rollback. I will try to get some more information from this crash. My options are limited, because due to ~~MDEV-10814~~, the buffer pool is not included in the core dump.

Marko Mäkelä added a comment - 2024-01-10 12:39 I happened to reproduce this by running a CMAKE_BUILD_TYPE=RelWithDebInfo build: st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22 CURRENT_TEST: stress.ddl_innodb mysqltest: In included file "./suite/stress/include/ddl5.inc": included from /mariadb/11.1/mysql-test/suite/stress/t/ddl_innodb.test at line 42: At line 71: query '$drop_table' failed: <Unknown> (2013): Lost connection to server during query In the server error log I can see the following: st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22 CURRENT_TEST: stress.ddl_innodb 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 0x7f7c7c2a56c0 InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263 InnoDB: Failing assertion: undo_page Here, an assertion is failing during a DROP TABLE t1 operation that was rolled back, supposedly due to some corruption: st-11.2-merge e4cb1e3295f7e6f0e5287d97884d6149a2390d22 #7 0x0000559040129713 in ut_dbg_assertion_failed (expr=expr@entry=0x55903f35b378 "undo_page", file=file@entry=0x55903f3dced8 "/mariadb/11.1/storage/innobase/trx/trx0purge.cc", line=line@entry=263) at /mariadb/11.1/storage/innobase/ut/ut0dbg.cc:60 #8 0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263 #9 0x0000559040123d51 in trx_t::write_serialisation_history (this=this@entry=0x7f7c84988180, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1161 #10 0x0000559040122880 in trx_t::commit_low (this=this@entry=0x7f7c84988180, mtr=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1570 #11 0x000055904012019e in trx_t::commit_persist (this=this@entry=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1598 #12 0x00005590401201e8 in trx_t::commit (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0trx.cc:1607 #13 0x000055904011a469 in trx_t::rollback_finish (this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:65 #14 trx_t::rollback_low (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:132 #15 0x0000559040118db4 in trx_t::rollback (savept=0x0, this=0x7f7c84988180) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:178 #16 trx_t::rollback (this=this@entry=0x7f7c84988180, savept=savept@entry=0x0) at /mariadb/11.1/storage/innobase/trx/trx0roll.cc:165 #17 0x000055903ff7898e in ha_innobase::delete_table (this=<optimized out>, name=<optimized out>) at /mariadb/11.1/storage/innobase/handler/ha_innodb.cc:13621 #18 0x000055903fa0d6bf in hton_drop_table (hton=<optimized out>, path=<optimized out>) at /mariadb/11.1/sql/handler.cc:575 #19 0x000055903fa12b6e in ha_delete_table (thd=thd@entry=0x7f7c4c000c68, hton=hton@entry=0x5590416ed448, path=path@entry=0x7f7c7c2a3860 "./test/t1", db=db@entry=0x7f7c7c2a3650, alias=alias@entry=0x7f7c7c2a3660, generate_warning=generate_warning@entry=true) at /mariadb/11.1/sql/handler.cc:3208 Now that MDEV-30100 has been fixed, I would not expect a rollback of a transaction to get here. But, I see that trx->undo_no is 1, that is, one undo log record of the transaction was not removed during the rollback. I will try to get some more information from this crash. My options are limited, because due to MDEV-10814 , the buffer pool is not included in the core dump.

Marko Mäkelä made changes - 2024-01-10 12:39

Status

Needs Feedback [ 10501 ]

Open [ 1 ]

Marko Mäkelä added a comment - 2024-01-10 13:22

Here is some more information:

#8  0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263

263	  ut_a(undo_page);

(gdb) p *undo

$5 = {id = 0, state = 1, trx_id = 448556, xid = {formatID = -1, gtrid_length = 0, bqual_length = 0, data = '\000' <repeats 127 times>}, dict_operation = true, rseg = 0x5590403db9c0 <trx_sys+22016>,

  hdr_page_no = 4294967295, last_page_no = 4294967295, hdr_offset = 86, size = 1, top_page_no = 4294967295, top_offset = 272, top_undo_no = 0, guess_block = 0x7f7c84085aa0, undo_list = {prev = 0x0, next = 0x0}}

(gdb) p trx.id

$6 = 448556

(gdb) p *undo.rseg

$7 = {space = 0x5590417c3b90, latch = {writer = {lock = std::atomic<unsigned int> = { 2147483649 }}, readers = std::atomic<unsigned int> = { 2147483648 }}, page_no = 16, history_size = 0, needs_purge = 448557,

  ref = std::atomic<unsigned int> = { 2 }, curr_size = 2, undo_list = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list}, undo_cached = {count = 0, start = 0x0, end = 0x0,

    node = &trx_undo_t::undo_list}, last_page_no = 4294967295, last_commit_and_offset = 24206847997564973}

Based on this, I checked the rollback segment header page 16 (undo->rseg->space->id==3, pointing to the file undo003):

od -Ax -t x1 -j 0x40000 -N 0x4000 var/log/stress.ddl_innodb-innodb/mysqld.1/data/undo003

040000 00 00 00 00 00 00 00 10 ff ff ff ff ff ff ff ff

040010 00 00 00 00 0f 59 6a 41 00 06 00 00 00 00 00 00

040020 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00

040030 00 00 ff ff ff ff 00 00 ff ff ff ff 00 00 00 00

040040 00 03 00 00 00 02 09 f2 ff ff ff ff ff ff ff ff

040050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

041040 ff ff ff ff ff ff ff ff 00 00 00 00 00 06 d5 4d

041050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

043ff0 00 00 00 00 00 00 00 00 0f 59 6a 41 34 3d 8c 83

We can see that most page numbers really are FIL_NULL or 0xffffffff in this page.

Because this test is not killing or restarting the server, I think that we must look at the preceding tests that did, and possibly other tests that could have corrupted the data directory even earlier:

CURRENT_TEST: main.xa

CURRENT_TEST: main.xtradb_mrr

CURRENT_TEST: merge.merge_innodb

CURRENT_TEST: perfschema.dml_file_instances

CURRENT_TEST: perfschema.innodb_events_transactions_history_long

CURRENT_TEST: perfschema.innodb_table_io

CURRENT_TEST: perfschema.misc

CURRENT_TEST: perfschema.misc_session_status

CURRENT_TEST: perfschema.rollback_table_io

CURRENT_TEST: perfschema.table_name

CURRENT_TEST: perfschema.threads_innodb

CURRENT_TEST: plugins.compression_load

CURRENT_TEST: plugins.processlist

CURRENT_TEST: sql_sequence.kill

CURRENT_TEST: sql_sequence.mysqldump

CURRENT_TEST: sql_sequence.read_only

2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown

2024-01-10 14:16:20 0 [Note] InnoDB: FTS optimize thread exiting.

2024-01-10 14:16:20 0 [Note] InnoDB: Starting shutdown...

…

2024-01-10 14:16:20 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360

2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd: Shutdown complete

…

2024-01-10 14:16:21 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1360

…

2024-01-10 14:16:22 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360

…

2024-01-10 14:16:23 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1361

…

2024-01-10 14:16:23 0 [Note] /dev/shm/11.2/sql/mariadbd: ready for connections.

Version: '11.2.3-MariaDB-log'  socket: '/dev/shm/11.2/mysql-test/var/tmp/51/mysqld.1.sock'  port: 16940  Source distribution

CURRENT_TEST: sql_sequence.rebuild

CURRENT_TEST: stress.ddl_innodb

2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003

2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted

2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003

2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003

2024-01-10 14:17:44 0x7f7c7c2a56c0  InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263

InnoDB: Failing assertion: undo_page

The full list of tests that were possibly run on this data directory was as follows.

binlog_encryption.rpl_stm_relay_ign_space

encryption.tempfiles

encryption.innodb-spatial-index

encryption.innodb-compressed-blob

rpl.rpl_partition_innodb

rpl.rpl_partition_memory

rpl.rpl_partition_myisam

rpl.rpl_stm_lcase_tblnames

rpl.rpl_multi_update2

rpl.rpl_gtid_master_promote

rpl.rpl_start_alter_chain_basic

main.ipv4_as_ipv6

binlog.binlog_mysqlbinlog-cp932

binlog.flashback-largebinlog

binlog.binlog_mysqlbinlog_row

binlog.binlog_mysqlbinlog_row_myisam

binlog.binlog_row_ctype_ucs

binlog.binlog_row_insert_select

binlog.binlog_row_mysqlbinlog_options

binlog.binlog_stm_ctype_ucs

binlog.binlog_stm_insert_select

binlog.binlog_stm_row

binlog.binlog_tmp_table

binlog.binlog_tmp_table_row

main.ctype_cp932_binlog_row

main.mysqlbinlog_row_compressed

mariabackup.binlog

innodb.restart

innodb_zip.innochecksum_3

innodb.encryption_threads_shutdown

innodb.innodb-mdev-7513

sys_vars.innodb_write_io_threads_basic

innodb.innodb_autoinc_lock_mode_zero

main.statistics_json

main.type_blob

main.type_num_innodb

main.type_temporal_innodb

main.type_time_hires

main.type_timestamp_hires

main.union_innodb

main.unique

main.unsafe_binlog_innodb

main.update

main.update_innodb

main.xa

main.xtradb_mrr

merge.merge_innodb

perfschema.dml_file_instances

perfschema.innodb_events_transactions_history_long

perfschema.innodb_table_io

perfschema.misc

perfschema.misc_session_status

perfschema.rollback_table_io

perfschema.table_name

perfschema.threads_innodb

plugins.compression_load

plugins.processlist

sql_sequence.kill

sql_sequence.mysqldump

sql_sequence.read_only

sql_sequence.rebuild

stress.ddl_innodb

stress.ddl_innodb

binlog_encryption.rpl_stm_relay_ign_space

encryption.tempfiles

encryption.innodb-spatial-index

encryption.innodb-compressed-blob

rpl.rpl_partition_innodb

rpl.rpl_partition_memory

rpl.rpl_partition_myisam

rpl.rpl_stm_lcase_tblnames

rpl.rpl_multi_update2

rpl.rpl_gtid_master_promote

rpl.rpl_start_alter_chain_basic

main.ipv4_as_ipv6

binlog.binlog_mysqlbinlog-cp932

binlog.flashback-largebinlog

binlog.binlog_mysqlbinlog_row

binlog.binlog_mysqlbinlog_row_myisam

binlog.binlog_row_ctype_ucs

binlog.binlog_row_insert_select

binlog.binlog_row_mysqlbinlog_options

binlog.binlog_stm_ctype_ucs

binlog.binlog_stm_insert_select

binlog.binlog_stm_row

binlog.binlog_tmp_table

binlog.binlog_tmp_table_row

main.ctype_cp932_binlog_row

main.mysqlbinlog_row_compressed

mariabackup.binlog

innodb.restart

innodb_zip.innochecksum_3

innodb.encryption_threads_shutdown

innodb.innodb-mdev-7513

sys_vars.innodb_write_io_threads_basic

innodb.innodb_autoinc_lock_mode_zero

main.statistics_json

main.type_blob

main.type_num_innodb

main.type_temporal_innodb

main.type_time_hires

main.type_timestamp_hires

main.union_innodb

main.unique

main.unsafe_binlog_innodb

main.update

main.update_innodb

main.xa

main.xtradb_mrr

merge.merge_innodb

perfschema.dml_file_instances

perfschema.innodb_events_transactions_history_long

perfschema.innodb_table_io

perfschema.misc

perfschema.misc_session_status

perfschema.rollback_table_io

perfschema.table_name

perfschema.threads_innodb

plugins.compression_load

plugins.processlist

sql_sequence.kill

sql_sequence.mysqldump

sql_sequence.read_only

sql_sequence.rebuild

stress.ddl_innodb

binlog_encryption.rpl_stm_relay_ign_space

rpl.rpl_partition_innodb

rpl.rpl_partition_memory

rpl.rpl_partition_myisam

rpl.rpl_stm_lcase_tblnames

rpl.rpl_multi_update2

rpl.rpl_gtid_master_promote

rpl.rpl_start_alter_chain_basic

rpl.rpl_gtid_master_promote

rpl.rpl_start_alter_chain_basic

rpl.rpl_gtid_master_promote

rpl.rpl_start_alter_chain_basic

rpl.rpl_gtid_master_promote

binlog_encryption.rpl_stm_relay_ign_space

encryption.tempfiles

encryption.innodb-spatial-index

encryption.innodb-compressed-blob

rpl.rpl_partition_innodb

rpl.rpl_partition_memory

rpl.rpl_partition_myisam

rpl.rpl_stm_lcase_tblnames

rpl.rpl_multi_update2

rpl.rpl_gtid_master_promote

rpl.rpl_start_alter_chain_basic

main.ipv4_as_ipv6

binlog.binlog_mysqlbinlog-cp932

binlog.flashback-largebinlog

binlog.binlog_mysqlbinlog_row

binlog.binlog_mysqlbinlog_row_myisam

binlog.binlog_row_ctype_ucs

binlog.binlog_row_insert_select

binlog.binlog_row_mysqlbinlog_options

binlog.binlog_stm_ctype_ucs

binlog.binlog_stm_insert_select

binlog.binlog_stm_row

binlog.binlog_tmp_table

binlog.binlog_tmp_table_row

main.ctype_cp932_binlog_row

main.mysqlbinlog_row_compressed

mariabackup.binlog

innodb.restart

innodb_zip.innochecksum_3

innodb.encryption_threads_shutdown

innodb.innodb-mdev-7513

sys_vars.innodb_write_io_threads_basic

innodb.innodb_autoinc_lock_mode_zero

main.statistics_json

main.type_blob

main.type_num_innodb

main.type_temporal_innodb

main.type_time_hires

main.type_timestamp_hires

main.union_innodb

main.unique

main.unsafe_binlog_innodb

main.update

main.update_innodb

main.xa

main.xtradb_mrr

merge.merge_innodb

perfschema.dml_file_instances

perfschema.innodb_events_transactions_history_long

perfschema.innodb_table_io

perfschema.misc

perfschema.misc_session_status

perfschema.rollback_table_io

perfschema.table_name

perfschema.threads_innodb

plugins.compression_load

plugins.processlist

sql_sequence.kill

sql_sequence.mysqldump

sql_sequence.read_only

sql_sequence.rebuild

stress.ddl_innodb

The database might have been rebootstrapped in between. This just gives some idea how we might reproduce this.

Marko Mäkelä added a comment - 2024-01-10 13:22 Here is some more information: #8 0x000055904010cdaf in trx_purge_add_undo_to_history (trx=trx@entry=0x7f7c84988180, undo=@0x7f7c84988ac0: 0x7f7c4c1876d0, mtr=mtr@entry=0x7f7c7c2a2300) at /mariadb/11.1/storage/innobase/trx/trx0purge.cc:263 263 ut_a(undo_page); (gdb) p *undo $5 = {id = 0, state = 1, trx_id = 448556, xid = {formatID = -1, gtrid_length = 0, bqual_length = 0, data = '\000' <repeats 127 times>}, dict_operation = true, rseg = 0x5590403db9c0 <trx_sys+22016>, hdr_page_no = 4294967295, last_page_no = 4294967295, hdr_offset = 86, size = 1, top_page_no = 4294967295, top_offset = 272, top_undo_no = 0, guess_block = 0x7f7c84085aa0, undo_list = {prev = 0x0, next = 0x0}} (gdb) p trx.id $6 = 448556 (gdb) p *undo.rseg $7 = {space = 0x5590417c3b90, latch = {writer = {lock = std::atomic<unsigned int> = { 2147483649 }}, readers = std::atomic<unsigned int> = { 2147483648 }}, page_no = 16, history_size = 0, needs_purge = 448557, ref = std::atomic<unsigned int> = { 2 }, curr_size = 2, undo_list = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list}, undo_cached = {count = 0, start = 0x0, end = 0x0, node = &trx_undo_t::undo_list}, last_page_no = 4294967295, last_commit_and_offset = 24206847997564973} Based on this, I checked the rollback segment header page 16 ( undo->rseg->space->id==3 , pointing to the file undo003 ): od -Ax -t x1 -j 0x40000 -N 0x4000 var/log/stress.ddl_innodb-innodb/mysqld.1/data/undo003 040000 00 00 00 00 00 00 00 10 ff ff ff ff ff ff ff ff 040010 00 00 00 00 0f 59 6a 41 00 06 00 00 00 00 00 00 040020 00 00 00 00 00 03 00 00 00 00 00 00 00 00 00 00 040030 00 00 ff ff ff ff 00 00 ff ff ff ff 00 00 00 00 040040 00 03 00 00 00 02 09 f2 ff ff ff ff ff ff ff ff 040050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff * 041040 ff ff ff ff ff ff ff ff 00 00 00 00 00 06 d5 4d 041050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 * 043ff0 00 00 00 00 00 00 00 00 0f 59 6a 41 34 3d 8c 83 044000 We can see that most page numbers really are FIL_NULL or 0xffffffff in this page. Because this test is not killing or restarting the server, I think that we must look at the preceding tests that did, and possibly other tests that could have corrupted the data directory even earlier: CURRENT_TEST: main.xa CURRENT_TEST: main.xtradb_mrr CURRENT_TEST: merge.merge_innodb CURRENT_TEST: perfschema.dml_file_instances CURRENT_TEST: perfschema.innodb_events_transactions_history_long CURRENT_TEST: perfschema.innodb_table_io CURRENT_TEST: perfschema.misc CURRENT_TEST: perfschema.misc_session_status CURRENT_TEST: perfschema.rollback_table_io CURRENT_TEST: perfschema.table_name CURRENT_TEST: perfschema.threads_innodb CURRENT_TEST: plugins.compression_load CURRENT_TEST: plugins.processlist CURRENT_TEST: sql_sequence.kill CURRENT_TEST: sql_sequence.mysqldump CURRENT_TEST: sql_sequence.read_only 2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown 2024-01-10 14:16:20 0 [Note] InnoDB: FTS optimize thread exiting. 2024-01-10 14:16:20 0 [Note] InnoDB: Starting shutdown... … 2024-01-10 14:16:20 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360 2024-01-10 14:16:20 0 [Note] /dev/shm/11.2/sql/mariadbd: Shutdown complete … 2024-01-10 14:16:21 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1360 … 2024-01-10 14:16:22 0 [Note] InnoDB: Shutdown completed; log sequence number 3071905; transaction id 1360 … 2024-01-10 14:16:23 0 [Note] InnoDB: log sequence number 3071905 (memory-mapped); transaction id 1361 … 2024-01-10 14:16:23 0 [Note] /dev/shm/11.2/sql/mariadbd: ready for connections. Version: '11.2.3-MariaDB-log' socket: '/dev/shm/11.2/mysql-test/var/tmp/51/mysqld.1.sock' port: 16940 Source distribution CURRENT_TEST: sql_sequence.rebuild CURRENT_TEST: stress.ddl_innodb 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: File './/undo003' is corrupted 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo003 2024-01-10 14:17:44 0x7f7c7c2a56c0 InnoDB: Assertion failure in file /mariadb/11.1/storage/innobase/trx/trx0purge.cc line 263 InnoDB: Failing assertion: undo_page The full list of tests that were possibly run on this data directory was as follows. binlog_encryption.rpl_stm_relay_ign_space encryption.tempfiles encryption.innodb-spatial-index encryption.innodb-compressed-blob rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic main.ipv4_as_ipv6 binlog.binlog_mysqlbinlog-cp932 binlog.flashback-largebinlog binlog.binlog_mysqlbinlog_row binlog.binlog_mysqlbinlog_row_myisam binlog.binlog_row_ctype_ucs binlog.binlog_row_insert_select binlog.binlog_row_mysqlbinlog_options binlog.binlog_stm_ctype_ucs binlog.binlog_stm_insert_select binlog.binlog_stm_row binlog.binlog_tmp_table binlog.binlog_tmp_table_row main.ctype_cp932_binlog_row main.mysqlbinlog_row_compressed mariabackup.binlog innodb.restart innodb_zip.innochecksum_3 innodb.encryption_threads_shutdown innodb.innodb-mdev-7513 sys_vars.innodb_write_io_threads_basic innodb.innodb_autoinc_lock_mode_zero main.statistics_json main.type_blob main.type_num_innodb main.type_temporal_innodb main.type_time_hires main.type_timestamp_hires main.union_innodb main.unique main.unsafe_binlog_innodb main.update main.update_innodb main.xa main.xtradb_mrr merge.merge_innodb perfschema.dml_file_instances perfschema.innodb_events_transactions_history_long perfschema.innodb_table_io perfschema.misc perfschema.misc_session_status perfschema.rollback_table_io perfschema.table_name perfschema.threads_innodb plugins.compression_load plugins.processlist sql_sequence.kill sql_sequence.mysqldump sql_sequence.read_only sql_sequence.rebuild stress.ddl_innodb stress.ddl_innodb binlog_encryption.rpl_stm_relay_ign_space encryption.tempfiles encryption.innodb-spatial-index encryption.innodb-compressed-blob rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic main.ipv4_as_ipv6 binlog.binlog_mysqlbinlog-cp932 binlog.flashback-largebinlog binlog.binlog_mysqlbinlog_row binlog.binlog_mysqlbinlog_row_myisam binlog.binlog_row_ctype_ucs binlog.binlog_row_insert_select binlog.binlog_row_mysqlbinlog_options binlog.binlog_stm_ctype_ucs binlog.binlog_stm_insert_select binlog.binlog_stm_row binlog.binlog_tmp_table binlog.binlog_tmp_table_row main.ctype_cp932_binlog_row main.mysqlbinlog_row_compressed mariabackup.binlog innodb.restart innodb_zip.innochecksum_3 innodb.encryption_threads_shutdown innodb.innodb-mdev-7513 sys_vars.innodb_write_io_threads_basic innodb.innodb_autoinc_lock_mode_zero main.statistics_json main.type_blob main.type_num_innodb main.type_temporal_innodb main.type_time_hires main.type_timestamp_hires main.union_innodb main.unique main.unsafe_binlog_innodb main.update main.update_innodb main.xa main.xtradb_mrr merge.merge_innodb perfschema.dml_file_instances perfschema.innodb_events_transactions_history_long perfschema.innodb_table_io perfschema.misc perfschema.misc_session_status perfschema.rollback_table_io perfschema.table_name perfschema.threads_innodb plugins.compression_load plugins.processlist sql_sequence.kill sql_sequence.mysqldump sql_sequence.read_only sql_sequence.rebuild stress.ddl_innodb binlog_encryption.rpl_stm_relay_ign_space rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic rpl.rpl_gtid_master_promote binlog_encryption.rpl_stm_relay_ign_space encryption.tempfiles encryption.innodb-spatial-index encryption.innodb-compressed-blob rpl.rpl_partition_innodb rpl.rpl_partition_memory rpl.rpl_partition_myisam rpl.rpl_stm_lcase_tblnames rpl.rpl_multi_update2 rpl.rpl_gtid_master_promote rpl.rpl_start_alter_chain_basic main.ipv4_as_ipv6 binlog.binlog_mysqlbinlog-cp932 binlog.flashback-largebinlog binlog.binlog_mysqlbinlog_row binlog.binlog_mysqlbinlog_row_myisam binlog.binlog_row_ctype_ucs binlog.binlog_row_insert_select binlog.binlog_row_mysqlbinlog_options binlog.binlog_stm_ctype_ucs binlog.binlog_stm_insert_select binlog.binlog_stm_row binlog.binlog_tmp_table binlog.binlog_tmp_table_row main.ctype_cp932_binlog_row main.mysqlbinlog_row_compressed mariabackup.binlog innodb.restart innodb_zip.innochecksum_3 innodb.encryption_threads_shutdown innodb.innodb-mdev-7513 sys_vars.innodb_write_io_threads_basic innodb.innodb_autoinc_lock_mode_zero main.statistics_json main.type_blob main.type_num_innodb main.type_temporal_innodb main.type_time_hires main.type_timestamp_hires main.union_innodb main.unique main.unsafe_binlog_innodb main.update main.update_innodb main.xa main.xtradb_mrr merge.merge_innodb perfschema.dml_file_instances perfschema.innodb_events_transactions_history_long perfschema.innodb_table_io perfschema.misc perfschema.misc_session_status perfschema.rollback_table_io perfschema.table_name perfschema.threads_innodb plugins.compression_load plugins.processlist sql_sequence.kill sql_sequence.mysqldump sql_sequence.read_only sql_sequence.rebuild stress.ddl_innodb The database might have been rebootstrapped in between. This just gives some idea how we might reproduce this.

Axel Schwenke made changes - 2024-01-15 13:41

Attachment

crash-11.4_g5.err [ 72799 ]

Axel Schwenke made changes - 2024-01-15 13:41

Attachment

crash-11.4_my.cnf [ 72800 ]

Axel Schwenke added a comment - 2024-01-15 13:42

I have seen this type of crash for an OLTP benchmark of 11.4, commit be6d48fd538. I attach the errorlog and the my.cnf for the incident: crash-11.4_g5.err, crash-11.4_my.cnf

That benchmark was originally done for ~~MDEV-31273~~ (if it matters).

Axel Schwenke added a comment - 2024-01-15 13:42 I have seen this type of crash for an OLTP benchmark of 11.4, commit be6d48fd538 . I attach the errorlog and the my.cnf for the incident: crash-11.4_g5.err , crash-11.4_my.cnf That benchmark was originally done for MDEV-31273 (if it matters).

Marko Mäkelä added a comment - 2024-01-15 14:19

axel, thank you. I assume that your benchmark does not involve any crash recovery. Nevertheless, mleich mentioned that he was so far unable to reproduce this by running some 200 instances of the test stress.ddl_innodb concurrently. I see that crash-11.4_my.cnf is specifying the ~~MDEV-29986~~ default value innodb_undo_tablespaces=3. The start LSN in crash-11.4_g5.err corresponds to starting up right after a server bootstrap. That is, we can exclude any involvement of the ~~MDEV-19229~~ or ~~MDEV-14795~~ logic here.

Marko Mäkelä added a comment - 2024-01-15 14:19 axel , thank you. I assume that your benchmark does not involve any crash recovery . Nevertheless, mleich mentioned that he was so far unable to reproduce this by running some 200 instances of the test stress.ddl_innodb concurrently. I see that crash-11.4_my.cnf is specifying the MDEV-29986 default value innodb_undo_tablespaces=3 . The start LSN in crash-11.4_g5.err corresponds to starting up right after a server bootstrap. That is, we can exclude any involvement of the MDEV-19229 or MDEV-14795 logic here.

Marko Mäkelä added a comment - 2024-01-16 10:20 - edited

By luck, I hit a debug assertion failure on a 10.6 based build:

mariadbd: /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1787: ulint buf_flush_LRU(ulint, bool): Assertion `((&(&buf_pool.mutex)->m_mutex)->count > 0 && pthread_equal(pthread_self(), (&(&buf_pool.mutex)->m_mutex)->thread))' failed.

The fix should be simple:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc

index f07ecedf535..df976f55901 100644

--- a/storage/innobase/buf/buf0flu.cc

+++ b/storage/innobase/buf/buf0flu.cc

@@ -2511,6 +2511,8 @@ static void buf_flush_page_cleaner()

     else if (buf_flush_async_lsn <= oldest_lsn)

       goto check_oldest_and_set_idle;

+    else

+      mysql_mutex_lock(&buf_pool.mutex);

     n= n >= n_flushed ? n - n_flushed : 0;

     goto LRU_flush;

This bug had been introduced in ~~MDEV-26827~~. Apparently, this branch before the goto LRU_flush is very rarely executed in our tests with debug instrumented builds. It remains to be seen if the corruption would be fixed by this.

Marko Mäkelä added a comment - 2024-01-16 10:20 - edited By luck, I hit a debug assertion failure on a 10.6 based build: mariadbd: /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1787: ulint buf_flush_LRU(ulint, bool): Assertion `((&(&buf_pool.mutex)->m_mutex)->count > 0 && pthread_equal(pthread_self(), (&(&buf_pool.mutex)->m_mutex)->thread))' failed. The fix should be simple: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index f07ecedf535..df976f55901 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -2511,6 +2511,8 @@ static void buf_flush_page_cleaner() } else if (buf_flush_async_lsn <= oldest_lsn) goto check_oldest_and_set_idle; + else + mysql_mutex_lock(&buf_pool.mutex); n= n >= n_flushed ? n - n_flushed : 0; goto LRU_flush; This bug had been introduced in MDEV-26827 . Apparently, this branch before the goto LRU_flush is very rarely executed in our tests with debug instrumented builds. It remains to be seen if the corruption would be fixed by this.

Marko Mäkelä added a comment - 2024-01-17 06:09

To reach the code in my above patch, I think that we need innodb_adaptive_flushing=ON (which is the default) and a rather large innodb_log_file_size (not so much flushing pressure due to the checkpoint age getting close to the log capacity). I think that for testing purposes, it would be good to issue a message to the error log to prove that the code is being covered during the workload:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc

index f07ecedf535..8d6f2e96b5e 100644

--- a/storage/innobase/buf/buf0flu.cc

+++ b/storage/innobase/buf/buf0flu.cc

@@ -2511,6 +2511,11 @@ static void buf_flush_page_cleaner()

     else if (buf_flush_async_lsn <= oldest_lsn)

       goto check_oldest_and_set_idle;

+    else

+    {

+      ib::info() << "MDEV-33189";

+      mysql_mutex_lock(&buf_pool.mutex);

+    }

     n= n >= n_flushed ? n - n_flushed : 0;

     goto LRU_flush;

Also, users who are affected by this could try if setting innodb_adaptive_flushing=OFF makes the corruption go away.

Marko Mäkelä added a comment - 2024-01-17 06:09 To reach the code in my above patch, I think that we need innodb_adaptive_flushing=ON (which is the default) and a rather large innodb_log_file_size (not so much flushing pressure due to the checkpoint age getting close to the log capacity). I think that for testing purposes, it would be good to issue a message to the error log to prove that the code is being covered during the workload: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index f07ecedf535..8d6f2e96b5e 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -2511,6 +2511,11 @@ static void buf_flush_page_cleaner() } else if (buf_flush_async_lsn <= oldest_lsn) goto check_oldest_and_set_idle; + else + { + ib::info() << "MDEV-33189" ; + mysql_mutex_lock(&buf_pool.mutex); + } n= n >= n_flushed ? n - n_flushed : 0; goto LRU_flush; Also, users who are affected by this could try if setting innodb_adaptive_flushing=OFF makes the corruption go away.

Marko Mäkelä added a comment - 2024-01-17 06:37 - edited

I retested a CMAKE_BUILD_TYPE=RelWithDebInfo build of the same 11.2 commit e4cb1e3295f7e6f0e5287d97884d6149a2390d22 with the following patch:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc

index ebcf430d532..311ccdff802 100644

--- a/storage/innobase/buf/buf0flu.cc

+++ b/storage/innobase/buf/buf0flu.cc

@@ -2608,6 +2608,11 @@ static void buf_flush_page_cleaner()

     else if (buf_flush_async_lsn <= oldest_lsn)

       goto check_oldest_and_set_idle;

+    else

+    {

+      abort();

+      mysql_mutex_lock(&buf_pool.mutex);

+    }

     n= n >= n_flushed ? n - n_flushed : 0;

     goto LRU_flush;

diff --git a/storage/innobase/include/ut0new.h b/storage/innobase/include/ut0new.h

index f4183e4c61a..85c2f662760 100644

--- a/storage/innobase/include/ut0new.h

+++ b/storage/innobase/include/ut0new.h

@@ -234,7 +234,7 @@ struct ut_new_pfx_t {

 #endif

};

-#if defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)

+#if 0 && defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)

 static inline void ut_dontdump(void *ptr, size_t m_size, bool dontdump)

 	ut_a(ptr != NULL);

The first hunk would cause the server to crash if the code line was executed. The second hunk would ensure that any core dumps will include a copy of the buffer pool.

On the first run of the default test suites, there was no crash. I then started 60 concurrent runs of the test stress.ddl_innodb. The code was unreachable with the first run with the mtr default innodb_log_file_size=10m. Second attempt: innodb_log_file_size=100m. Third attempt: additionally set innodb_buffer_pool_size=5m (instead of the mtr default innodb_buffer_pool_size=8m). Fourth attempt: additionally set innodb_io_capacity=10000. Fifth attempt: changed innodb_buffer_pool_size=100m. All attempts to cover this piece of code were unsuccessful.

I think that we will need a longer-running, DML-heavy test to cover this piece of code. The stress.ddl_innodb only runs for 160 seconds (less than 3 minutes).

The test atomic.rename_table, which is where I observed the debug assertion failure that pointed to this mutex, is not usable in this build, because depends on debug injection.

Marko Mäkelä added a comment - 2024-01-17 06:37 - edited I retested a CMAKE_BUILD_TYPE=RelWithDebInfo build of the same 11.2 commit e4cb1e3295f7e6f0e5287d97884d6149a2390d22 with the following patch: diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc index ebcf430d532..311ccdff802 100644 --- a/storage/innobase/buf/buf0flu.cc +++ b/storage/innobase/buf/buf0flu.cc @@ -2608,6 +2608,11 @@ static void buf_flush_page_cleaner() } else if (buf_flush_async_lsn <= oldest_lsn) goto check_oldest_and_set_idle; + else + { + abort(); + mysql_mutex_lock(&buf_pool.mutex); + } n= n >= n_flushed ? n - n_flushed : 0; goto LRU_flush; diff --git a/storage/innobase/include/ut0new.h b/storage/innobase/include/ut0new.h index f4183e4c61a..85c2f662760 100644 --- a/storage/innobase/include/ut0new.h +++ b/storage/innobase/include/ut0new.h @@ -234,7 +234,7 @@ struct ut_new_pfx_t { #endif }; -#if defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP) +#if 0 && defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP) static inline void ut_dontdump(void *ptr, size_t m_size, bool dontdump) { ut_a(ptr != NULL); The first hunk would cause the server to crash if the code line was executed. The second hunk would ensure that any core dumps will include a copy of the buffer pool. On the first run of the default test suites, there was no crash. I then started 60 concurrent runs of the test stress.ddl_innodb . The code was unreachable with the first run with the mtr default innodb_log_file_size=10m . Second attempt: innodb_log_file_size=100m . Third attempt: additionally set innodb_buffer_pool_size=5m (instead of the mtr default innodb_buffer_pool_size=8m ). Fourth attempt: additionally set innodb_io_capacity=10000 . Fifth attempt: changed innodb_buffer_pool_size=100m . All attempts to cover this piece of code were unsuccessful. I think that we will need a longer-running, DML-heavy test to cover this piece of code. The stress.ddl_innodb only runs for 160 seconds (less than 3 minutes). The test atomic.rename_table , which is where I observed the debug assertion failure that pointed to this mutex, is not usable in this build, because depends on debug injection.

Marko Mäkelä made changes - 2024-01-17 10:30

Fix Version/s		10.6 [ 24028 ]
Fix Version/s		10.11 [ 27614 ]
Fix Version/s		11.0 [ 28320 ]
Fix Version/s		11.1 [ 28549 ]
Fix Version/s		11.2 [ 28603 ]
Fix Version/s		11.3 [ 28565 ]
Fix Version/s		11.4 [ 29301 ]
Affects Version/s		11.2 [ 28603 ]
Affects Version/s		11.4 [ 29301 ]
Priority	Major [ 3 ]	Blocker [ 1 ]

Marko Mäkelä made changes - 2024-01-18 11:07

Link

This issue relates to ~~MDEV-33275~~ [ ~~MDEV-33275~~ ]

Marko Mäkelä added a comment - 2024-01-18 12:35

I filed ~~MDEV-33275~~ for the observed bug that might explain this. We do not have enough data to say if it really fixes this.

Marko Mäkelä added a comment - 2024-01-18 12:35 I filed MDEV-33275 for the observed bug that might explain this. We do not have enough data to say if it really fixes this.

Marko Mäkelä made changes - 2024-01-18 12:35

Priority

Blocker [ 1 ]

Critical [ 2 ]

Marko Mäkelä added a comment - 2024-01-18 17:21

On our CI systems, the test stress.ddl_innodb occasionally fails. The first failure that loosely matches the symptoms here occurred 3 months after ~~MDEV-26827~~ had been pushed:

bb-11.2-MDEV-5816 ffb445d2b9337478b8bd750b06dac7336983503d
stress.ddl_innodb 'innodb' w9 [ fail ]
Test ended at 2023-06-14 04:49:36

CURRENT_TEST: stress.ddl_innodb
mysqltest: In included file "./suite/stress/include/ddl4.inc":
included from /home/buildbot/aarch64-fedora-37/build/mysql-test/suite/stress/t/ddl_innodb.test at line 41:
At line 331: query 'EXECUTE create_table1' failed: <Unknown> (2013): Lost connection to server during query
…
Server log from this test:
----------SERVER LOG START-----------
2023-06-14 04:49:35 0xffff9c218060 InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-37/build/storage/innobase/trx/trx0purge.cc line 354
InnoDB: Failing assertion: flst_add_first(rseg_header, TRX_RSEG + TRX_RSEG_HISTORY, undo_page, uint16_t(page_offset(undo_header) + TRX_UNDO_HISTORY_NODE), mtr) == DB_SUCCESS

Marko Mäkelä added a comment - 2024-01-18 17:21 On our CI systems, the test stress.ddl_innodb occasionally fails. The first failure that loosely matches the symptoms here occurred 3 months after MDEV-26827 had been pushed: bb-11.2-MDEV-5816 ffb445d2b9337478b8bd750b06dac7336983503d stress.ddl_innodb 'innodb' w9 [ fail ] Test ended at 2023-06-14 04:49:36 CURRENT_TEST: stress.ddl_innodb mysqltest: In included file "./suite/stress/include/ddl4.inc": included from /home/buildbot/aarch64-fedora-37/build/mysql-test/suite/stress/t/ddl_innodb.test at line 41: At line 331: query 'EXECUTE create_table1' failed: <Unknown> (2013): Lost connection to server during query … Server log from this test: ----------SERVER LOG START----------- 2023-06-14 04:49:35 0xffff9c218060 InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-37/build/storage/innobase/trx/trx0purge.cc line 354 InnoDB: Failing assertion: flst_add_first(rseg_header, TRX_RSEG + TRX_RSEG_HISTORY, undo_page, uint16_t(page_offset(undo_header) + TRX_UNDO_HISTORY_NODE), mtr) == DB_SUCCESS

Marko Mäkelä added a comment - 2024-02-12 15:13

The fix of ~~MDEV-33275~~ was included in 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3. Is this bug reproducible with those server versions?

There also is the open bug ~~MDEV-33363~~, which could potentially explain this.

Marko Mäkelä added a comment - 2024-02-12 15:13 The fix of MDEV-33275 was included in 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3. Is this bug reproducible with those server versions? There also is the open bug MDEV-33363 , which could potentially explain this.

Marko Mäkelä made changes - 2024-02-12 15:13

Status

Open [ 1 ]

Needs Feedback [ 10501 ]

Sergei Golubchik made changes - 2024-03-18 13:28

Fix Version/s		N/A [ 14700 ]
Fix Version/s	10.6 [ 24028 ]
Fix Version/s	10.11 [ 27614 ]
Fix Version/s	11.0 [ 28320 ]
Fix Version/s	11.1 [ 28549 ]
Fix Version/s	11.3 [ 28565 ]
Fix Version/s	11.2 [ 28603 ]
Fix Version/s	11.4 [ 29301 ]
Resolution		Incomplete [ 4 ]
Status	Needs Feedback [ 10501 ]	Closed [ 6 ]

Marko Mäkelä made changes - 2024-05-24 11:41

Link

This issue relates to ~~MDEV-34233~~ [ ~~MDEV-34233~~ ]

Marko Mäkelä made changes - 2024-06-28 12:38

Link

This issue relates to ~~MDEV-34453~~ [ ~~MDEV-34453~~ ]

Marko Mäkelä made changes - 2024-06-28 12:38

Link

This issue relates to ~~MDEV-34233~~ [ ~~MDEV-34233~~ ]

Marko Mäkelä made changes - 2024-06-28 12:38

Link

This issue relates to ~~MDEV-33922~~ [ ~~MDEV-33922~~ ]

Marko Mäkelä made changes - 2024-06-28 12:38

Link

This issue relates to ~~MDEV-32817~~ [ ~~MDEV-32817~~ ]

Marko Mäkelä added a comment - 2024-06-28 12:38

Unfortunately, we still have some corruption going on. See the linked related tickets.

Marko Mäkelä added a comment - 2024-06-28 12:38 Unfortunately, we still have some corruption going on. See the linked related tickets.

Gavriil Papadakis made changes - 2024-11-11 15:51

Link

This issue relates to ~~MDEV-35385~~ [ ~~MDEV-35385~~ ]

MariaDB Server

Server crash after reading outside of bounds on ibdata1 , file corrupted, no auto-recovery

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration