[MDEV-33189] Server crash after reading outside of bounds on ibdata1, file corrupted, no auto-recovery

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Incomplete
Affects Version/s: 10.11.6, 11.2(EOL), 11.4
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
None
Environment:
Debian GNU/Linux 11 (bullseye)
Dell PowerEdge R750
XFS filesystem

Description

We experienced a one-time server crash in production, so far not reproducible.

We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11), installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye), as the database primary for a read- and write-heavy application. It runs on a bare-metal Dell PowerEdge R750 server with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz), 512 GiB RAM, and software RAID-1 NVMe storage with an XFS filesystem.

The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.

The server itself crashed with:

InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1

InnoDB: File './ibdata1' is corrupted

and two assertion failures in trx0undo.cc and buf0lru.cc. All subsequent restart attempts failed so we switched the application over to the replica database.

We did not attempt any forced recovery. The assertion failures:

InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416

InnoDB: Failing assertion: rollback

231221 14:24:48 [ERROR] mysqld got signal 6 ;

The backtrace produced only one line before the next assertion failure occurred:

stack_bottom = 0x7f614d088cd8 thread_stack 0x49000

InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285

InnoDB: Failing assertion: !block->page.in_file()

See attachment db-syslog.2023-12-21.txt for all the relevant syslog entries.

We have preserved the corrupt ibdata1 file (716 MiB, 750780416 B) for further inspection, should the need arise.
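A quick arithmetic check relates the reported read offset to the preserved file size. The sketch below assumes the default 16 KiB InnoDB page size (suggested by the 16384-byte read in the log) and divides both numbers by it. The constant and function names are illustrative only; `FIL_NULL` is named after the 0xFFFFFFFF value that InnoDB uses to mean "no page". Whether the failed read really was an attempt to read a `FIL_NULL` page number is an inference from the arithmetic, not something this report establishes.

```python
# Values copied from the report; PAGE_SIZE is an assumption (InnoDB default).
PAGE_SIZE = 16384                # bytes per page, matches the 16384-byte read
READ_OFFSET = 70368744161280     # offset from the "outside the bounds" message
FILE_SIZE = 750780416            # size of the preserved ibdata1 in bytes
FIL_NULL = 0xFFFFFFFF            # InnoDB's "no page" marker (2**32 - 1)


def page_number(offset: int, page_size: int = PAGE_SIZE) -> int:
    """Convert a page-aligned byte offset into a page number."""
    if offset % page_size != 0:
        raise ValueError("offset is not aligned to the page size")
    return offset // page_size


requested_page = page_number(READ_OFFSET)
pages_in_file = FILE_SIZE // PAGE_SIZE

print("requested page:", requested_page)
print("pages in file: ", pages_in_file)
print("equals FIL_NULL:", requested_page == FIL_NULL)
```

Under that page-size assumption, the offset corresponds to page number 4294967295, while the file holds only 45824 pages.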

Attachments

crash-11.4_g5.err, 7 kB, 2024-01-15 13:41
crash-11.4_my.cnf, 1 kB, 2024-01-15 13:41
db-syslog.2023-12-21.txt, 11 kB, 2024-01-05 13:34

Issue Links

relates to

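MDEV-32817 After recently upgrading to version 10.11.5, shortly after frequent read and write operations on a table, "index for table xxxx is corrupt" appeared, followed by "tablespace xxxxxx corrupted" for that table, and finally "Tablespace is missing for a table"; the table is now completely unusable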

Closed

MDEV-33922 InnoDB undo log tablespace file corruption

Closed

MDEV-34233 InnoDB crashes due to corrupted ibdata1 (Assertion failure in innodb.undo_page)

Closed

MDEV-34453 Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1

Closed

MDEV-35385 Server crash after reading outside of bounds on ibdata1

Closed

MDEV-33275 buf_flush_LRU(): mysql_mutex_assert_owner(&buf_pool.mutex) failed

Closed

Activity

Marko Mäkelä added a comment - 2024-01-17 06:37 - edited

I retested a CMAKE_BUILD_TYPE=RelWithDebInfo build of the same 11.2 commit e4cb1e3295f7e6f0e5287d97884d6149a2390d22 with the following patch:

diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index ebcf430d532..311ccdff802 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -2608,6 +2608,11 @@ static void buf_flush_page_cleaner()
     }
     else if (buf_flush_async_lsn <= oldest_lsn)
       goto check_oldest_and_set_idle;
+    else
+    {
+      abort();
+      mysql_mutex_lock(&buf_pool.mutex);
+    }
     n= n >= n_flushed ? n - n_flushed : 0;
     goto LRU_flush;
diff --git a/storage/innobase/include/ut0new.h b/storage/innobase/include/ut0new.h
index f4183e4c61a..85c2f662760 100644
--- a/storage/innobase/include/ut0new.h
+++ b/storage/innobase/include/ut0new.h
@@ -234,7 +234,7 @@ struct ut_new_pfx_t {
 #endif
 };
-#if defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
+#if 0 && defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
 static inline void ut_dontdump(void *ptr, size_t m_size, bool dontdump)
 {
 	ut_a(ptr != NULL);

The first hunk would cause the server to crash if the added code path were ever executed. The second hunk would ensure that any core dumps include a copy of the buffer pool.

On the first run of the default test suites, there was no crash. I then started 60 concurrent runs of the test stress.ddl_innodb. The code was not reached in the first run, with the mtr default innodb_log_file_size=10m. Second attempt: innodb_log_file_size=100m. Third attempt: additionally set innodb_buffer_pool_size=5m (instead of the mtr default innodb_buffer_pool_size=8m). Fourth attempt: additionally set innodb_io_capacity=10000. Fifth attempt: changed to innodb_buffer_pool_size=100m. None of the attempts reached this piece of code.

I think that we will need a longer-running, DML-heavy test to cover this piece of code. The test stress.ddl_innodb only runs for 160 seconds (less than 3 minutes).

The test atomic.rename_table, which is where I observed the debug assertion failure that pointed to this mutex, is not usable in this build, because it depends on debug injection.
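Because the attempts above build on each other, it can help to spell out the effective settings of each run. The following is only a sketch: it accumulates the changes named in this comment on top of the two mtr defaults mentioned here and renders the non-default ones as `--mysqld=` arguments for mtr. The helper names are hypothetical, and how the 60 concurrent runs were actually launched is not stated in the comment.

```python
# mtr defaults as stated in the comment above.
MTR_DEFAULTS = {
    "innodb_log_file_size": "10m",
    "innodb_buffer_pool_size": "8m",
}

# What each attempt changed relative to the previous one.
CHANGES = [
    {},                                    # 1st attempt: mtr defaults
    {"innodb_log_file_size": "100m"},      # 2nd attempt
    {"innodb_buffer_pool_size": "5m"},     # 3rd attempt (additionally)
    {"innodb_io_capacity": "10000"},       # 4th attempt (additionally)
    {"innodb_buffer_pool_size": "100m"},   # 5th attempt (changed)
]


def effective_settings(changes, defaults):
    """Return the cumulative settings in effect for each attempt."""
    current = dict(defaults)
    result = []
    for change in changes:
        current.update(change)
        result.append(dict(current))
    return result


def mysqld_args(settings, defaults):
    """Render the settings that differ from the defaults as mtr arguments."""
    return [
        f"--mysqld=--{name}={value}"
        for name, value in settings.items()
        if defaults.get(name) != value
    ]


ATTEMPTS = effective_settings(CHANGES, MTR_DEFAULTS)
for number, settings in enumerate(ATTEMPTS, start=1):
    print(number, " ".join(mysqld_args(settings, MTR_DEFAULTS)) or "(defaults)")
```

Read this way, the fifth attempt ran with a 100m log file, a 100m buffer pool, and innodb_io_capacity=10000.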

Marko Mäkelä added a comment - 2024-01-18 12:35

I filed ~~MDEV-33275~~ for the observed bug that might explain this. We do not have enough data to say if it really fixes this.

Marko Mäkelä added a comment - 2024-01-18 17:21

On our CI systems, the test stress.ddl_innodb occasionally fails. The first failure that loosely matches the symptoms here occurred 3 months after ~~MDEV-26827~~ had been pushed:

bb-11.2-MDEV-5816 ffb445d2b9337478b8bd750b06dac7336983503d
stress.ddl_innodb 'innodb' w9 [ fail ]
Test ended at 2023-06-14 04:49:36

CURRENT_TEST: stress.ddl_innodb
mysqltest: In included file "./suite/stress/include/ddl4.inc":
included from /home/buildbot/aarch64-fedora-37/build/mysql-test/suite/stress/t/ddl_innodb.test at line 41:
At line 331: query 'EXECUTE create_table1' failed: <Unknown> (2013): Lost connection to server during query
…
Server log from this test:
----------SERVER LOG START-----------
2023-06-14 04:49:35 0xffff9c218060 InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-37/build/storage/innobase/trx/trx0purge.cc line 354
InnoDB: Failing assertion: flst_add_first(rseg_header, TRX_RSEG + TRX_RSEG_HISTORY, undo_page, uint16_t(page_offset(undo_header) + TRX_UNDO_HISTORY_NODE), mtr) == DB_SUCCESS

Marko Mäkelä added a comment - 2024-02-12 15:13

The fix of ~~MDEV-33275~~ was included in 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3. Is this bug reproducible with those server versions?

There also is the open bug ~~MDEV-33363~~, which could potentially explain this.
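To answer the question for a given server, the version only needs to be compared against the first fixed release of its series. The helper below is a minimal sketch built solely from the version list in this comment; the function name is hypothetical, and for release series that the comment does not mention it returns None rather than guessing.

```python
# First release per series that contains the MDEV-33275 fix,
# as listed in the comment above: series (major, minor) -> patch level.
FIRST_FIXED = {
    (10, 6): 17,
    (10, 11): 7,
    (11, 0): 5,
    (11, 1): 4,
    (11, 2): 3,
}


def has_mdev_33275_fix(version: str):
    """Return True/False for listed series, None if the series is not listed."""
    major, minor, patch = (int(part) for part in version.split("."))
    first_fixed = FIRST_FIXED.get((major, minor))
    if first_fixed is None:
        return None  # the comment gives no information for this series
    return patch >= first_fixed


for candidate in ("10.11.6", "10.11.7", "10.6.17", "11.2.2"):
    print(candidate, has_mdev_33275_fix(candidate))
```

By this comparison, the 10.11.6 release named in the description predates the fix.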

Marko Mäkelä added a comment - 2024-06-28 12:38

Unfortunately, we still have some corruption going on. See the linked related tickets.
