Details
Bug
Status: Closed
Critical
Resolution: Incomplete
10.11.6, 11.2(EOL), 11.4
None
Debian GNU/Linux 11 (bullseye)
Dell PowerEdge R750
XFS filesystem
Description
We experienced a one-time server crash in production, so far not reproducible.
We are running MariaDB 10.11.6 (1:10.11.6+maria~deb11), installed from a MariaDB repo mirror on Debian GNU/Linux 11 (bullseye), as the database primary for a read- and write-heavy application. It runs on a bare-metal Dell PowerEdge R750 server with 64 cores (Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz) and 512 GiB RAM, on a software RAID-1 NVMe with an XFS filesystem.
The server crash happened on 2023-12-21. On 2023-12-02 we had upgraded to v10.11.6. Prior to that the DB ran without any problems on v10.6.7 for almost 1.5 years.
The server itself crashed with:
InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1
InnoDB: File './ibdata1' is corrupted
and two assertion failures in trx0undo.cc and buf0lru.cc. All subsequent restart attempts failed, so we switched the application over to the replica database.
We did not attempt any forced recovery. The assertion failures:
InnoDB: Assertion failure in file ./storage/innobase/trx/trx0undo.cc line 1416
InnoDB: Failing assertion: rollback
231221 14:24:48 [ERROR] mysqld got signal 6 ;
The backtrace printed only one line before the next assertion failure occurred:
stack_bottom = 0x7f614d088cd8 thread_stack 0x49000
InnoDB: Assertion failure in file ./storage/innobase/buf/buf0lru.cc line 285
InnoDB: Failing assertion: !block->page.in_file()
See attachment db-syslog.2023-12-21.txt for all the relevant syslog entries.
We have preserved the corrupt 716 MiB ibdata1 file (750780416 B) for further inspection, should the need arise.
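The numbers in the error message and the preserved file size can be cross-checked with a little arithmetic. The sketch below assumes the default 16 KiB InnoDB page size, which matches the 16384-byte read in the log; the helper name page_number_at and the constants are hypothetical and introduced only for illustration. Under that assumption the failed read addresses page number 0xFFFFFFFF, the same value InnoDB uses as its FIL_NULL "no page" marker, while the preserved file holds only 45824 pages. This is an observation about the numbers, not a confirmed root cause.

```cpp
#include <cstdint>

// Assumption: default 16 KiB InnoDB page size (matches the 16384-byte read in the log).
constexpr std::uint64_t kPageSize = 16384;

// Hypothetical helper: page number addressed by a byte offset in a tablespace file.
constexpr std::uint64_t page_number_at(std::uint64_t byte_offset)
{
  return byte_offset / kPageSize;
}

// Values copied from the report above.
constexpr std::uint64_t kFailedReadOffset = 70368744161280ULL; // offset in the error message
constexpr std::uint64_t kIbdata1Size = 750780416ULL;           // preserved ibdata1 size in bytes

// The failed read is page-aligned and addresses page 0xFFFFFFFF (4294967295).
static_assert(kFailedReadOffset % kPageSize == 0, "offset is page-aligned");
static_assert(page_number_at(kFailedReadOffset) == 0xFFFFFFFFULL,
              "offset corresponds to page number 0xFFFFFFFF");

// The file is exactly 716 MiB, that is 45824 pages, so the valid page numbers are 0..45823.
static_assert(kIbdata1Size == 716ULL * 1024 * 1024, "file is exactly 716 MiB");
static_assert(kIbdata1Size / kPageSize == 45824, "file holds 45824 pages");
```

Because every check is a static_assert, the snippet verifies itself at compile time; page_number_at can also be called at run time on other offsets taken from a log.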
Attachments
Issue Links
- relates to
- MDEV-32817 After a recent upgrade to version 10.11.5, shortly after frequent read and write operations on a table, "index for table xxxx is corrupt" appeared, followed by "tablespace xxxxxx corrupted" for that table and finally "Tablespace is missing for a table"; the table is now completely unusable (Closed)
- MDEV-33922 InnoDB undo log tablespace file corruption (Closed)
- MDEV-34233 InnoDB crashes due to corrupted ibdata1 (Assertion failure in innodb.undo_page) (Closed)
- MDEV-34453 Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1 (Closed)
- MDEV-35385 Server crash after reading outside of bounds on ibdata1 (Closed)
- MDEV-33275 buf_flush_LRU(): mysql_mutex_assert_owner(&buf_pool.mutex) failed (Closed)
I retested a CMAKE_BUILD_TYPE=RelWithDebInfo build of the same 11.2 commit e4cb1e3295f7e6f0e5287d97884d6149a2390d22 with the following patch:
diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
index ebcf430d532..311ccdff802 100644
--- a/storage/innobase/buf/buf0flu.cc
+++ b/storage/innobase/buf/buf0flu.cc
@@ -2608,6 +2608,11 @@ static void buf_flush_page_cleaner()
       }
       else if (buf_flush_async_lsn <= oldest_lsn)
         goto check_oldest_and_set_idle;
+      else
+      {
+        abort();
+        mysql_mutex_lock(&buf_pool.mutex);
+      }
       n= n >= n_flushed ? n - n_flushed : 0;
       goto LRU_flush;
diff --git a/storage/innobase/include/ut0new.h b/storage/innobase/include/ut0new.h
index f4183e4c61a..85c2f662760 100644
--- a/storage/innobase/include/ut0new.h
+++ b/storage/innobase/include/ut0new.h
@@ -234,7 +234,7 @@ struct ut_new_pfx_t {
 #endif
 };
-#if defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
+#if 0 && defined(DBUG_OFF) && defined(HAVE_MADVISE) && defined(MADV_DODUMP)
 static inline void ut_dontdump(void *ptr, size_t m_size, bool dontdump)
 {
The first hunk makes the server crash if the newly added else branch is ever executed. The second hunk ensures that any core dump includes a copy of the buffer pool.
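For readers unfamiliar with the mechanism that the second hunk disables: on Linux, a memory region can be excluded from core dumps with madvise() and MADV_DONTDUMP, and included again with MADV_DODUMP. The following is a minimal sketch of that mechanism only. The helper names exclude_from_core_dump and include_in_core_dump are hypothetical, and this is not the code of ut_dontdump() in ut0new.h.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Hypothetical helper, not MariaDB code: ask the kernel to leave the
// page-aligned region [ptr, ptr + size) out of any core dump.
// Returns true on success, false if the call fails or is unavailable.
inline bool exclude_from_core_dump(void *ptr, std::size_t size)
{
#if defined(MADV_DONTDUMP)
  return madvise(ptr, size, MADV_DONTDUMP) == 0;
#else
  (void) ptr;
  (void) size;
  return false;
#endif
}

// Hypothetical helper, not MariaDB code: undo the effect of
// exclude_from_core_dump() so that the region is dumped again.
inline bool include_in_core_dump(void *ptr, std::size_t size)
{
#if defined(MADV_DODUMP)
  return madvise(ptr, size, MADV_DODUMP) == 0;
#else
  (void) ptr;
  (void) size;
  return false;
#endif
}
```

Both helpers expect a page-aligned address, for example one returned by mmap(). A build in which such calls are compiled out, as with the #if 0 in the patch above, never marks the memory as excluded, so it ends up in the core file.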
On the first run of the default test suites, there was no crash. I then started 60 concurrent runs of the test stress.ddl_innodb. The code was not reached in the first attempt, which used the mtr default innodb_log_file_size=10m. Second attempt: innodb_log_file_size=100m. Third attempt: additionally set innodb_buffer_pool_size=5m (instead of the mtr default innodb_buffer_pool_size=8m). Fourth attempt: additionally set innodb_io_capacity=10000. Fifth attempt: changed to innodb_buffer_pool_size=100m. None of these attempts reached this piece of code.
I think that we will need a longer-running, DML-heavy test to cover this piece of code. The stress.ddl_innodb test runs for only 160 seconds (less than 3 minutes).
The test atomic.rename_table, which is where I observed the debug assertion failure that pointed to this mutex, is not usable in this build, because it depends on debug injection.