[MDEV-29511] "Page read from tablespace is corrupted" when starting mariadb service after crash (signal 6) Created: 2022-09-12  Updated: 2022-11-04  Resolved: 2022-11-04

Status: Closed
Project: MariaDB Server
Component/s: Server, Storage Engine - InnoDB
Affects Version/s: 10.7.5
Fix Version/s: 10.7.6

Type: Bug Priority: Major
Reporter: Juan Ferrer Toribio Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: innodb

Issue Links:
Relates
relates to MDEV-29342 Assertion failure in file que0que.cc ... Closed
relates to MDEV-29364 CRITICAL - MariaDB 10.8.4 creating co... Closed
relates to MDEV-29374 Frequent "Data structure corruption" ... Closed
relates to MDEV-29383 Assertion mysql_mutex_assert_owner(&l... Closed

 Description   

MariaDB throws a lot of "Page read from tablespace is corrupted" errors" when starting mariadb service after crash with signal 6.

Crash:

2022-09-12  1:50:08 0 [ERROR] InnoDB: (Duplicate key) writing word node to FTS auxiliary index table `mediawiki`.`searchindex`
2022-09-12 01:50:08 0x7fb2f3fff700  InnoDB: Assertion failure in file ./storage/innobase/que/que0que.cc line 728
InnoDB: Failing assertion: trx->error_state == DB_SUCCESS
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mariadbd startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
InnoDB: about forcing recovery.
220912  1:50:08 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.7.5-MariaDB-1:10.7.5+maria~deb11-log
key_buffer_size=262144
read_buffer_size=1048576
max_used_connections=481
max_threads=1002
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5156811 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x80000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x556d64f0082e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x556d649eb8b5]

Trying to start service again:

2022-09-12  2:22:15 0 [Note] mariadbd: Aria engine: starting recovery
recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 4 3 2 1 0
 (0.0 seconds); 
2022-09-12  2:22:15 0 [Note] mariadbd: Aria engine: recovery done
2022-09-12  2:22:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2022-09-12  2:22:15 0 [Note] InnoDB: Number of transaction pools: 1
2022-09-12  2:22:15 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2022-09-12  2:22:15 0 [Note] InnoDB: Using liburing
2022-09-12  2:22:15 0 [Note] InnoDB: Initializing buffer pool, total size = 34359738368, chunk size = 134217728
2022-09-12  2:22:15 0 [Note] InnoDB: Completed initialization of buffer pool
2022-09-12  2:22:15 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=12508540539949,12508639505793
2022-09-12  2:22:30 0 [Note] InnoDB: Read redo log up to LSN=12509985608704
2022-09-12  2:22:45 0 [Note] InnoDB: Read redo log up to LSN=12511455187968
2022-09-12  2:22:50 0 [Note] InnoDB: Starting final batch to recover  56856 pages from redo log.
2022-09-12  2:22:51 0 [ERROR] InnoDB: Not applying DELETE_ROW_FORMAT_DYNAMIC due to corruption on [page id: space=29347044, page number=575]
2022-09-12  2:22:51 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
... <- A lot of same error for different DB tables
2022-09-12  2:22:51 0 [ERROR] InnoDB: Failed to read page 575 from file './hedera/userSession.ibd': Page read from tablespace is corrupted.
2022-09-12  2:22:51 0 [ERROR] InnoDB: Failed to read page 7634 from file './vn/routeLog.ibd': Page read from tablespace is corrupted.
... <- A lot of same error for different DB tables
2022-09-12  2:23:43 0 [ERROR] Plugin 'InnoDB' init function returned error.
2022-09-12  2:23:43 0 [ERROR] Plugin 'InnoDB' registration as a STORAGE ENGINE failed.
2022-09-12  2:23:43 0 [Note] Plugin 'FEEDBACK' is disabled.
2022-09-12  2:23:43 0 [ERROR] Unknown/unsupported storage engine: InnoDB
2022-09-12  2:23:43 0 [ERROR] Aborting



 Comments   
Comment by Daniel Black [ 2022-09-12 ]

Full text crash looks like MDEV-29342

The recovery fail looks like MDEV-29364, and hence likely to be MDEV-29374 (as oom/ asserts are a very similar condition with regards to consistency).

Comment by Juan Ferrer Toribio [ 2022-09-12 ]

Thank you Daniel, fortunately the impact has been minimal thanks to the fact that we have a slave server, probably we lost some transactions...

I haven't tried recovery with innodb_force_recovery=1 but I can if needed.

Comment by Daniel Black [ 2022-09-12 ]

There's a blog about upcoming releases - https://mariadb.org/regressions-in-recent-mariadb-server-releases/

I'm glad you had minimal impacts.

Interim 10.7 packages that includes fix are available here containing tested changes only.

innodb_force_recovery=1 seems right per the added error message in MDEV-29374. Will still need some datasync to ensure its all up to date however after some form of upgrade.

Comment by Marko Mäkelä [ 2022-09-12 ]

juanferrer, unfortunately due to MDEV-29374 and MDEV-29383, InnoDB in MariaDB 10.7.5 is not crash-safe. Due to incorrect log checkpoints allowed by MDEV-29383, data files can be corrupted even after recovery or preparing a backup succeeded.

Comment by Juan Ferrer Toribio [ 2022-09-12 ]

Hi Marko, as I have understood, there are several errors that together could cause partial/total loss of data even if you have backups.

Would it be advisable to downgrade 10.7.4? Could this or one of the related bugs silently corrupt data while the database is running?

Thank you for your prompt response and feedback.

Comment by Marko Mäkelä [ 2022-09-12 ]

juanferrer, yes, I would recommend a downgrade to 10.7.4 until 10.7.6 is available.

If the 10.6.9, 10.7.5, 10.8.4, … server never crashes before orderly shutdown (not including a shutdown with the crash-like innodb_fast_shutdown=2), then I do not think that any corruption should be possible.

Comment by Juan Ferrer Toribio [ 2022-09-12 ]

Hi marko in our case it crashed during an explicit systemctl stop mariadb, we were manually shutting down the server for maintenance.

2022-09-12  1:50:08 0 [Note] /usr/sbin/mariadbd (iniciado por: unknown): Apagado normal
2022-09-12  1:50:08 0 [Note] Event Scheduler: Killing the scheduler thread, thread id 2
2022-09-12  1:50:08 0 [Note] Event Scheduler: Waiting for the scheduler thread to reply
2022-09-12  1:50:08 0 [Note] Event Scheduler: Stopped
2022-09-12  1:50:08 0 [Note] Event Scheduler: Purging the queue. 21 events
2022-09-12  1:50:08 0 [ERROR] InnoDB: (Duplicate key) writing word node to FTS auxiliary index table `mediawiki`.`searchindex`
2022-09-12 01:50:08 0x7fb2f3fff700  InnoDB: Assertion failure in file ./storage/innobase/que/que0que.cc line 728
... <- Rest of logs included above

Also, it crashed (signal 6) 4 days ago but with different error which also led into innodb data corruption.

2022-09-08 22:04:26 305989 [Warning] InnoDB: A long wait (179 seconds) was observed for dict_sys.latch
2022-09-08 22:09:54 316058 [Warning] InnoDB: A long wait (506 seconds) was observed for dict_sys.latch
2022-09-08 22:10:57 317591 [Warning] InnoDB: A long wait (569 seconds) was observed for dict_sys.latch
2022-09-08 22:11:36 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce
-a-full-stack-trace-for-mysqld/
220908 22:11:36 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.7.5-MariaDB-1:10.7.5+maria~deb11-log
key_buffer_size=262144
read_buffer_size=1048576
max_used_connections=959
max_threads=1002
thread_count=961
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 5156811 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x80000
/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x562e01c3b82e]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x562e017268b5]
2022-09-08 22:12:01 317674 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-pr
oduce-a-full-stack-trace-for-mysqld/
2022-09-08 22:50:04 0 [Note] mariadbd: Aria engine: starting recovery
recovered pages: 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% (0.0 seconds); tables to flush: 4 3 2 1 0
 (0.0 seconds); 
2022-09-08 22:50:04 0 [Note] mariadbd: Aria engine: recovery done
2022-09-08 22:50:04 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2022-09-08 22:50:04 0 [Note] InnoDB: Number of transaction pools: 1
2022-09-08 22:50:04 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2022-09-08 22:50:04 0 [Note] InnoDB: Using liburing
2022-09-08 22:50:04 0 [Note] InnoDB: Initializing buffer pool, total size = 34359738368, chunk size = 134217728
2022-09-08 22:50:04 0 [Note] InnoDB: Completed initialization of buffer pool
2022-09-08 22:50:04 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=12510533871541,12511306797799
2022-09-08 22:50:19 0 [Note] InnoDB: Read redo log up to LSN=12511489910272
2022-09-08 22:50:32 0 [Note] InnoDB: Ignoring data file './vn/bAccount.ibd' with space ID 83730578. Another data file called ./vn/bAccount__.ibd exis
ts with the same space ID.
2022-09-08 22:50:32 0 [Note] InnoDB: Ignoring data file './vn/bAccount.ibd' with space ID 83730578. Another data file called ./vn/bAccount__.ibd exis
ts with the same space ID.
2022-09-08 22:50:34 0 [Note] InnoDB: Read redo log up to LSN=12512397190656
2022-09-08 22:50:49 0 [Note] InnoDB: Read redo log up to LSN=12513273472512
2022-09-08 22:51:04 0 [Note] InnoDB: Read redo log up to LSN=12514065475072
2022-09-08 22:51:19 0 [Note] InnoDB: To recover: 209745 pages from log
2022-09-08 22:51:19 0 [Note] InnoDB: 2 transaction(s) which must be rolled back or cleaned up in total 58 row operations to undo
2022-09-08 22:51:19 0 [Note] InnoDB: Trx id counter is 23614408848
2022-09-08 22:51:19 0 [Note] InnoDB: Starting final batch to recover  209712 pages from redo log.
2022-09-08 22:51:26 0 [ERROR] InnoDB: Not applying DELETE_ROW_FORMAT_DYNAMIC due to corruption on [page id: space=63072650, page number=1599]
... <- A lot of similar corruption errors like above

So, for security, we are going to downgrade to 10.7.4, Are data files compatible between 10.7.4 and 10.7.5? Should I take any other action aside from downgrading the version? Plus make a backup...

Comment by Sergei Golubchik [ 2022-09-13 ]

juanferrer, could you please add a comment here, when you'll eventually upgrade to 10.7.6, telling us whether it'll work for you? If it will, this issue could be closed as a duplicate, otherwise we'll have to to keep looking.

Comment by Marko Mäkelä [ 2022-09-14 ]

juanferrer, downgrades within the same major version series should generally be possible.

You should treat the data directory after any kill or crash of a 10.7.5 server as potentially corrupted. Even if crash recovery appears to succeed, some data may be corrupted. This is because due to MDEV-29383, the log checkpoint could be advanced too far. The constraint is that the checkpoint LSN must never be newer than the oldest modified data page.

I am quoting my comment from MDEV-29422:

I happened to encounter this failure where preparing a backup succeeds but a table is reported as corrupted.

10.10 c0a6ce61d88104db14f7107cb16edf475b46bffd

mariabackup.huge_lsn 'strict_full_crc32' w3 [ fail ]
        Test ended at 2022-09-06 10:55:11
[01] 2022-09-06 10:55:09 Copying ./mysql/transaction_registry.ibd to /dev/shm/var/3/mysqld.1/data/mysql/transaction_registry.ibd
[01] 2022-09-06 10:55:09         ...done
[00] 2022-09-06 10:55:09 completed OK!
mysqltest: At line 92: query 'SELECT * FROM t' failed: ER_TABLE_CORRUPT (1877): Table test/t is corrupted. Please drop the table and recreate.

This was with a code revision that did not include the fixes of MDEV-29383 or MDEV-29374.

Comment by Juan Ferrer Toribio [ 2022-09-14 ]

Thank you marko, seems clear now.

serg, for now we are going to downgrade to 10.7.4, as soon as version 10.7.6 is published and after working with it for a while, I will communicate it.

Comment by Marko Mäkelä [ 2022-09-22 ]

juanferrer, 10.7.6 should be available now. Does that work for you?

Comment by Juan Ferrer Toribio [ 2022-09-22 ]

Hi marko, is it (relatively) safe to upgrade compared to 10.7.4 & 10.7.5, or it's better to wait some time?

These errors occurred in production and I would not like to expose again our data to corruption.

According to the road map published on https://jira.mariadb.org, the release date or 10.7.6 was planned for 2022-10-27, has the release date been brought forward?

Comment by Marko Mäkelä [ 2022-09-22 ]

juanferrer, a series of unscheduled releases including 10.7.6 took place this week because of the crash recovery regressions. The next scheduled releases will include 10.7.7.

I believe that it should be safe to upgrade to 10.7.6. Since the time of that release, we have only found the recovery failure MDEV-29559, which unlike this bug is a "parsing" problem of a rare corner case, not a "write side" error.

Comment by Marko Mäkelä [ 2022-09-22 ]

There are some good news from a 10.8.5 user in MDEV-29364.

Comment by Juan Ferrer Toribio [ 2022-10-07 ]

Hi marko, today I've upgraded our production MariaDB server from 10.7.4 to 10.7.6, after some time of testing I will give you feedback.

Comment by Elena Stepanova [ 2022-11-04 ]

juanferrer,
How did it go with 10.7.6?

Comment by Juan Ferrer Toribio [ 2022-11-04 ]

Hi elenst,

There are still crashes in MariaDB 10.7.6, we had at least 3 (signal 11) since last upgrade, but the innodb recovery system worked fine in all of them, so I guess this particular issue is resolved.

Generated at Thu Feb 08 10:09:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.