[MDEV-34453] Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1 - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.6.17
Fix Version/s: 10.6.20, 10.11.10, 11.2.6, 11.4.4
Component/s: Storage Engine - InnoDB
Labels:
None

Description

Customer gets the following error when starting 10.6.17 immediately following sofware update from previous version, 10.6.16.

2024-06-19 21:05:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1

2024-06-19 21:05:44 10 [ERROR] InnoDB: File './ibdata1' is corrupted

2024-06-19 21:05:44 10 [ERROR] InnoDB: We detected index corruption in an InnoDB type table. You have to dump + drop + reimport the table or, in a case of widespread corruption, dump all InnoDB tables and recreate the whole tablespace. If the mariadbd server crashes after the startup or when you dump the tables. Please refer to https://mariadb.com/kb/en/library/innodb-recovery-modes/ for information about forcing recovery.

2024-06-19 21:05:44 10 [ERROR] mariadbd: Index for table 'gtid_slave_pos' is corrupt; try to repair it

2024-06-19 21:05:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1

2024-06-19 21:05:44 10 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: ./ibdata1

2024-06-19 21:05:44 0x7f266c669700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/trx/trx0purge.cc line 268

For customer, this has occurred more than one time.

Customer discovered a workaround:
Start database one time with

innodb_force_recovery=1

Shutdown, remove innodb_force_recovery configuration.
Start database normally.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

Debian-bookworm-latest-amd64-base.err
11 kB
2024-08-29 11:04

Issue Links

causes

MDEV-35181 Assertion 'state < buf_page_t::READ_FIX' in buf_page_create_low()

Closed

MDEV-35626 Race condition between buf_page_create_low() and read completion

Closed

is caused by

MDEV-27058 Buffer page descriptors are too large

Closed

is duplicated by

MDEV-32174 ROW_FORMAT=COMPRESSED table corruption due to ROLLBACK

Closed

MDEV-34059 Random innodb index corruption

Closed

MDEV-34873 corrupted index

Closed

MDEV-34978 MariaDB 11.4.2 crash read outside the bounds of the file: .//undo001

Closed

MDEV-35056 InnoDB: Trying to read 16384 bytes at xxxx outside the bounds of the file: ./ibdata1

Closed

MDEV-35385 Server crash after reading outside of bounds on ibdata1

Closed

relates to

MDEV-32817 在最近将版本升级到10.11.5后，针对表进行频繁的读写操作不久后，出现index for table xxxx is corrupt，随后此表tablespace xxxxxx corrupted，最后Tablespace is missing for a table，此表已完全不可用

Closed

MDEV-33922 InnoDB undo log tablespace file corruption

Closed

MDEV-34233 InnoDB crashes due to corrupted ibdata1 (Assertion failure in innodb.undo_page)

Closed

MDEV-29832 rpl.rpl_semi_sync_after_sync_row frequently fails

Open

MDEV-33189 Server crash after reading outside of bounds on ibdata1 , file corrupted, no auto-recovery

Closed

(4 is duplicated by, 5 relates to)

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2024-06-26 04:10

70368744161280 is the byte offset of the largest 32-bit page number (4294967295, also known as FIL_NULL in the source code) when using the default innodb_page_size=16k.

In many persistent InnoDB data structures, such as linked lists of pages, FIL_NULL is being used as a "null pointer". A list somewhere is corrupted, and we do not know where.

It would be great if a copy of the corrupted data had been saved before engaging the workaround, and MariaDB 10.6.18 could be started on such a copy. I would like to see if the changes made in ~~MDEV-33325~~ would have any impact on this. Ultimately, I would like to know where exactly the incorrect page number was being read without validation, so that we could get closer to the root cause (the write that corrupted the database).

Based on the fact that innodb_force_recovery=1 seemed to heal the database, I assume that the corruption is somewhere in the undo logs, and as a result of this corruption, some purging of the committed transaction history was skipped. I would expect some CHECK TABLE…EXTENDED (~~MDEV-24402~~) to report some orphan secondary index records. But, we must keep in mind that such orphans can also be caused by the known bug MDEV-29823 even if this undo log corruption bug wasn’t involved.

Marko Mäkelä added a comment - 2024-06-26 04:10 70368744161280 is the byte offset of the largest 32-bit page number (4294967295, also known as FIL_NULL in the source code) when using the default innodb_page_size=16k . In many persistent InnoDB data structures, such as linked lists of pages, FIL_NULL is being used as a "null pointer". A list somewhere is corrupted, and we do not know where. It would be great if a copy of the corrupted data had been saved before engaging the workaround, and MariaDB 10.6.18 could be started on such a copy. I would like to see if the changes made in MDEV-33325 would have any impact on this. Ultimately, I would like to know where exactly the incorrect page number was being read without validation, so that we could get closer to the root cause (the write that corrupted the database). Based on the fact that innodb_force_recovery=1 seemed to heal the database, I assume that the corruption is somewhere in the undo logs, and as a result of this corruption, some purging of the committed transaction history was skipped. I would expect some CHECK TABLE…EXTENDED ( MDEV-24402 ) to report some orphan secondary index records. But, we must keep in mind that such orphans can also be caused by the known bug MDEV-29823 even if this undo log corruption bug wasn’t involved.

Marko Mäkelä added a comment - 2024-06-28 12:42

I did not have time to check this. However, it would be good if debarun can check this. I think that a matching copy of the ib_logfile0 will be necessary as well. We may have to tweak some recovery code, and to hack the checkpoint fields to start recovery from an earlier checkpoint LSN, to understand the issue. Having just the ibdata1 is not that useful. If we also got the log file, we could say more.

~~MDEV-32817~~ and the related bugs look promising, because apparently some Chinese people are able to reproduce corruption using Sysbench. I think that we must engage with them.

Marko Mäkelä added a comment - 2024-06-28 12:42 I did not have time to check this. However, it would be good if debarun can check this. I think that a matching copy of the ib_logfile0 will be necessary as well. We may have to tweak some recovery code, and to hack the checkpoint fields to start recovery from an earlier checkpoint LSN, to understand the issue. Having just the ibdata1 is not that useful. If we also got the log file, we could say more. MDEV-32817 and the related bugs look promising, because apparently some Chinese people are able to reproduce corruption using Sysbench. I think that we must engage with them.

Debarun Banerjee added a comment - 2024-08-14 15:09

Please provide the corrupt redo logs for further analysis.

Debarun Banerjee added a comment - 2024-08-14 15:09 Please provide the corrupt redo logs for further analysis.

Kristian Nielsen added a comment - 2024-08-29 11:04

This comment for some reason has the wrong visibility:

Marko Mäkelä added a comment - 2024-06-28 14:57 - Restricted to Developers

It is a long shot, but in ~~MDEV-32817~~ I just posted that ~~MDEV-33588~~ might have fixed this corruption. This guess has not been checked yet.

Kristian Nielsen added a comment - 2024-08-29 11:04 This comment for some reason has the wrong visibility: Marko Mäkelä added a comment - 2024-06-28 14:57 - Restricted to Developers It is a long shot, but in MDEV-32817 I just posted that MDEV-33588 might have fixed this corruption. This guess has not been checked yet.

Kristian Nielsen added a comment - 2024-08-29 11:22

As a data point, I just got something similar to this in heavy benchmark runs:

2024-08-28 14:42:54 0 [Note] /mdev34705/install/ma1104_base/bin/mariadbd: ready for connections.

Version: '11.4.2-MariaDB-log'  socket: '/tmp/mysql.sock'  port: 3306  Source distribution

2024-08-28 15:48:10 696 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo011

2024-08-28 15:48:10 696 [ERROR] InnoDB: File './/undo011' is corrupted

2024-08-28 15:48:10 0x7feb974696c0  InnoDB: Assertion failure in file /mdev34705/src/mariadb-11.4-base/storage/innobase/trx/trx0purge.cc line 268

2024-08-28 15:48:11 0 [Note] Starting MariaDB 11.4.2-MariaDB-log source revision ae6684d79f4183cdc69b3bba9b146e7455addc0b as process 494163

2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: starting recovery

recovered pages: 0% 11% 22% 33% 51% 66% 100% (0.0 seconds); tables to flush: 2 1 0 (0.0 seconds);

2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: recovery done

2024-08-28 15:48:11 0 [Note] InnoDB: Compressed tables use zlib 1.2.13

2024-08-28 15:48:11 0 [Note] InnoDB: Number of transaction pools: 1

2024-08-28 15:48:11 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions

2024-08-28 15:48:11 0 [Note] InnoDB: Using Linux native AIO

2024-08-28 15:48:11 0 [Note] InnoDB: Initializing buffer pool, total size = 8.000GiB, chunk size = 128.000MiB

2024-08-28 15:48:11 0 [Note] InnoDB: Completed initialization of buffer pool

2024-08-28 15:48:11 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)

2024-08-28 15:48:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=173397039313

2024-08-28 15:48:26 0 [Note] InnoDB: Parsed redo log up to LSN=175522198374; to recover: 175470 pages

2024-08-28 15:48:34 0 [ERROR] InnoDB: Corrupted page identifier at 176274948073; set innodb_force_recovery=1 to ignore the record.

2024-08-28 15:48:34 0 [Note] InnoDB: End of log at LSN=176274948073

2024-08-28 15:48:34 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error

I got this with old 11.4 version on commit ae6684d79f4183cdc69b3bba9b146e7455addc0b . Note that this version does not have the ~~MDEV-33588~~ fix.

Full error log attached as Debian-bookworm-latest-amd64-base.err.

This bug happened on a new clean install of the database, after 1 hour of heavy load. As seen in the attached error log, after the crash the database is reported as corrupted. Thus, unlike the original report, this is not related to upgrading.

This is from an automated benchmark run and occurred only once so far, so I don't have any way to reproduce, nor any saved data files or redo/undo logs. But I thought it was still worth it to mention as a data point.

Kristian Nielsen added a comment - 2024-08-29 11:22 As a data point, I just got something similar to this in heavy benchmark runs: 2024-08-28 14:42:54 0 [Note] /mdev34705/install/ma1104_base/bin/mariadbd: ready for connections. Version: '11.4.2-MariaDB-log' socket: '/tmp/mysql.sock' port: 3306 Source distribution 2024-08-28 15:48:10 696 [ERROR] InnoDB: Trying to read 16384 bytes at 70368744161280 outside the bounds of the file: .//undo011 2024-08-28 15:48:10 696 [ERROR] InnoDB: File './/undo011' is corrupted 2024-08-28 15:48:10 0x7feb974696c0 InnoDB: Assertion failure in file /mdev34705/src/mariadb-11.4-base/storage/innobase/trx/trx0purge.cc line 268 2024-08-28 15:48:11 0 [Note] Starting MariaDB 11.4.2-MariaDB-log source revision ae6684d79f4183cdc69b3bba9b146e7455addc0b as process 494163 2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: starting recovery recovered pages: 0% 11% 22% 33% 51% 66% 100% (0.0 seconds); tables to flush: 2 1 0 (0.0 seconds); 2024-08-28 15:48:11 0 [Note] mariadbd: Aria engine: recovery done 2024-08-28 15:48:11 0 [Note] InnoDB: Compressed tables use zlib 1.2.13 2024-08-28 15:48:11 0 [Note] InnoDB: Number of transaction pools: 1 2024-08-28 15:48:11 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions 2024-08-28 15:48:11 0 [Note] InnoDB: Using Linux native AIO 2024-08-28 15:48:11 0 [Note] InnoDB: Initializing buffer pool, total size = 8.000GiB, chunk size = 128.000MiB 2024-08-28 15:48:11 0 [Note] InnoDB: Completed initialization of buffer pool 2024-08-28 15:48:11 0 [Note] InnoDB: Buffered log writes (block size=512 bytes) 2024-08-28 15:48:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=173397039313 2024-08-28 15:48:26 0 [Note] InnoDB: Parsed redo log up to LSN=175522198374; to recover: 175470 pages 2024-08-28 15:48:34 0 [ERROR] InnoDB: Corrupted page identifier at 176274948073; set innodb_force_recovery=1 to ignore the record. 2024-08-28 15:48:34 0 [Note] InnoDB: End of log at LSN=176274948073 2024-08-28 15:48:34 0 [ERROR] InnoDB: Plugin initialization aborted with error Generic error I got this with old 11.4 version on commit ae6684d79f4183cdc69b3bba9b146e7455addc0b . Note that this version does not have the MDEV-33588 fix. Full error log attached as Debian-bookworm-latest-amd64-base.err. This bug happened on a new clean install of the database, after 1 hour of heavy load. As seen in the attached error log, after the crash the database is reported as corrupted. Thus, unlike the original report, this is not related to upgrading. This is from an automated benchmark run and occurred only once so far, so I don't have any way to reproduce, nor any saved data files or redo/undo logs. But I thought it was still worth it to mention as a data point.

Marko Mäkelä added a comment - 2024-09-19 12:40

Thank you, it is great that you found and fixed my mistake.

The proposed fix looks correct to me. But it looks like we can eliminate a pair of unfix() and fix() operations. I would also like to see some form of intermittent fault injection that would better cover this rare code path at least in one regression test, such as stress.ddl_innodb.

Marko Mäkelä added a comment - 2024-09-19 12:40 Thank you, it is great that you found and fixed my mistake. The proposed fix looks correct to me. But it looks like we can eliminate a pair of unfix() and fix() operations. I would also like to see some form of intermittent fault injection that would better cover this rare code path at least in one regression test, such as stress.ddl_innodb .

Debarun Banerjee added a comment - 2024-12-11 14:30

This issue was not completely fixed. Thanks to marko and mleich for identifying and filing ~~MDEV-35181~~to handle it.

Debarun Banerjee added a comment - 2024-12-11 14:30 This issue was not completely fixed. Thanks to marko and mleich for identifying and filing MDEV-35181 to handle it.

People

Assignee:: Debarun Banerjee

Reporter:: Edward Stoever

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 2024-06-25 17:31

Updated:: 2025-03-10 09:02

Resolved:: 2024-09-20 18:11

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration