[MDEV-35226] InnoDB occasionally fails to recover a corrupted page from the doublewrite buffer - Jira

Details

Type: Bug
Status: Stalled (View Workflow)
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.6
Fix Version/s: 10.6
Component/s: Storage Engine - InnoDB
Labels:
- need_rr
- race
- recovery

Description

While testing the fix of ~~MDEV-35225~~, I came up with the following rare failure of the test innodb.innodb-32k-crash:

10.6 1067046b7f4337080e5b59c364c0a53cbeca715e
2024-10-21 23:10:11 0 [Note] InnoDB: Completed initialization of buffer pool
2024-10-21 23:10:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1059287101,1062232449
2024-10-21 23:10:11 0 [ERROR] InnoDB: Cannot apply log to [page id: space=704, page number=0] of corrupted file './test/t1.ibd'
2024-10-21 23:10:11 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1513] with error Data structure corruption

When I tried recovery on a copy of the data directory 32k-recovery.tar.xz, it would succeed each time. The difference to the above would be that there would be a message about doublewrite buffer recovery:

10.6 1067046b7f4337080e5b59c364c0a53cbeca715e
2024-10-21 23:42:51 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1059287101,1059287101
2024-10-21 23:42:51 0 [Note] InnoDB: Trying to recover page [page id: space=704, page number=0] from the doublewrite buffer.
2024-10-21 23:42:51 0 [Note] InnoDB: Recovered page [page id: space=704, page number=0] to './test/t1.ibd' from the doublewrite buffer.
2024-10-21 23:42:51 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 20 row operations to undo
2024-10-21 23:42:51 0 [Note] InnoDB: Trx id counter is 52417
2024-10-21 23:42:51 0 [Note] InnoDB: To recover: 698 pages
2024-10-21 23:42:51 0 [Note] InnoDB: Resizing redo log from 33554432 to 100663296 bytes; LSN=1062232813

I will try to reproduce this failure under rr record so that this can be analyzed and fixed.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

32k-recovery.tar.xz
1.37 MB
2024-10-22 06:48

Issue Links

relates to

MDEV-34830 LSN in the future is not being treated as serious corruption

Closed

MDEV-35225 Bogus debug assertion failures in innodb.innodb-32k-crash

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä created issue - 2024-10-22 06:50

Marko Mäkelä made changes - 2024-10-22 06:50

Field	Original Value	New Value
Link		This issue relates to ~~MDEV-34830~~ [ ~~MDEV-34830~~ ]

Marko Mäkelä made changes - 2024-10-22 06:50

Link

This issue relates to ~~MDEV-35225~~ [ ~~MDEV-35225~~ ]

Marko Mäkelä made changes - 2024-10-22 06:51

Status

Open [ 1 ]

In Progress [ 3 ]

Marko Mäkelä made changes - 2024-10-22 09:00

Status

In Progress [ 3 ]

Stalled [ 10000 ]

Marko Mäkelä added a comment - 2024-10-22 09:18

I reproduced this once or twice more by running the test innodb.innodb-32k-crash. The failure rate is too low. I have been trying to reproduce it with the following patches applied:

diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc

index 727a9f4d4fc..0219bfb8c92 100644

--- a/storage/innobase/srv/srv0start.cc

+++ b/storage/innobase/srv/srv0start.cc

@@ -914,6 +914,7 @@ srv_init_abort_low(

 			" with error " << err;

+	abort();

 	srv_shutdown_bg_undo_sources();

 	srv_shutdown_threads();

 	return(err);

diff --git a/storage/innobase/buf/buf0dblwr.cc b/storage/innobase/buf/buf0dblwr.cc

index 31036bf978d..67bc3c3970f 100644

--- a/storage/innobase/buf/buf0dblwr.cc

+++ b/storage/innobase/buf/buf0dblwr.cc

@@ -464,7 +464,7 @@ void buf_dblwr_t::recover()

     goto next_page;

-  recv_sys.dblwr.pages.clear();

+  // recv_sys.dblwr.pages.clear();

   fil_flush_file_spaces();

   aligned_free(read_buf);

The latter hunk aims to preserve recv_sys.dblwr.pages intact for post-mortem analysis in a core dump. With rr, neither patch would be strictly necessary.

I just got a failure with both patches applied:

10.6 b38edd09ff1c566f7b4681abfa6794550dfbbbc5 (patched)
innodb.innodb-32k-crash w67 [ 389 fail ]
…
2024-10-22 2:01:46 0 [ERROR] InnoDB: Cannot apply log to [page id: space=1169, page number=0] of corrupted file './test/t1.ibd'
2024-10-22 2:01:46 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1514] with error Data structure corruption

At the time of the crash, recv_sys.dblwr.pages is empty, even though there was a page (with page number 0) for that tablespace in the doublewrite buffer:

od -Ax -t x1  var/log/innodb.innodb-32k-crash/mysqld.1/data/ibdata1 | grep -B2 '[08]020 .. .. 00 00 04 91 00 00 04 91'

200000 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff

200010 00 00 00 00 69 4d 50 97 00 08 00 00 00 00 00 00

200020 00 00 00 00 04 91 00 00 04 91 00 00 00 00 00 00

The FIL_PAGE_LSN is 0x694d5079 or 1,766,674,583 which is relatively close to the start of the recovery LSN:

2024-10-22  2:01:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1763439252,1766696461

The log_sys.lsn at the time of the failure was 1,766,696,675 (which is ahead of the FIL_PAGE_LSN). The recv_sys.scanned_lsn as well as recv_sys.recovered_lsn were slightly less, 1,766,696,577, still ahead of that FIL_PAGE_LSN.

I also tried to reproduce the recovery failure on 32k-recovery.tar.xz under rr record for about 2 hours, but I had no luck. The FIL_PAGE_LSN at the start of the first page of test/t1.ibd does mismatch with the 32 least significant bits at the end of the page, and innochecksum is reporting the failure as well.

I would expect RQG to be a better tool for reproducing this. A large innodb_page_size should improve the chances of killing the server during a page write.

Marko Mäkelä added a comment - 2024-10-22 09:18 I reproduced this once or twice more by running the test innodb.innodb-32k-crash . The failure rate is too low. I have been trying to reproduce it with the following patches applied: diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc index 727a9f4d4fc..0219bfb8c92 100644 --- a/storage/innobase/srv/srv0start.cc +++ b/storage/innobase/srv/srv0start.cc @@ -914,6 +914,7 @@ srv_init_abort_low( " with error " << err; } + abort(); srv_shutdown_bg_undo_sources(); srv_shutdown_threads(); return(err); diff --git a/storage/innobase/buf/buf0dblwr.cc b/storage/innobase/buf/buf0dblwr.cc index 31036bf978d..67bc3c3970f 100644 --- a/storage/innobase/buf/buf0dblwr.cc +++ b/storage/innobase/buf/buf0dblwr.cc @@ -464,7 +464,7 @@ void buf_dblwr_t::recover() goto next_page; } - recv_sys.dblwr.pages.clear(); + // recv_sys.dblwr.pages.clear(); fil_flush_file_spaces(); aligned_free(read_buf); } The latter hunk aims to preserve recv_sys.dblwr.pages intact for post-mortem analysis in a core dump. With rr , neither patch would be strictly necessary. I just got a failure with both patches applied: 10.6 b38edd09ff1c566f7b4681abfa6794550dfbbbc5 (patched) innodb.innodb-32k-crash w67 [ 389 fail ] … 2024-10-22 2:01:46 0 [ERROR] InnoDB: Cannot apply log to [page id: space=1169, page number=0] of corrupted file './test/t1.ibd' 2024-10-22 2:01:46 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1514] with error Data structure corruption At the time of the crash, recv_sys.dblwr.pages is empty, even though there was a page (with page number 0) for that tablespace in the doublewrite buffer: od -Ax -t x1 var/log/innodb.innodb-32k-crash/mysqld.1/data/ibdata1 | grep -B2 '[08]020 .. .. 00 00 04 91 00 00 04 91' 200000 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff 200010 00 00 00 00 69 4d 50 97 00 08 00 00 00 00 00 00 200020 00 00 00 00 04 91 00 00 04 91 00 00 00 00 00 00 The FIL_PAGE_LSN is 0x694d5079 or 1,766,674,583 which is relatively close to the start of the recovery LSN: 2024-10-22 2:01:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1763439252,1766696461 The log_sys.lsn at the time of the failure was 1,766,696,675 (which is ahead of the FIL_PAGE_LSN ). The recv_sys.scanned_lsn as well as recv_sys.recovered_lsn were slightly less, 1,766,696,577, still ahead of that FIL_PAGE_LSN . I also tried to reproduce the recovery failure on 32k-recovery.tar.xz under rr record for about 2 hours, but I had no luck. The FIL_PAGE_LSN at the start of the first page of test/t1.ibd does mismatch with the 32 least significant bits at the end of the page, and innochecksum is reporting the failure as well. I would expect RQG to be a better tool for reproducing this. A large innodb_page_size should improve the chances of killing the server during a page write.

Marko Mäkelä made changes - 2024-10-22 09:18

Assignee	Marko Mäkelä [ marko ]	Matthias Leich [ mleich ]
Labels	race recovery	need_rr race recovery

People

Assignee:: Matthias Leich

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2024-10-22 06:50

Updated:: 2024-10-22 09:27

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration