Details
-
Bug
-
Status: Stalled (View Workflow)
-
Critical
-
Resolution: Unresolved
-
10.6
Description
While testing the fix of MDEV-35225, I came up with the following rare failure of the test innodb.innodb-32k-crash:
10.6 1067046b7f4337080e5b59c364c0a53cbeca715e |
2024-10-21 23:10:11 0 [Note] InnoDB: Completed initialization of buffer pool
|
2024-10-21 23:10:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1059287101,1062232449
|
2024-10-21 23:10:11 0 [ERROR] InnoDB: Cannot apply log to [page id: space=704, page number=0] of corrupted file './test/t1.ibd'
|
2024-10-21 23:10:11 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1513] with error Data structure corruption
|
When I tried recovery on a copy of the data directory 32k-recovery.tar.xz, it would succeed each time. The difference to the above would be that there would be a message about doublewrite buffer recovery:
10.6 1067046b7f4337080e5b59c364c0a53cbeca715e |
2024-10-21 23:42:51 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1059287101,1059287101
|
2024-10-21 23:42:51 0 [Note] InnoDB: Trying to recover page [page id: space=704, page number=0] from the doublewrite buffer.
|
2024-10-21 23:42:51 0 [Note] InnoDB: Recovered page [page id: space=704, page number=0] to './test/t1.ibd' from the doublewrite buffer.
|
2024-10-21 23:42:51 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 20 row operations to undo
|
2024-10-21 23:42:51 0 [Note] InnoDB: Trx id counter is 52417
|
2024-10-21 23:42:51 0 [Note] InnoDB: To recover: 698 pages
|
2024-10-21 23:42:51 0 [Note] InnoDB: Resizing redo log from 33554432 to 100663296 bytes; LSN=1062232813
|
I will try to reproduce this failure under rr record so that this can be analyzed and fixed.
Attachments
Issue Links
- relates to
-
MDEV-34830 LSN in the future is not being treated as serious corruption
-
- Closed
-
-
MDEV-35225 Bogus debug assertion failures in innodb.innodb-32k-crash
-
- Closed
-
I reproduced this once or twice more by running the test innodb.innodb-32k-crash. The failure rate is too low. I have been trying to reproduce it with the following patches applied:
diff --git a/storage/innobase/srv/srv0start.cc b/storage/innobase/srv/srv0start.cc
index 727a9f4d4fc..0219bfb8c92 100644
--- a/storage/innobase/srv/srv0start.cc
+++ b/storage/innobase/srv/srv0start.cc
@@ -914,6 +914,7 @@ srv_init_abort_low(
" with error " << err;
}
+ abort();
srv_shutdown_bg_undo_sources();
srv_shutdown_threads();
return(err);
diff --git a/storage/innobase/buf/buf0dblwr.cc b/storage/innobase/buf/buf0dblwr.cc
index 31036bf978d..67bc3c3970f 100644
--- a/storage/innobase/buf/buf0dblwr.cc
+++ b/storage/innobase/buf/buf0dblwr.cc
@@ -464,7 +464,7 @@ void buf_dblwr_t::recover()
goto next_page;
}
- recv_sys.dblwr.pages.clear();
+ // recv_sys.dblwr.pages.clear();
fil_flush_file_spaces();
aligned_free(read_buf);
The latter hunk aims to preserve recv_sys.dblwr.pages intact for post-mortem analysis in a core dump. With rr, neither patch would be strictly necessary.
I just got a failure with both patches applied:
10.6 b38edd09ff1c566f7b4681abfa6794550dfbbbc5 (patched)
innodb.innodb-32k-crash w67 [ 389 fail ]
…
2024-10-22 2:01:46 0 [ERROR] InnoDB: Cannot apply log to [page id: space=1169, page number=0] of corrupted file './test/t1.ibd'
2024-10-22 2:01:46 0 [ERROR] InnoDB: Plugin initialization aborted at srv0start.cc[1514] with error Data structure corruption
At the time of the crash, recv_sys.dblwr.pages is empty, even though there was a page (with page number 0) for that tablespace in the doublewrite buffer:
od -Ax -t x1 var/log/innodb.innodb-32k-crash/mysqld.1/data/ibdata1 | grep -B2 '[08]020 .. .. 00 00 04 91 00 00 04 91'
200000 00 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff
200010 00 00 00 00 69 4d 50 97 00 08 00 00 00 00 00 00
200020 00 00 00 00 04 91 00 00 04 91 00 00 00 00 00 00
The FIL_PAGE_LSN is 0x694d5079 or 1,766,674,583 which is relatively close to the start of the recovery LSN:
2024-10-22 2:01:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1763439252,1766696461
The log_sys.lsn at the time of the failure was 1,766,696,675 (which is ahead of the FIL_PAGE_LSN). The recv_sys.scanned_lsn as well as recv_sys.recovered_lsn were slightly less, 1,766,696,577, still ahead of that FIL_PAGE_LSN.
I also tried to reproduce the recovery failure on 32k-recovery.tar.xz
under rr record for about 2 hours, but I had no luck. The FIL_PAGE_LSN at the start of the first page of test/t1.ibd does mismatch with the 32 least significant bits at the end of the page, and innochecksum is reporting the failure as well.
I would expect RQG to be a better tool for reproducing this. A large innodb_page_size should improve the chances of killing the server during a page write.