Details
-
Bug
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
10.5, 10.6, 10.11, 11.0(EOL)
-
None
Description
The test innodb.doublewrite_debug that was recently added in MDEV-32242 is occasionally failing on 10.6 and 10.11 like this:
10.6 98287bd2d6877969514469ebfbf27f9951b99657 |
innodb.doublewrite_debug '32k,innodb,strict_crc32' w28 [ fail ]
|
Test ended at 2023-12-19 16:08:33
|
|
CURRENT_TEST: innodb.doublewrite_debug
|
mysqltest: At line 131: query 'select f1, f2 from t1' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
|
|
The result from queries just before the failure was:
|
< snip >
|
# system tablespace is corrupted and 2nd page as corrupted.
|
set global innodb_log_checkpoint_now = 1;
|
begin;
|
insert into t1 values (6, repeat('%', 400));
|
# Make the first page dirty for system tablespace
|
set global innodb_saved_page_number_debug = 0;
|
set global innodb_fil_make_page_dirty_debug = 0;
|
# Make the second page dirty for system tablespace
|
set global innodb_saved_page_number_debug = 1;
|
set global innodb_fil_make_page_dirty_debug = 0;
|
set global innodb_buf_flush_list_now = 1;
|
# Corrupt the 1st page (page_no=0) and 2nd page of the system tablespace.
|
# restart
|
FOUND 2 /InnoDB: Restoring page \[page id: space=0, page number=0\] of datafile/ in mysqld.1.err
|
FOUND 1 /InnoDB: Recovered page \[page id: space=0, page number=1\]/ in mysqld.1.err
|
check table t1;
|
Table Op Msg_type Msg_text
|
test.t1 check Error Unknown storage engine 'InnoDB'
|
test.t1 check error Corrupt
|
select f1, f2 from t1;
|
Attachments
Issue Links
- relates to
-
MDEV-32242 innodb.doublewrite test case always gets skipped
-
- Closed
-
I analyzed one failure on 10.11 in more detail. It seems to me that there unexpectedly was a log checkpoint right at the time when the server was killed. This seems to have revealed a code bug, which I think is already being addressed in
MDEV-32968:2023-12-20 18:21:02 0 [Note] InnoDB: Completed initialization of buffer pool
2023-12-20 18:21:02 0 [Note] InnoDB: Setting O_DIRECT on file ./ibdata1 failed
2023-12-20 18:21:02 0 [ERROR] InnoDB: Data file './ibdata1' uses page size 16384, but the innodb_page_size start-up parameter is 4096
2023-12-20 18:21:02 0 [Note] InnoDB: Restoring page [page id: space=0, page number=0] of datafile './ibdata1' from the doublewrite buffer. Writing 4096 bytes into file './ibdata1'
2023-12-20 18:21:02 0 [Note] InnoDB: End of log at LSN=81124
2023-12-20 18:21:02 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 1 row operations to undo
2023-12-20 18:21:02 0 [Note] InnoDB: Trx id counter is 31
2023-12-20 18:21:02 0 [Note] InnoDB: 128 rollback segments are active.
2023-12-20 18:21:02 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-12-20 18:21:02 0 [Note] InnoDB: Rolled back recovered transaction 29
2023-12-20 18:21:02 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-12-20 18:21:02 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-12-20 18:21:02 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2023-12-20 18:21:02 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2023-12-20 18:21:02 0 [Note] InnoDB: log sequence number 81124 (memory-mapped); transaction id 33
The code bug here is that we are attempting to restore a page from the doublewrite buffer, even though there had been no server crash (the ib_logfile0 is logically empty). I would like to see how this would behave with the
MDEV-32968fix applied. In 10.11 thanks toMDEV-14425we can make use of memory-mapped log file in /dev/shm. That could make the test fail much more easily than on 10.6 where we write the log via regular system calls.In the ib_logfile0 right before offset 81124 (0x13ce4) we have the following:
013cd0 32 ac 58 e8 fa 00 00 00 00 00 00 00 01 3c d4 01
013ce0 7d 7a 5c 2e b0 01 05 00 2e 2f 74 65 73 74 2f 74
At 0x13cd4 we have the 16-byte FILE_CHECKPOINT record comprising the LSN 0x13cd4, followed by the end marker 0x01 and a CRC-32C checksum. From the recovery point of view, the server that had been killed had apparently been shut down cleanly, and no doublewrite recovery should have been invoked. We should have failed to start up InnoDB due to a corruption.
The test bug here is that we do not detect that an unexpected log checkpoint occurred, and carry on to corrupt the page. We should either fail or skip the test in that case. thiru has some ideas on how to avoid the log checkpoint harder. That may work, but I would also double check in the test that no checkpoint occurred. Something like this should make the test fail in a more obvious way (instead of being silently skipped, which was supposed to be fixed in
MDEV-32242):--let CLEANUP_IF_CHECKPOINT=ERROR_IN_TEST;