[MDEV-33098] The test innodb.doublewrite_debug occasionally fails to start up InnoDB Created: 2023-12-20 Updated: 2024-01-08 Resolved: 2024-01-03 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB, Tests |
| Affects Version/s: | 10.5, 10.6, 10.11, 11.0 |
| Fix Version/s: | 10.5.24, 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3, 11.3.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Thirunarayanan Balathandayuthapani |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
The test innodb.doublewrite_debug that was recently added in
|
| Comments |
| Comment by Marko Mäkelä [ 2023-12-21 ] | |||||||||||||||||||
|
I analyzed one failure on 10.11 in more detail. It seems to me that there unexpectedly was a log checkpoint right at the time when the server was killed. This seems to have revealed a code bug, which I think is already being addressed in
The code bug here is that we are attempting to restore a page from the doublewrite buffer, even though there had been no server crash (the ib_logfile0 is logically empty). I would like to see how this would behave with the In the ib_logfile0 right before offset 81124 (0x13ce4) we have the following:
At 0x13cd4 we have the 16-byte FILE_CHECKPOINT record comprising the LSN 0x13cd4, followed by the end marker 0x01 and a CRC-32C checksum. From the recovery point of view, the server that had been killed had apparently been shut down cleanly, and no doublewrite recovery should have been invoked. We should have failed to start up InnoDB due to a corruption. The test bug here is that we do not detect that an unexpected log checkpoint occurred, and carry on to corrupt the page. We should either fail or skip the test in that case. thiru has some ideas on how to avoid the log checkpoint harder. That may work, but I would also double check in the test that no checkpoint occurred. Something like this should make the test fail in a more obvious way (instead of being silently skipped, which was supposed to be fixed in
| |||||||||||||||||||
| Comment by Thirunarayanan Balathandayuthapani [ 2023-12-22 ] | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-12-22 ] | |||||||||||||||||||
|
OK to push after addressing my review comments. On the merge from 10.5 to 10.6, we have to reenable the test. On the merge to 11.0, we should revert previously made changes to doublewrite_debug.combinations. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-03 ] | |||||||||||||||||||
|
Unfortunately, on merge to 10.6 (and re-enabling the test) there was a failure. There was no log checkpoint, but one of the two intentionally corrupted pages was not restored from the doublewrite buffer:
The above is from mysql-test/var/16/log/mysqld.1.err in the attached var.tar.gz | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-03 ] | |||||||||||||||||||
|
In var.tar.gz
All of these are for the system tablespace, as we can see by the following:
We can ignore the copy of page 0 at 0x248000 because its FIL_SPACE_LSN 0xe553 is smaller than the one starting at 0x210000. It is possible that this failure would be fixed by | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-03 ] | |||||||||||||||||||
|
It turns out that the page 1 had actually been written out slightly before the latest log checkpoint, and there was nothing to be recovered from the doublewrite buffer. The main difference between 10.5 and 10.6 with respect to this test is that the global change that had been made to mtr in I do not know the exact mechanism, but if we look at the relevant part of the test, we can see that the automatic recalculation of InnoDB statistics could interfere with it, essentially causing a race condition with no_checkpoint_start.inc:
The test expects that there are absolutely no InnoDB writes between the last two source directives. I assume that the update of statistics as well as a log checkpoint could somehow run concurrently with the no_checkpoint_start.inc and ruin the logic that is supposed to detect unexpected log checkpoints. | |||||||||||||||||||
| Comment by Marko Mäkelä [ 2024-01-08 ] | |||||||||||||||||||
|
I noticed that the test fails rather often on a combined ASAN+UBSAN build. I was able to reproduce the failure on 10.6, first without rr (running --parallel=120, or 3 times what nproc reports) and then with rr (running only 9 tests concurrently, to avoid running out of RAM disk). The test turned out to fail due to two bugs in debug instrumentation, which I have now fixed in 10.5 and merged to 10.6. We failed to wait for the write of page 1 to be completed, and to actually prevent a log checkpoint from being executed by the page cleaner thread. |