[MDEV-28976] InnoDB: Missing FILE_CHECKPOINT Created: 2022-06-29 Updated: 2023-09-05 Resolved: 2022-07-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB, Tests |
| Affects Version/s: | 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10 |
| Fix Version/s: | 10.3.36, 10.4.26, 10.5.17, 10.6.9, 10.7.5, 10.8.4, 10.9.2, 10.10.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | race | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
A test frequently fails like this:
For the above failure on kvm-asan, a copy of the data directory is available as data.tar.xz. To my surprise, the recovery would start from a much later LSN:
The 2 checkpoint blocks contain the following:
That is, the first checkpoint is 0xed25 (60709) and the second one is 0x1bd88f (1824911). Even if I discard the second checkpoint, the recovery will succeed for me:
At the indicated LSN 1800619 (0x1b79ab) of the failure message, I see the end of a mini-transaction that the checkpoint pointed to (0x1b7913). In the
This must have been written by fil_names_clear(). We have a number of FILE_MODIFY records for files that were modified since the checkpoint (0xb0, encoded length, encoded tablespace ID, 0, file name), then a FILE_CHECKPOINT record (0xfa, 0, 0, big-endian 64-bit LSN=0xed25), the end-of-mini-transaction marker (sequence bit=1), and the CRC-32C checksum 0x9e41cb2d. Everything looks perfectly fine.

Most failures occur on FreeBSD, but the above is the first case for which a copy of the data directory is available. Here are two failures for 10.5:
In each case, it seems possible that the latest checkpoint was from the shutdown of the server bootstrap. serg, to me, this looks like a race condition in mtr (the mysql-test-run test harness). I suspect that the new server process was started while the old one was still being killed by the test (while it was writing a checkpoint). Before |
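The FILE_CHECKPOINT record layout described above (0xfa, 0, 0, big-endian 64-bit LSN) can be illustrated with a minimal Python sketch. It is based only on the byte sequence quoted in this description, not on the InnoDB source code; the helper name `parse_file_checkpoint` is hypothetical, and the end-of-mini-transaction marker and CRC-32C checksum are deliberately not modelled.

```python
import struct

# Assumption taken from the description above: a FILE_CHECKPOINT record is
# the byte 0xfa, two zero bytes, and a big-endian 64-bit checkpoint LSN.
FILE_CHECKPOINT_PREFIX = bytes([0xFA, 0x00, 0x00])
RECORD_LENGTH = len(FILE_CHECKPOINT_PREFIX) + 8


def parse_file_checkpoint(record: bytes) -> int:
    """Return the checkpoint LSN stored in a record of the assumed layout."""
    if len(record) < RECORD_LENGTH or record[:3] != FILE_CHECKPOINT_PREFIX:
        raise ValueError("not a FILE_CHECKPOINT record in the assumed layout")
    (lsn,) = struct.unpack(">Q", record[3:RECORD_LENGTH])
    return lsn


# Rebuild the record quoted in the description (LSN=0xed25) and decode it.
record = FILE_CHECKPOINT_PREFIX + (0xED25).to_bytes(8, "big")
lsn = parse_file_checkpoint(record)
print(hex(lsn), lsn)  # 0xed25 60709
```

The decoded value matches the first checkpoint LSN (0xed25, 60709) from the checkpoint blocks, which is consistent with the observation that the record itself looks fine.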
| Comments |
| Comment by Marko Mäkelä [ 2022-07-27 ] | |||||||||||||
|
An attempted fix of
I think that I will add some retry loop there, like we did when InnoDB did not obey --skip-external-locking. It is not ideal that the test harness is not reliable when it comes to killing and restarting processes, but since that is outside my control, I think that it is easiest to work around that deficiency in the InnoDB startup code. | |||||||||||||
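The retry idea could be sketched roughly as follows. This is a minimal Python illustration of a bounded retry loop, not the InnoDB startup code: `acquire_with_retry`, `try_lock`, and the attempt count and delay are hypothetical names and values chosen for the example.

```python
import time
from typing import Callable


def acquire_with_retry(
    try_lock: Callable[[], bool],
    attempts: int = 60,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call try_lock() until it succeeds or the attempts are exhausted.

    try_lock stands in for whatever non-blocking check the startup code
    performs (for example, taking an advisory lock on a data file that a
    dying server process may still hold).
    """
    for attempt in range(attempts):
        if try_lock():
            return True
        # Wait before the next attempt, but not after the final one.
        if attempt + 1 < attempts:
            sleep(delay)
    return False


# Simulate an old server process that releases the file on the third try.
outcomes = iter([False, False, True])
print(acquire_with_retry(lambda: next(outcomes), attempts=5, delay=0.0))  # True
```

Bounding the number of attempts keeps startup from hanging forever if another server process is legitimately running on the same data directory.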
| Comment by Marko Mäkelä [ 2022-07-27 ] | |||||||||||||
|
Already in 10.3, several kill-and-restart tests were affected by this, including tests outside the encryption suite. Apparently, the probability of a log checkpoint occurring close to the server kill is much higher in the encryption tests, which write a relatively large amount of data right before killing the server. | |||||||||||||
| Comment by Marko Mäkelä [ 2022-07-27 ] | |||||||||||||
|
| |||||||||||||
| Comment by Marko Mäkelä [ 2023-04-19 ] | |||||||||||||
|
An additional fix could help with those cases that are not ‘rescued’ by InnoDB advisory file locks. I suspect that some Spider tests could have failed due to a new server being started before the killed server has properly terminated. | |||||||||||||
| Comment by Marko Mäkelä [ 2023-07-05 ] | |||||||||||||
|
|