[MDEV-31362] recv_sys_t::apply(bool): Assertion `!last_batch || recovered_lsn == scanned_lsn' failed Created: 2023-05-29 Updated: 2023-06-07 Resolved: 2023-05-30 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.6 |
| Fix Version/s: | 10.6.15 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | debug, recovery, rr-profile-analyzed | ||
| Issue Links: |
|
||||||||
| Description |
|
mleich provided a copy of a data directory as well as rr replay traces that leads to a recovery failure with an assertion failure. Unfortunately, the data directory enables encryption and includes a 96 MiB ib_logfile0 that wrapped around once. Because encrypted data does not compress well, a compressed copy of the data directory would be too large to attach here. The rr replay trace from before the crash is of limited use, because rr replay would report a replay divergence near the end of the trace. Right before the 10.6 version of
If I attempt recovery with the fix of
During the time of the assertion failure, we have recv_sys.recovered_lsn==151202728 and recv_sys.scanned_lsn==151209984. The largest observed *contiguous_lsn in recv_scan_log_recs() is 150516736. In the non-crashing run, recovery only proceeded up to 151202977. This would seem to suggest that the "successful" run without MariaDB Server 10.8 and later versions could be unaffected by this exact bug, because the separate log block and log record parsers were unified when |
| Comments |
| Comment by Marko Mäkelä [ 2023-05-30 ] | ||||||||||||||||||||||
|
I conducted some analysis of this. Here are the the first 16 bytes of each 512-byte blocks of the redo log, up to the logical end:
At the file offset 0x3029000 we can find an earlier log block from the first "round" when the 96MiB ib_logfile0 had not yet wrapped around. The last log_crypt() call before the assertion failure is on the block that corresponds to the file offset 0x3028e00:
The two bytes right after the log block number are the payload length: 0x200=512 bytes. At the time of the assertion failure, recv_sys.scanned_lsn is 151209984, which is exactly 151209472+512. The recv_sys.recovered_lsn is less than that, only 151202728, because the last parsed mini-transaction was incomplete. In the log resizing message that I posted yesterday, we can see LSN=151202977. The difference of 249 bytes could be explained by some FILE_MODIFY records and a FILE_CHECKPOINT record that must have been written to the old log as part of a log checkpoint before the log could be safely resized. In other words, the debug assertion is bogus. On a closer read of the 10.6 version of I think that the simplest fix is to remove the debug assertion from 10.6. | ||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-05-30 ] | ||||||||||||||||||||||
|
Relevant for triggering this bogus debug assertion should be that the server was killed while it was writing multiple redo log blocks. We can observe that the last written block ends at a 4096-byte boundary of the file. I assume that most of the time when the server is killed, the last written redo log block would also be recovered as the last block. |