[MDEV-30657] [ERROR] InnoDB: Not applying UNDO_APPEND due to corruption during Recovery Created: 2023-02-15 Updated: 2023-02-21 Resolved: 2023-02-16 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | N/A |
| Fix Version/s: | 10.11.3, 11.0.1, 10.5.20, 10.6.13, 10.7.8, 10.8.8, 10.9.6, 10.10.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Matthias Leich | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | regression | ||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Description |
|
|
| Comments |
| Comment by Marko Mäkelä [ 2023-02-15 ] | ||||||||||||||||||||||||||||||||||||||
|
During recovery, the frame of the undo log page 2:0x186 is completely filled with NUL bytes. The LSN is as follows:
Applying the log fails on the very first record, whose length is 1587 bytes. The LSN difference is 49 bytes more, but that could be explained by the terminating NUL byte and 3*(12+4) bytes of framing for three 512-byte log blocks. So, it looks like there is only 1 record in that mini-transaction. In the rr replay trace of the killed server, we should be able to find when this log record was written and what happened to the undo log page between that time and the server kill. It is somewhere between these times:
| ||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-02-15 ] | ||||||||||||||||||||||||||||||||||||||
|
The problematic log record was written in
At this point of time, the undo page pretty much exists in the buffer pool:
We will need to find out what happened to it after this point of time. | ||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-02-15 ] | ||||||||||||||||||||||||||||||||||||||
|
The page was marked clean during undo tablespace truncation:
The undo truncation mini-transaction ending at LSN=158035982 should have been parsed by recovery, because at the end of that trace we have log_sys.lsn._M_i=160484452 and the recovery started with the following:
It looks like we could be dealing with a recovery-side bug similar to | ||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-02-15 ] | ||||||||||||||||||||||||||||||||||||||
|
Surprisingly, a copy of the data directory data.tar.xz
It could be that the recovery problem requires some luck with the timing, or that this problem only affects 10.5 and not 10.6, which uses a compatible ib_logfile0 format. I will check the rr replay trace of the failed recovery in more detail. | ||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-02-15 ] | ||||||||||||||||||||||||||||||||||||||
|
There is a clue in the log of the rr replay run:
This could be a bug in multi-batch recovery of innodb_undo_log_truncate=ON. I can in fact reproduce this on my 10.6 branch when using a tiny buffer pool to force the recovery to be in multiple batches:
| ||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-02-15 ] | ||||||||||||||||||||||||||||||||||||||
|
I am sorry, this is a regression due to an incorrect optimization of memory usage that I had applied in addition to the fix of If recovery happens in multiple batches, it is not sufficient to discard old redo log records for the truncated tablespace while parsing them. We must discard them on every log-applying batch. Reverting that optimization fixes the recovery. What we can optimize is that we can remove any old log records already in recv_sys_t::parse(), to increase the chances that fewer recovery batches may be possible. | ||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-02-16 ] | ||||||||||||||||||||||||||||||||||||||
|
In multi-batch recovery, recv_sys_t::parse() may add obsolete records to recv_sys.pages, which are to be removed in recv_sys_t::apply(). I decided to not optimize that for now, because such an optimization would introduce a data structure lookup each time before a log record is to be buffered. For that, we would need to look up the array truncated_undo_spaces, preceded by 2 comparisons for checking the array bounds. Undo log truncation should be relatively rare. |