[MDEV-22097] Not applying DELETE_ROW_FORMAT_REDUNDANT due to corruption Created: 2020-03-31 Updated: 2020-04-08 Resolved: 2020-04-08 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | mariabackup, Storage Engine - InnoDB |
| Affects Version/s: | 10.5.2 |
| Fix Version/s: | 10.5.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | recovery | ||
| Issue Links: |
|
| Description |
|
In our test environment, we got a case where mariabackup --prepare would fail to apply the log due to corruption on a ROW_FORMAT=REDUNDANT page. With a debugger, I collected the following records (start_lsn, lsn, log record data) for the page from log_phys_t::apply():
The problem was detected at the last record, a DELETE_ROW_FORMAT_REDUNDANT of the first user record on the page. Earlier in the log, the link to the page supremum record in the last record of the page (there were only two records) had been replaced with a null pointer by the last record listed for start_lsn=1883172. Apparently, some log record was not written. We had an rr record of the server process, but unfortunately it became unusable due to some change in the environment (probably related to shared libraries being updated during an operating-system upgrade). We must reproduce this bug so that it can be analyzed with rr replay. |
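The apply step described above can be pictured as a simple model. This is only a sketch of the idea, not the actual log_phys_t::apply() code; the names LogRec, Page, and apply_log are hypothetical, and the page "state" stands in for the real page contents:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model (not the server's code): recovery tracks the page's
// FIL_PAGE_LSN and applies a parsed record only when the record's end LSN
// is newer than what the page already contains, then advances the page LSN.
struct LogRec {
    uint64_t start_lsn;  // LSN where the mini-transaction started
    uint64_t end_lsn;    // LSN right after the record was written
    int delta;           // stand-in for the actual page change
};

struct Page {
    uint64_t fil_page_lsn = 0;  // models the FIL_PAGE_LSN field
    int state = 0;              // stand-in for page contents
};

// Apply records in LSN order, skipping changes the page already has.
void apply_log(Page& page, const std::vector<LogRec>& recs) {
    for (const LogRec& r : recs) {
        if (r.end_lsn <= page.fil_page_lsn)
            continue;  // page already contains this change; skip it
        page.state += r.delta;
        page.fil_page_lsn = r.end_lsn;
    }
}
```

In this model, a single missing log record leaves the page state permanently diverged from what later records expect, which is exactly the symptom reported here.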
| Comments |
| Comment by Matthias Leich [ 2020-03-31 ] |
|
|
| Comment by Marko Mäkelä [ 2020-04-01 ] |
|
New day, new trace. This time, it involved a server kill and restart. The page 9:296 fails to recover. It looks like a clustered index leaf page in a ROW_FORMAT=REDUNDANT table, with innodb_page_size=4k. Here are the records that were applied to the page during recovery, until the error:
There is at least one preceding INSERT operation that inserts a record with an initially null BLOB pointer. It is the second page_apply_delete_redundant() that fails. The reason is that the logged prev_rec is the very last record in the page, pointing to the page supremum. We must be missing a log record write. I have been debugging this all day. The rr replay is good, and in it, the trace appears to start the server right after bootstrap, so we have all the necessary information there. What remains to be done is to find out at which point the page images diverge between recovery and the initial server run. If the data file turns out to be already out of sync at the point where recovery starts, we are out of luck and will need a better trace; I did not check this yet. Debugging was greatly slowed down by |
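The failing condition can be illustrated with a simplified model. This is an assumption-laden sketch, not the server's data structures: a ROW_FORMAT=REDUNDANT page keeps a singly linked record list from infimum through the user records to supremum, and DELETE_ROW_FORMAT_REDUNDANT logs the predecessor (prev_rec) of the record to delete. The names Rec and apply_delete_redundant are hypothetical:

```cpp
#include <cstddef>

// Simplified model of a ROW_FORMAT=REDUNDANT record list.
struct Rec {
    Rec* next = nullptr;     // singly linked list: infimum -> ... -> supremum
    bool is_supremum = false;
};

// Model of applying DELETE_ROW_FORMAT_REDUNDANT: unlink the record that
// follows prev_rec. If prev_rec already points at the page supremum, there
// is no user record to delete, and recovery must flag the page as
// corrupted -- the failure described in the comment above.
bool apply_delete_redundant(Rec* prev_rec) {
    Rec* victim = prev_rec->next;
    if (victim == nullptr || victim->is_supremum)
        return false;  // corrupted: no user record follows prev_rec
    prev_rec->next = victim->next;  // unlink the victim record
    return true;
}
```

In this sketch, the second delete against the same prev_rec fails once the only following user record has already been unlinked, mirroring the second page_apply_delete_redundant() failure.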
| Comment by Marko Mäkelä [ 2020-04-01 ] |
|
At the start of recovery, we have FIL_PAGE_LSN=1809537 on the page. In rr replay, I extracted the page from the buffer pool at this logical time. The image matches the one on recovery, except for the last 8 bytes (the least significant 4 bytes of LSN and the CRC-32C were not yet initialized):
I will continue debugging from this point tomorrow. The goal is to dump the page image after each recovery apply step, to find the corresponding page image in rr replay, and to compare them. Once we find a mismatch, we will know the logical time frame in which a redo log record was wrongly omitted. I already found out that the discrepancy is present at LSN=2537685 (the start of the mini-transaction that delete-marks a clustered index leaf page record and updates its DB_TRX_ID,DB_ROLL_PTR). So, I will have to check at most 10 page images. |
| Comment by Marko Mäkelä [ 2020-04-02 ] |
|
The first write that is not found on recovery is the very first one after the LSN=1809537:
Redo logging was correctly enabled for the operation, and the reason why we did not apply log was that the latest checkpoint was newer than this mini-transaction:
Earlier in the rr replay log, I see the following:
This makes me believe that the broken recovery could somehow involve backup, and that we are witnessing a related issue. mleich, if this run really involved backup, then it should be analyzed in the scope of |
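The skip condition mentioned in this comment can be modeled with a short sketch. This is a hedged illustration, not the server's recovery code; Mtr and records_to_apply are hypothetical names. During recovery, only mini-transactions whose LSN is newer than the latest checkpoint are applied; anything older is assumed to be durably in the data files already, so if a file actually lacks such an old change (as in this bug), recovery cannot repair it:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for a mini-transaction's position in the redo log.
struct Mtr { uint64_t end_lsn; };

// Model of the checkpoint filter: keep only mini-transactions that are
// newer than the latest checkpoint; older ones are never applied because
// their changes are assumed to already be on disk.
std::vector<Mtr> records_to_apply(const std::vector<Mtr>& log,
                                  uint64_t checkpoint_lsn) {
    std::vector<Mtr> out;
    for (const Mtr& m : log)
        if (m.end_lsn > checkpoint_lsn)
            out.push_back(m);
    return out;
}
```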
| Comment by Matthias Leich [ 2020-04-06 ] |
|
The second RQG run with rr trace invokes mariabackup. There is no KILL of any server process. |
| Comment by Marko Mäkelä [ 2020-04-08 ] |
|
While the rr replay trace of this failure was ruined by unattended-upgrades and could not be analyzed further, the fact that it is no longer reproducible with a more recent 10.5 strongly suggests that the cause was the server failing to write a FREE_PAGE record before omitting a page write. That bug was introduced by |