[MDEV-19346] Assertion `recv_sys->mlog_checkpoint_lsn <= recv_sys->recovered_lsn' failed in recv_parse_log_recs during mariabackup --prepare Created: 2019-04-26 Updated: 2019-05-06 Resolved: 2019-05-03 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2.24 |
| Fix Version/s: | 10.2.24, 10.3.15, 10.4.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Matthias Leich | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Environment: |
Linux Ubuntu 17.10 but most probably unimportant |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Description |
|
|
| Comments |
| Comment by Matthias Leich [ 2019-04-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-04-29 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
table_stress1.test
I helped the reproducibility with the following patch:
For one failure, I am seeing that tablespace ID 346 (table400_innodb_int_autoinc), the space->free_limit is first assigned to 64, then later to 128 (which is causing the assertion failure). The first assignment occurs with the following call stack:
Later, the field is assigned to 0,64,128, all with the following call stacks:
Then, we will read the page 0 again, because the file had been closed by fil_system->LRU meanwhile. At this point, fil_node_t::read_page0() is reading back stale data from page 0. We already have newer data in the buffer pool (and in fil_space_t)! So, we should not replace it with what we re-read from the file. The system tablespace is a special case:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-04-29 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
mleich, I reported and fixed | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-04-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
backup.tar.bz2
The output ends with the following:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-05-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The assertion is complaining that multiple MLOG_CHECKPOINT records are referring to the same checkpoint LSN, from which the recovery started. Here is the relevant part of the ib_log output:
We start by jumping to the metadata (MLOG_FILE_NAME records followed by MLOG_CHECKPOINT) that were written at the completion of the log checkpoint. Then, we rewind to the start of the log, and at an earlier LSN, find another bunch of MLOG_FILE_NAME and MLOG_CHECKPOINT records for the same checkpoint LSN. To track this down, we must instrument log_checkpoint() in the server. Why are we issuing multiple checkpoints for the same LSN, with so many log records in between? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-05-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
If I comment out the failing assertion, the backup will be recovered, and there will be more messages about `MLOG_CHECKPOINT` records for the same checkpoint:
It looks like log_checkpoint() is at the very least doing busy work (writing useless redo log records), and maybe at worst doing something wrong. In a worse scenario, this shares a common root cause with | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-05-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I believe that the log_checkpoint() parameter write_always is unnecessary and should be removed. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-05-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The assertion failure was fixed by removing the unnecessary parameter write_always and changing the code to behave as if it never held. Also, I changed some crash injection code to flush redo log instead of performing log checkpoint. This should improve the crash test coverage, as was already demonstrated by |