The only caller of the parse function recv_t::parse_pmem (which may be a trivial wrapper for recv_t::parse_mtr) is recv_scan_log(). Two calls are passing the template parameter store=false. The first one is at the very beginning of the parsing, when we are looking for a FILE_CHECKPOINT record. The second one is in the skip_the_rest: loop whose sole purpose is to find all tablespaces.
There are two calls with store=true. One is the initial call when the log checkpoint LSN has been determined and we are about to parse and store the very first record. Another one is in the main loop, which would here be terminated by r == recv_sys_t::GOT_OOM.
The structure of the log is as follows:
- The checkpoint LSN points to the start of an arbitrary mini-transaction.
- We may have some log records for modifying files for which a FILE_MODIFY had been written before the checkpoint. These records were "purged" by advancing the checkpoint.
- At some point the space reserved for recv_sys.pages will run out and we would switch to the skip_the_rest: mode.
- We encounter a log record for a tablespace that will be deleted a bit later. This would trip the bogus debug assertion.
- There is a FILE_DELETE record for this tablespace.
- The "checkpoint end" LSN points to a possibly empty sequence of FILE_MODIFY records and a FILE_CHECKPOINT record. Recovery will parse these first, before rewinding to the checkpoint "start LSN".
- There typically are further records following the FILE_CHECKPOINT record. These will be processed by recovery after the "rewinding".
The scenario here is that there will be no FILE_MODIFY record written before FILE_CHECKPOINT for the tablespace, because the tablespace had been deleted (and there will be a FILE_DELETE record between the checkpoint start and the FILE_CHECKPOINT record).
Let us recall the out-of-memory handling:
static bool recv_scan_log(bool last_phase)
|
if (r == recv_sys_t::GOT_OOM)
|
{
|
ut_ad(!last_phase);
|
rewound_lsn= recv_sys.lsn;
|
store= false;
|
if (recv_sys.scanned_lsn <= 1)
|
goto skip_the_rest;
|
If we run out of memory, by assigning store=false we will ensure that the skip_the_rest: loop in recv_scan_log() will consume all log records until the very end. There may be multiple invocations of that loop if the log to process is larger than innodb_log_buffer_size. At the end of the function we have the following handling:
static bool recv_scan_log(bool last_phase)
|
if (r != recv_sys_t::PREMATURE_EOF)
|
{
|
ut_ad(r == recv_sys_t::GOT_EOF);
|
got_eof:
|
ut_ad(recv_sys.is_initialised());
|
if (recv_sys.scanned_lsn > 1)
|
{
|
ut_ad(recv_sys.scanned_lsn == recv_sys.lsn);
|
break;
|
}
|
recv_sys.scanned_lsn= recv_sys.lsn;
|
sql_print_information("InnoDB: End of log at LSN=" LSN_PF, recv_sys.lsn);
|
break;
|
}
|
// …
|
}
|
|
if (last_phase)
|
{
|
ut_ad(!rewound_lsn);
|
ut_ad(recv_sys.lsn >= recv_sys.file_checkpoint);
|
log_sys.set_recovered_lsn(recv_sys.lsn);
|
}
|
else if (rewound_lsn)
|
{
|
ut_ad(!store);
|
ut_ad(recv_sys.file_checkpoint);
|
recv_sys.lsn= rewound_lsn;
|
}
|
The assignment of rewound_lsn to recv_sys.lsn will ensure that a subsequent call to recv_scan_log(false) will continue where we left off before switching to the skip_the_rest: loop. In that loop, all we really want and need is to parse all FILE_ records. Other records do not matter at that point; they will be guaranteed to be processed by a subsequent call to recv_scan_log(false).
The insertion of a dummy entry into recv_spaces in the store=true case would come into play before any skip_the_rest: phase was invoked. In fact, that code is necessary if we invoke recovery with a large enough innodb_buffer_pool_size so that a single scan of the log will suffice. That code should be redundant after the completion of the skip_the_rest: handling.
I agree that we could add similar processing for the store=false case, but I do not see how it could be necessary. The motivation of this change in MDEV-29911 was to improve the speed of crash recovery and to minimize any memory allocation operations. That is why I made store a template parameter of the parsing function.
This turns out to be a too strict debug assertion:
diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 2b70501dc11..ee665e3a3a1 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2846,7 +2846,8 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists)
last_offset)
: file_name_t::initial_flags;
if (it == recv_spaces.end())
- ut_ad(!file_checkpoint || space_id == TRX_SYS_SPACE ||
+ ut_ad(!store ||
+ !file_checkpoint || space_id == TRX_SYS_SPACE ||
srv_is_undo_tablespace(space_id));
else if (!it->second.space)
In the call stack, we are in a loop that is parsing the log to the end, not storing any other data than file name metadata:
10.11 852d42e9933a2760b2542e977f2141d4e80dd8d6
#8 0x0000560d03ec04ac in recv_sys_t::parse_pmem<false> (if_exists=false) at /mariadb/10.11/storage/innobase/log/log0recv.cc:3117
#9 0x0000560d03ea891c in recv_scan_log (last_phase=false) at /mariadb/10.11/storage/innobase/log/log0recv.cc:4149
#10 0x0000560d03eaae05 in recv_recovery_from_checkpoint_start () at /mariadb/10.11/storage/innobase/log/log0recv.cc:4620
static bool recv_scan_log(bool last_phase)
skip_the_rest:
With the assertion relaxed, the data directory recovers just fine, and CHECK TABLE…EXTENDED does not report any errors for the tables. Some warnings are there about not-purged history in clustered indexes; it seems to be a separate issue from MDEV-29823.
I can reproduce the assertion failure with both pread and mmap (/dev/shm) based ib_logfile0 recovery.
The 10.6 version of
MDEV-29911is different, because beforeMDEV-14425there was a two-stage log parser (first blocks, then records). The corresponding debug assertion would look like the following:recv_sys_t::parse(lsn_t checkpoint_lsn, store_t *store, bool apply)
ut_ad(!mlog_checkpoint_lsn || space_id == TRX_SYS_SPACE ||
srv_is_undo_tablespace(space_id));
Unlike 10.11, the multi-batch recovery would be mostly handled within this function. It seems possible that the above debug assertion could fail in 10.6. Given that the assertion failure in 10.11 is too strict and not a sign of actual trouble, I think that we should leave the assertion in 10.6 unchanged for now. If the assertion fails in a debug build, we will analyze the data set and address that separately.