On Buildbot, we have seen a few assertion failures in mariabackup --prepare, because buf_pool.n_pend_reads is not zero. Matthias Leich was able to reproduce the error under rr record, and I analyzed a case which is reported in
In the analyzed trace, recovery involves 4 page reads. The read callback had finished for 3 of the pages (with no log to apply, because the pages were up to date), but for the last page the read callback was still in progress. Meanwhile, srv_start() went on to initiate shutdown: we failed to wait for all asynchronous reads to complete. This omission could cause log not to be applied to some data pages.
While this error started to occur on Buildbot after we replaced the InnoDB mutexes with native ones (MDEV-21452), I think that it is possible in 10.5, and probably in older versions as well, even though the code was heavily refactored in 10.5, starting with MDEV-16264. The earliest known failure for the 10.6 main branch was as follows:
I think that there are two causes of this bug in 10.5 and 10.6:
- io_callback() was invoking read_slots->release() before invoking the callback function, so a slot could appear free while its read callback was still running (for write callbacks the existing order is OK).
- recv_sys_t::apply() is missing a call to read_slots->wait() before flushing the buffer pool. Because srv_start() calls recv_sys.apply(true), adding the wait should fix the observed race condition (
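
The two points above can be illustrated with a minimal, standalone C++ sketch. It is not the InnoDB code: slot_pool, read_completed() and recover_pages() are hypothetical names invented for this illustration, and plain threads stand in for the asynchronous I/O layer. It only shows the ordering that the fix relies on: the completion handler runs the callback before releasing its slot, and the caller waits for all slots to be released before moving on.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pool of asynchronous read slots.
class slot_pool {
  std::mutex m;
  std::condition_variable cv;
  unsigned in_use = 0;
public:
  void acquire() { std::lock_guard<std::mutex> g(m); ++in_use; }
  void release() {
    std::lock_guard<std::mutex> g(m);
    if (--in_use == 0) cv.notify_all();
  }
  // Block until every acquired slot has been released.
  void wait() {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [this] { return in_use == 0; });
  }
};

// Completion handler: run the callback FIRST, release the slot LAST.
// Releasing first would let wait() return while the callback still runs.
void read_completed(slot_pool& slots, const std::function<void()>& callback) {
  callback();
  slots.release();
}

// Submit n_pages simulated reads, then wait for all of them before
// continuing (the analogue of waiting before flushing the buffer pool).
// Returns how many callbacks had finished when the wait returned.
unsigned recover_pages(unsigned n_pages) {
  slot_pool read_slots;
  std::atomic<unsigned> applied{0};
  std::vector<std::thread> io;
  for (unsigned i = 0; i < n_pages; i++) {
    read_slots.acquire();  // the slot is taken when the read is submitted
    io.emplace_back([&] {
      read_completed(read_slots, [&] {
        // Simulate a slow read callback that applies log to a page.
        std::this_thread::sleep_for(std::chrono::milliseconds(5));
        applied++;
      });
    });
  }
  read_slots.wait();  // the missing wait described in the second point
  const unsigned seen = applied.load();
  for (auto& t : io) t.join();
  return seen;
}
```

In this sketch, swapping the two statements in read_completed(), or dropping the read_slots.wait() call, would allow recover_pages() to observe fewer finished callbacks than submitted reads, which is the analogue of proceeding to shutdown while a read callback is still in progress.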