Details
- Bug
- Status: Closed
- Major
- Resolution: Fixed
- 10.11, 11.4
Description
mleich produced a data set where a server was running with innodb_log_file_size=96M and innodb_buffer_pool_size=6M. After the server was killed and restarted, it would crash like the following (from a local run using a copy of the data, using an even smaller buffer pool):
10.11 852d42e9933a2760b2542e977f2141d4e80dd8d6
2024-09-10 12:24:46 0 [Note] InnoDB: Small buffer pool size (5.000MiB), the flst_validate() debug function can cause a deadlock if the buffer pool fills up.
2024-09-10 12:24:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=19308819
2024-09-10 12:24:46 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-14a918-20.ibd' with space ID 19. Another data file called ./test/t6.ibd exists with the same space ID.
2024-09-10 12:24:46 0 [Note] InnoDB: Multi-batch recovery needed at LSN 21507736
mariadbd: /mariadb/10.11/storage/innobase/log/log0recv.cc:2849: recv_sys_t::parse_mtr_result recv_sys_t::parse(source&, bool) [with source = recv_buf; bool store = false]: Assertion `!file_checkpoint || space_id == TRX_SYS_SPACE || srv_is_undo_tablespace(space_id)' failed.
The debug assertion fails, because we are expecting to know the file name of the tablespace id 17. Long before the server had been killed, it had dropped this tablespace and written a FILE_DELETE record, in a CREATE OR REPLACE TABLE operation.
If I start the recovery with a larger buffer pool, it will recover just fine. The changes that were made in MDEV-29911 could possibly be to blame for this.
I am unable to attach a copy of the data directory here, because it would exceed the maximum Jira attachment size. The data set uses encryption, and encrypted data is not compressible.
Issue Links
- relates to MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress (Closed)
This turns out to be an overly strict debug assertion:
diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 2b70501dc11..ee665e3a3a1 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2846,7 +2846,8 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists)
                      last_offset)
           : file_name_t::initial_flags;
         if (it == recv_spaces.end())
-          ut_ad(!file_checkpoint || space_id == TRX_SYS_SPACE ||
+          ut_ad(!store ||
+                !file_checkpoint || space_id == TRX_SYS_SPACE ||
                 srv_is_undo_tablespace(space_id));
         else if (!it->second.space)
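To make the effect of the one-line change concrete, here is a minimal sketch of the two conditions as standalone predicates. The struct and function names are hypothetical and not part of InnoDB; `file_checkpoint` is modeled as a plain flag, and the `TRX_SYS_SPACE` and undo tablespace checks are collapsed into a single boolean.

```cpp
// Hypothetical stand-ins for the inputs of the assertion; not InnoDB code.
struct parse_state
{
  bool store;                 // true when the parser is storing records
  bool file_checkpoint_seen;  // models a nonzero file_checkpoint
  bool system_or_undo;        // models TRX_SYS_SPACE or an undo tablespace
};

// Condition asserted before the patch when the tablespace ID is not found
// in the file name map.
constexpr bool old_condition(const parse_state &s)
{
  return !s.file_checkpoint_seen || s.system_or_undo;
}

// Condition asserted after the patch: the check only applies when storing.
constexpr bool new_condition(const parse_state &s)
{
  return !s.store || old_condition(s);
}
```

With `store` false, as in the failing run, the new condition holds regardless of the other inputs. With `store` true, the original check still applies unchanged.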
In the call stack, we are in a loop that is parsing the log to the end, not storing any other data than file name metadata:
10.11 852d42e9933a2760b2542e977f2141d4e80dd8d6
#8 0x0000560d03ec04ac in recv_sys_t::parse_pmem<false> (if_exists=false) at /mariadb/10.11/storage/innobase/log/log0recv.cc:3117
#9 0x0000560d03ea891c in recv_scan_log (last_phase=false) at /mariadb/10.11/storage/innobase/log/log0recv.cc:4149
#10 0x0000560d03eaae05 in recv_recovery_from_checkpoint_start () at /mariadb/10.11/storage/innobase/log/log0recv.cc:4620
static bool recv_scan_log(bool last_phase)
skip_the_rest:
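The scan described above, which keeps parsing to the end of the log while retaining only file name metadata, can be pictured with a toy model. This is only an illustration under stated assumptions: the record type, the `budget` parameter and all names below are hypothetical, and the real parser in log0recv.cc works on a binary log format with very different bookkeeping.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified log record; real InnoDB records look different.
struct toy_record
{
  enum kind_t { FILE_NAME, PAGE_CHANGE } kind;
  uint32_t space_id;
  std::string name;  // only meaningful for FILE_NAME records
};

struct toy_scan_result
{
  std::map<uint32_t, std::string> names;  // file name metadata, always kept
  std::vector<toy_record> stored;         // page changes kept for the first batch
  bool multi_batch= false;                // set once the budget was exhausted
};

// Parse every record up to the end of the log. Page changes are stored only
// while the (hypothetical) budget lasts; after that, parsing continues in a
// mode that corresponds to store=false and keeps only file names.
inline toy_scan_result toy_scan(const std::vector<toy_record> &log,
                                size_t budget)
{
  toy_scan_result r;
  bool store= true;
  for (const toy_record &rec : log)
  {
    if (rec.kind == toy_record::FILE_NAME)
    {
      r.names[rec.space_id]= rec.name;
      continue;
    }
    if (store && r.stored.size() == budget)
    {
      store= false;
      r.multi_batch= true;
    }
    if (store)
      r.stored.push_back(rec);
  }
  return r;
}
```

In this model, a small budget plays the role of the small buffer pool: the scan still reaches the end of the log and learns every file name, but page changes past the budget are left for a later batch.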
With the assertion relaxed, the data directory recovers just fine, and CHECK TABLE…EXTENDED does not report any errors for the tables. There are some warnings about unpurged history in clustered indexes; this seems to be a separate issue from MDEV-29823.
I can reproduce the assertion failure with both pread and mmap (/dev/shm) based ib_logfile0 recovery.
The 10.6 version of MDEV-29911 is different, because before MDEV-14425 there was a two-stage log parser (first blocks, then records). The corresponding debug assertion would look like the following:
recv_sys_t::parse(lsn_t checkpoint_lsn, store_t *store, bool apply)
ut_ad(!mlog_checkpoint_lsn || space_id == TRX_SYS_SPACE ||
      srv_is_undo_tablespace(space_id));
Unlike 10.11, the multi-batch recovery would be mostly handled within this function. It seems possible that the above debug assertion could fail in 10.6. Given that the assertion failure in 10.11 is too strict and not a sign of actual trouble, I think that we should leave the assertion in 10.6 unchanged for now. If the assertion fails in a debug build, we will analyze the data set and address that separately.