[MDEV-34907] Bogus debug assertion failure in multi-batch recovery while parsing FILE_ records - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.11, 11.4
Fix Version/s: 10.11.10, 11.2.6, 11.4.4, 11.6.2
Component/s: Backup, Storage Engine - InnoDB
Labels:

Description

mleich produced a data set where a server was running with innodb_log_file_size=96M and innodb_buffer_pool_size=6M. After the server was killed and restarted, it would crash like the following (from a local run using a copy of the data, using an even smaller buffer pool):

10.11 852d42e9933a2760b2542e977f2141d4e80dd8d6
2024-09-10 12:24:46 0 [Note] InnoDB: Small buffer pool size (5.000MiB), the flst_validate() debug function can cause a deadlock if the buffer pool fills up.
2024-09-10 12:24:46 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=19308819
2024-09-10 12:24:46 0 [Note] InnoDB: Ignoring data file './test/#sql-alter-14a918-20.ibd' with space ID 19. Another data file called ./test/t6.ibd exists with the same space ID.
2024-09-10 12:24:46 0 [Note] InnoDB: Multi-batch recovery needed at LSN 21507736
mariadbd: /mariadb/10.11/storage/innobase/log/log0recv.cc:2849: recv_sys_t::parse_mtr_result recv_sys_t::parse(source&, bool) [with source = recv_buf; bool store = false]: Assertion `!file_checkpoint || space_id == TRX_SYS_SPACE || srv_is_undo_tablespace(space_id)' failed.

The debug assertion fails because it expects the file name of tablespace ID 17 to be known. Long before the server was killed, it had dropped this tablespace in a CREATE OR REPLACE TABLE operation and written a FILE_DELETE record.

If I start the recovery with a larger buffer pool, it will recover just fine. The changes that were made in ~~MDEV-29911~~ could possibly be to blame for this.

I am unable to attach a copy of the data directory here, because it would exceed the maximum Jira attachment size. The data set uses encryption, and encrypted data is not compressible.

Attachments

Issue Links

relates to

MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress

Closed

Activity

Marko Mäkelä added a comment - 2024-09-10 10:14

This turns out to be an overly strict debug assertion:

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc
index 2b70501dc11..ee665e3a3a1 100644
--- a/storage/innobase/log/log0recv.cc
+++ b/storage/innobase/log/log0recv.cc
@@ -2846,7 +2846,8 @@ recv_sys_t::parse_mtr_result recv_sys_t::parse(source &l, bool if_exists)
                                    last_offset)
                 : file_name_t::initial_flags;
               if (it == recv_spaces.end())
-                ut_ad(!file_checkpoint || space_id == TRX_SYS_SPACE ||
+                ut_ad(!store ||
+                      !file_checkpoint || space_id == TRX_SYS_SPACE ||
                       srv_is_undo_tablespace(space_id));
               else if (!it->second.space)
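
The effect of the added `!store ||` term can be illustrated with a small standalone model. This is only a sketch, not InnoDB code: `original_check`, `relaxed_check` and the `is_undo` parameter are hypothetical stand-ins for the ut_ad() condition and for srv_is_undo_tablespace(), and the checkpoint value used in the static_assert is arbitrary.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the InnoDB constant: the system tablespace has ID 0.
constexpr uint32_t TRX_SYS_SPACE= 0;

// Model of the original condition, evaluated when the tablespace ID
// is not found in recv_spaces. is_undo is a hypothetical replacement
// for srv_is_undo_tablespace(space_id).
constexpr bool original_check(uint64_t file_checkpoint, uint32_t space_id,
                              bool is_undo)
{
  return !file_checkpoint || space_id == TRX_SYS_SPACE || is_undo;
}

// Model of the relaxed condition: when nothing is being stored
// (store=false), an unknown tablespace ID is tolerated.
constexpr bool relaxed_check(bool store, uint64_t file_checkpoint,
                             uint32_t space_id, bool is_undo)
{
  return !store || original_check(file_checkpoint, space_id, is_undo);
}

// The reported case: store=false, FILE_CHECKPOINT already seen,
// user tablespace 17 unknown. Any nonzero checkpoint value will do.
static_assert(!original_check(1, 17, false), "original condition fails");
static_assert(relaxed_check(false, 1, 17, false), "relaxed condition holds");
```

In this model the relaxed condition still rejects the same situation when store=true, so the check is only loosened for the passes that do not store records.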

In the call stack, we are in a loop that is parsing the log to the end, not storing any other data than file name metadata:

10.11 852d42e9933a2760b2542e977f2141d4e80dd8d6
#8 0x0000560d03ec04ac in recv_sys_t::parse_pmem<false> (if_exists=false) at /mariadb/10.11/storage/innobase/log/log0recv.cc:3117
#9 0x0000560d03ea891c in recv_scan_log (last_phase=false) at /mariadb/10.11/storage/innobase/log/log0recv.cc:4149
#10 0x0000560d03eaae05 in recv_recovery_from_checkpoint_start () at /mariadb/10.11/storage/innobase/log/log0recv.cc:4620

static bool recv_scan_log(bool last_phase)
  skip_the_rest:
    while ((r= recv_sys.parse_pmem<false>(false)) == recv_sys_t::OK);

With the assertion relaxed, the data directory recovers just fine, and CHECK TABLE…EXTENDED does not report any errors for the tables. Some warnings are there about not-purged history in clustered indexes; it seems to be a separate issue from MDEV-29823.

I can reproduce the assertion failure with both pread and mmap (/dev/shm) based ib_logfile0 recovery.

The 10.6 version of ~~MDEV-29911~~ is different, because before ~~MDEV-14425~~ there was a two-stage log parser (first blocks, then records). The corresponding debug assertion would look like the following:

recv_sys_t::parse(lsn_t checkpoint_lsn, store_t *store, bool apply)
      if (it == recv_spaces.end())
        ut_ad(!mlog_checkpoint_lsn || space_id == TRX_SYS_SPACE ||
              srv_is_undo_tablespace(space_id));

Unlike 10.11, the multi-batch recovery would be mostly handled within this function. It seems possible that the above debug assertion could fail in 10.6. Given that the assertion failure in 10.11 is too strict and not a sign of actual trouble, I think that we should leave the assertion in 10.6 unchanged for now. If the assertion fails in a debug build, we will analyze the data set and address that separately.

Debarun Banerjee added a comment - 2024-09-12 06:20

I checked how this assertion is avoided when we are storing the records. A little above it, we have the following check, which takes care of precisely the case where the space might have been deleted later.

    else if (store && file_checkpoint && !is_predefined_tablespace(space_id))
    {
      recv_spaces_t::iterator i= recv_spaces.lower_bound(space_id);
      if (i != recv_spaces.end() && i->first == space_id);
      else if (lsn < file_checkpoint)
        /* We have not seen all records between the checkpoint and
        FILE_CHECKPOINT. There should be a FILE_DELETE for this
        tablespace later. */
        recv_spaces.emplace_hint(i, space_id, file_name_t("", false));

Since for store=false we do go through all the records and fill recv_spaces based on FILE_ records, it might be better to handle this case for store=false as well and emplace the dummy record. We could then use the assertion consistently in all cases.
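
The lookup-then-placeholder pattern quoted above can be tried out in isolation. The following is a minimal sketch under stated assumptions: `file_name`, `spaces_map` and `note_unknown_space` are hypothetical simplifications of file_name_t, recv_spaces_t and the quoted branch, and the `store` condition is deliberately left out to model the proposal of doing the same for store=false.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Simplified stand-in for InnoDB's file_name_t (hypothetical fields).
struct file_name
{
  std::string name;
  bool deleted;
};

using spaces_map= std::map<uint32_t, file_name>;

// If space_id is not yet known and we have not reached the
// FILE_CHECKPOINT LSN, register an empty placeholder entry.
// Returns true if a placeholder was added.
inline bool note_unknown_space(spaces_map &spaces, uint32_t space_id,
                               uint64_t lsn, uint64_t file_checkpoint)
{
  // lower_bound() yields both the lookup result and an insertion hint.
  spaces_map::iterator i= spaces.lower_bound(space_id);
  if (i != spaces.end() && i->first == space_id)
    return false; // already known, nothing to do
  if (lsn >= file_checkpoint)
    return false; // all FILE_ records up to the checkpoint were seen
  spaces.emplace_hint(i, space_id, file_name{"", false});
  return true;
}
```

Reusing the lower_bound() iterator as the hint avoids a second tree descent on insertion, which is presumably why the original code is written this way.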

Marko Mäkelä added a comment - 2024-09-12 07:08

The only caller of the parse function recv_sys_t::parse_pmem (which may be a trivial wrapper for recv_sys_t::parse_mtr) is recv_scan_log(). Two calls are passing the template parameter store=false. The first one is at the very beginning of the parsing, when we are looking for a FILE_CHECKPOINT record. The second one is in the skip_the_rest: loop whose sole purpose is to find all tablespaces.

There are two calls with store=true. One is the initial call when the log checkpoint LSN has been determined and we are about to parse and store the very first record. Another one is in the main loop, which would here be terminated by r == recv_sys_t::GOT_OOM.

The structure of the log is as follows:

1. The checkpoint LSN points to the start of an arbitrary mini-transaction.
2. We may have some log records for modifying files for which a FILE_MODIFY had been written before the checkpoint. These records were "purged" by advancing the checkpoint.
3. At some point the space reserved for recv_sys.pages will run out and we would switch to the skip_the_rest: mode.
4. We encounter a log record for a tablespace that will be deleted a bit later. This would trip the bogus debug assertion.
5. There is a FILE_DELETE record for this tablespace.
6. The "checkpoint end" LSN points to a possibly empty sequence of FILE_MODIFY records and a FILE_CHECKPOINT record. Recovery will parse these first, before rewinding to the checkpoint "start LSN".
7. There typically are further records following the FILE_CHECKPOINT record. These will be processed by recovery after the "rewinding".

The scenario here is that there will be no FILE_MODIFY record written before FILE_CHECKPOINT for the tablespace, because the tablespace had been deleted (and there will be a FILE_DELETE record between the checkpoint start and the FILE_CHECKPOINT record).
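
The scenario can be replayed on a toy log. The sketch below is a deliberately crude model, not the real record format: `rec_type`, `rec` and `unknown_on_rescan` are hypothetical names, records are identified by vector index instead of LSN, and a single generic page record type stands in for all page-level records.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

enum class rec_type { PAGE_WRITE, FILE_MODIFY, FILE_DELETE, FILE_CHECKPOINT };

struct rec
{
  rec_type type;
  uint32_t space_id;
};

// Pass 1 reads the FILE_MODIFY records at the "checkpoint end" position.
// Pass 2 rewinds to the start of the log and reports every tablespace
// that a page record refers to before any FILE_ record has named it.
inline std::set<uint32_t> unknown_on_rescan(const std::vector<rec> &log,
                                            std::size_t checkpoint_end)
{
  std::set<uint32_t> named;
  for (std::size_t i= checkpoint_end; i < log.size(); i++)
  {
    if (log[i].type == rec_type::FILE_CHECKPOINT)
      break;
    if (log[i].type == rec_type::FILE_MODIFY)
      named.insert(log[i].space_id);
  }

  std::set<uint32_t> unknown;
  for (const rec &r : log)
  {
    if (r.type == rec_type::PAGE_WRITE)
    {
      if (!named.count(r.space_id))
        unknown.insert(r.space_id);
    }
    else if (r.type != rec_type::FILE_CHECKPOINT)
      named.insert(r.space_id); // FILE_MODIFY or FILE_DELETE
  }
  return unknown;
}
```

With a log such as page record for 17, FILE_DELETE for 17, page record for 19, FILE_MODIFY for 19, FILE_CHECKPOINT, and the checkpoint end at the FILE_MODIFY, only tablespace 17 is reported: it has no FILE_MODIFY before FILE_CHECKPOINT because it was deleted, so its first page record is encountered while the tablespace is still unknown.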

Let us recall the out-of-memory handling:

static bool recv_scan_log(bool last_phase)
      if (r == recv_sys_t::GOT_OOM)
      {
        ut_ad(!last_phase);
        rewound_lsn= recv_sys.lsn;
        store= false;
        if (recv_sys.scanned_lsn <= 1)
          goto skip_the_rest;

If we run out of memory, by assigning store=false we will ensure that the skip_the_rest: loop in recv_scan_log() will consume all log records until the very end. There may be multiple invocations of that loop if the log to process is larger than innodb_log_buffer_size. At the end of the function we have the following handling:

static bool recv_scan_log(bool last_phase)
      if (r != recv_sys_t::PREMATURE_EOF)
      {
        ut_ad(r == recv_sys_t::GOT_EOF);
      got_eof:
        ut_ad(recv_sys.is_initialised());
        if (recv_sys.scanned_lsn > 1)
        {
          ut_ad(recv_sys.scanned_lsn == recv_sys.lsn);
          break;
        }
        recv_sys.scanned_lsn= recv_sys.lsn;
        sql_print_information("InnoDB: End of log at LSN=" LSN_PF, recv_sys.lsn);
        break;
      }
      // …
    }

  if (last_phase)
  {
    ut_ad(!rewound_lsn);
    ut_ad(recv_sys.lsn >= recv_sys.file_checkpoint);
    log_sys.set_recovered_lsn(recv_sys.lsn);
  }
  else if (rewound_lsn)
  {
    ut_ad(!store);
    ut_ad(recv_sys.file_checkpoint);
    recv_sys.lsn= rewound_lsn;
  }

The assignment of rewound_lsn to recv_sys.lsn will ensure that a subsequent call to recv_scan_log(false) will continue where we left off before switching to the skip_the_rest: loop. In that loop, all we really want and need is to parse all FILE_ records. Other records do not matter at that point; they will be guaranteed to be processed by a subsequent call to recv_scan_log(false).
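
The store, skip and rewind cycle described above can be modelled in a few lines. This is a toy sketch, not the InnoDB implementation: `rec`, `scan_state` and `scan_batch` are hypothetical, vector indexes stand in for LSNs, a fixed `capacity` stands in for running out of space in recv_sys.pages, and `log.size()` rather than 0 serves as the "no rewind needed" value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Toy log record: either a FILE_ record or a page-level record.
struct rec
{
  bool is_file;
  uint32_t space_id;
};

struct scan_state
{
  std::size_t lsn= 0;               // index of the next record to parse
  std::set<uint32_t> spaces;        // tablespaces learned from FILE_ records
  std::vector<std::size_t> applied; // indexes of stored page records
};

// One batch: store page records until capacity is exhausted, remember
// that position, keep parsing only FILE_ records to the end of the log,
// then rewind. Returns true once the whole log has been processed.
inline bool scan_batch(const std::vector<rec> &log, scan_state &st,
                       std::size_t capacity)
{
  assert(capacity > 0);
  std::size_t rewound= log.size(); // log.size() means "no rewind needed"
  std::size_t stored= 0;
  bool store= true;
  for (std::size_t i= st.lsn; i < log.size(); i++)
  {
    if (log[i].is_file)
    {
      st.spaces.insert(log[i].space_id); // always processed
      continue;
    }
    if (!store)
      continue;                          // skip_the_rest: ignore page records
    if (stored == capacity)
    {
      rewound= i;                        // "out of memory": resume here later
      store= false;
      continue;
    }
    st.applied.push_back(i);
    stored++;
  }
  st.lsn= rewound;
  return rewound == log.size();
}
```

Calling scan_batch() repeatedly until it returns true visits every page record exactly once and in order, while all FILE_ records are already known after the first call.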

The insertion of a dummy entry into recv_spaces in the store=true case would come into play before any skip_the_rest: phase was invoked. In fact, that code is necessary if we invoke recovery with a large enough innodb_buffer_pool_size so that a single scan of the log will suffice. That code should be redundant after the completion of the skip_the_rest: handling.

I agree that we could add similar processing for the store=false case, but I do not see how it could be necessary. The motivation of this change in ~~MDEV-29911~~ was to improve the speed of crash recovery and to minimize any memory allocation operations. That is why I made store a template parameter of the parsing function.
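
The idea of making store a template parameter can be shown with a reduced example. This is a sketch under stated assumptions: `rec`, `parse_stats` and `parse` below are hypothetical and only mimic the shape of the real parser.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy log record: either a FILE_ record or a page-level record.
struct rec
{
  bool is_file;
  uint32_t space_id;
};

struct parse_stats
{
  std::size_t file_records;
  std::size_t stored;
};

// Because store is a compile-time constant, parse<false> contains no
// storing code at all after optimization: the branch is resolved by the
// compiler and the allocation in push_back() can never be reached.
template<bool store>
parse_stats parse(const std::vector<rec> &log, std::vector<rec> &pages)
{
  parse_stats s{0, 0};
  for (const rec &r : log)
  {
    if (r.is_file)
    {
      s.file_records++; // FILE_ records are handled in both variants
      continue;
    }
    if (store)
    {
      pages.push_back(r);
      s.stored++;
    }
  }
  return s;
}
```

The two instantiations parse<true> and parse<false> are separate functions, so the store=false variant pays neither for a run-time flag test per record nor for any memory allocation.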

Debarun Banerjee added a comment - 2024-09-12 08:49

marko, thanks for the details. Yes, you have described the flow well. However, we need to take the special INIT_PAGE and FREE_PAGE handling into account. It reminds me of ~~MDEV-34225~~, where the root cause was a code path that was disabled for store=false.

For this specific case, I agree that skipping recv_spaces.emplace_hint(i, space_id, file_name_t("", false)); and relaxing the assertion will not have any functional impact. It would be caught in a later phase with store=true.

Marko Mäkelä added a comment - 2024-09-12 08:54

Let me revise my previous comment by saying that in addition to processing FILE_ records, the skip_the_rest: loop is processing INIT_PAGE and FREE_PAGE records. Upon encountering those records, we will invoke store_freed_or_init_rec(), which will update recv_spaces. Therefore, a dummy record in recv_spaces may indeed be necessary.

I see that recv_sys_t::parse() with store=false is unnecessarily validating the details of page-oriented records other than INIT_PAGE and FREE_PAGE. We had better just skip to the next record and let the subsequent store=true pass take care of further validation. Doing that could significantly speed up multi-batch recovery as well as mariadb-backup --backup (~~MDEV-34850~~).

Marko Mäkelä added a comment - 2024-09-12 09:48

The assertion is failing in a special handling of a WRITE record that is modifying the FSP_SIZE or FSP_SPACE_FLAGS of a tablespace header, while we are trying to apply the changes to the collection recv_spaces. In other words, the workload that triggered this was extending a table right before dropping it (more precisely, recreating it by CREATE OR REPLACE TABLE). There is no need to apply these changes during store=false or within the skip_the_rest: loop. The reason is that by design, there is going to be another round of store=true that will re-process the same records.

I now see that the assertion failure could be considered a sign that the bug ~~MDEV-34225~~ was not completely fixed. That bug was about invoking store_freed_or_init_rec() when store=false, specifically about updating recv_spaces. As far as I understand, it is not strictly necessary to do anything about predefined (system or undo) tablespaces, because those adjustments will be done during a subsequent pass with store=true.

I maintain that my original suggested patch is correct. The reasoning is that the only way we can be missing a FILE_MODIFY record between the "checkpoint end" LSN and the FILE_CHECKPOINT record is that a FILE_DELETE record had been written between the checkpoint start and end. If we took the effort to add a dummy entry to recv_spaces during store=false, that entry would be replaced a little later in fil_name_process() during store=false with a file_name_t::DELETED entry. Yes, store_freed_or_init_rec() could fail to find a recv_spaces entry for the tablespace, but it does not matter, because the file would be deleted anyway.
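
The argument that a dummy entry would make no difference in the end can be checked on a small model. The sketch below rests on assumptions: `status`, `file_name`, `spaces_map` and `on_file_delete` are hypothetical simplifications of file_name_t, recv_spaces and the FILE_DELETE handling in fil_name_process(), and the file name in the usage note is made up.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

enum class status { NORMAL, DELETED };

// Simplified stand-in for InnoDB's file_name_t (hypothetical fields).
struct file_name
{
  std::string name;
  status st;
};

inline bool operator==(const file_name &a, const file_name &b)
{
  return a.name == b.name && a.st == b.st;
}

using spaces_map= std::map<uint32_t, file_name>;

// Model of processing a FILE_DELETE record: the entry for the tablespace
// becomes a DELETED entry, whether or not a placeholder existed before.
inline void on_file_delete(spaces_map &spaces, uint32_t space_id,
                           const std::string &name)
{
  spaces[space_id]= file_name{name, status::DELETED};
}
```

Starting from one map that holds a placeholder for tablespace 17 and one that does not, calling on_file_delete(m, 17, "./test/old.ibd") on both leaves the two maps equal, which is the sense in which the placeholder would be redundant once the FILE_DELETE record has been parsed.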

I will develop a more refined fix that will simplify the store=false workflow.

Marko Mäkelä added a comment - 2024-09-27 07:35

My revised patch, which removes the code path where the assertion failed during the preparatory step of multi-batch recovery, aims to be a performance fix. I just posted to ~~MDEV-34850~~ some results, showing a performance improvement by about 20% in that step.

Marko Mäkelä added a comment - 2024-09-27 10:10

The fix for this bug was just to remove the bogus debug assertion.

The revised patch will be used for addressing ~~MDEV-34850~~.

People

Assignee: Marko Mäkelä

Reporter: Marko Mäkelä

Votes: 0

Watchers: 2

Dates

Created: 2024-09-10 09:33

Updated: 2024-10-27 08:18

Resolved: 2024-09-27 10:09
