[MDEV-35225] Bogus debug assertion failures in innodb.innodb-32k-crash

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6, 10.11, 11.2(EOL), 11.4
Fix Version/s: 10.6.20, 10.11.10, 11.2.6, 11.4.4, 11.6.2
Component/s: Storage Engine - InnoDB
Labels:
- debug

Description

sanja noted that while running the test innodb.innodb-32k-crash, some debug assertions that were added to the function log_sort_flush_list() together with the fix of MDEV-31354 fail rather often, like this:

10.6 753e7d6d7ce7770d3c98beb6fdcb97e0e8d1ec9f
innodb.innodb-32k-crash w18 [ fail ]
Test ended at 2024-10-01 10:31:25
…
2024-10-01 10:31:25 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1299959,3973825
2024-10-01 10:31:25 0 [Note] InnoDB: 1 transaction(s) which must be rolled back or cleaned up in total 3 row operations to undo
2024-10-01 10:31:25 0 [Note] InnoDB: Trx id counter is 225
2024-10-01 10:31:25 0 [Note] InnoDB: To recover: 658 pages
mariadbd: /home/buildbot/amd64-ubuntu-2204-debug-ps/build/storage/innobase/log/log0recv.cc:3658: log_sort_flush_list()::<lambda(const buf_page_t, const buf_page_t)>: Assertion `l > 2' failed.

I was able to reproduce this. In the core dump that I analyzed, all 7 members of buf_pool.flush_list carried oldest_modification()==1, that is, the pages had been written back to the file system.

As noted in MDEV-31354, starting with MDEV-25113 it is possible that buf_page_t::oldest_modification() will be updated to 1 by a thread that is not holding buf_pool.flush_list_mutex. The debug assertions that require the LSN to be above 2 must be revised accordingly. As a slight optimization, when we are copying the sorted list back to buf_pool.flush_list, we can omit such blocks.
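The revised check and the omission of written-back blocks can be illustrated with a minimal, self-contained sketch. This is not the InnoDB code: page_sketch, lsn_sketch_t and sort_flush_list_sketch() are hypothetical stand-ins, and both the newest-first ordering and the snapshotting of each LSN before sorting are assumptions of this sketch.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using lsn_sketch_t = std::uint64_t;  // hypothetical stand-in for lsn_t

// Hypothetical stand-in for buf_page_t; only the relevant field is modeled.
struct page_sketch {
  // 1 = already written back (block awaits removal from the list),
  // >2 = LSN of the oldest unflushed modification.
  std::atomic<lsn_sketch_t> oldest_modification_{0};

  lsn_sketch_t oldest_modification() const {
    return oldest_modification_.load(std::memory_order_relaxed);
  }
};

// Returns the blocks sorted by descending LSN, omitting written-back blocks.
std::vector<page_sketch*> sort_flush_list_sketch(
    const std::vector<page_sketch*>& flush_list) {
  using entry = std::pair<lsn_sketch_t, page_sketch*>;
  std::vector<entry> snapshot;
  snapshot.reserve(flush_list.size());

  for (page_sketch* page : flush_list) {
    // Read each LSN exactly once, so that a concurrent change to 1
    // cannot make the sort comparator inconsistent.
    const lsn_sketch_t lsn = page->oldest_modification();
    // Relaxed check: 1 is legitimate, because another thread may have
    // marked the page clean without holding the list mutex.
    assert(lsn == 1 || lsn > 2);
    snapshot.emplace_back(lsn, page);
  }

  std::sort(snapshot.begin(), snapshot.end(),
            [](const entry& a, const entry& b) { return a.first > b.first; });

  std::vector<page_sketch*> sorted;
  for (const entry& e : snapshot) {
    if (e.first != 1) {  // skip blocks that were already written back
      sorted.push_back(e.second);
    }
  }
  return sorted;
}
```

With this shape, a list in which every block carries the value 1, as in the analyzed core dump, simply yields an empty result instead of tripping an assertion.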

The test innodb.innodb-32k-crash also started to fail in another way in 10.6, but not in later versions, due to a bogus debug assertion that was added to recv_recovery_from_checkpoint_start() in MDEV-34830:

ut_ad(log_sys.get_lsn() >= recv_sys.scanned_lsn);

This assertion may fail when the last mini-transaction in the log was not completely written. In that case, recv_sys.scanned_lsn could be a few 512-byte blocks ahead of recv_sys.recovered_lsn, which is the value that matters. In MDEV-14425, these fields were replaced by recv_sys.lsn, and there is no log block layer anymore; each mini-transaction is a logical log block on its own.
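The gap between the two fields can be shown with a simplified model. Everything in this sketch is hypothetical (recv_sketch, scan_sketch() and the parameters are not InnoDB names), block headers and trailers are ignored, and the starting LSN is assumed to be block-aligned. The scan advances in whole 512-byte blocks, while recovery can only advance to the end of the last complete mini-transaction.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

constexpr std::uint64_t BLOCK_SIZE_SKETCH = 512;

// Hypothetical stand-in for the two recv_sys fields discussed above.
struct recv_sketch {
  std::uint64_t scanned_lsn;    // end of the last whole block that was read
  std::uint64_t recovered_lsn;  // end of the last complete mini-transaction
};

// start_lsn: block-aligned LSN where the scan begins (assumption).
// valid_blocks: number of intact 512-byte blocks found in the log.
// mtr_end_lsns: ascending end LSNs of the mini-transactions in the log.
recv_sketch scan_sketch(std::uint64_t start_lsn, unsigned valid_blocks,
                        const std::vector<std::uint64_t>& mtr_end_lsns) {
  recv_sketch r{start_lsn + valid_blocks * BLOCK_SIZE_SKETCH, start_lsn};
  for (std::uint64_t end_lsn : mtr_end_lsns) {
    if (end_lsn > r.scanned_lsn) {
      break;  // the tail of this mini-transaction was never written
    }
    r.recovered_lsn = end_lsn;
  }
  return r;
}
```

For example, with three intact blocks starting at LSN 0 and mini-transactions ending at 400, 600 and 2000, the scan ends at 1536 while recovery stops at 600. Under the assumption that the log resumes from the recovered position, a check of the form "current LSN >= scanned LSN" would fail even though recovery was correct.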

Issue Links

relates to:
- MDEV-35226 InnoDB occasionally fails to recover a corrupted page from the doublewrite buffer (Stalled)
- MDEV-25113 Reduce effect of parallel background flush on select workload (Closed)
- MDEV-31354 SIGSEGV in log_sort_flush_list() in InnoDB crash recovery (Closed)

People

Assignee: Marko Mäkelä

Reporter: Marko Mäkelä

Votes: 0

Watchers: 1

Dates

Created: 2024-10-22 05:55

Updated: 2024-10-22 06:50

Resolved: 2024-10-22 06:42
