[MDEV-31353] InnoDB recovery hangs after reporting corruption - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6, 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
Fix Version/s: 10.6.15, 10.9.8, 10.10.6, 10.11.5, 11.0.3, 11.1.2
Component/s: Backup, Storage Engine - InnoDB
Labels:
- corruption
- hang
- recovery

Description

This failure, which I got while testing ~~MDEV-31350~~, might be possible in 10.6 as well; I did not try it yet. I do not think that the one-line patch of ~~MDEV-31350~~ plays any role here. The original error messages from the test run were:

10.9 44c9008ba65686abf1c82c9166255a8c52d61f74
innodb.recovery_memory 'innodb,release' w49 [ 54 fail ]
Test ended at 2023-05-26 11:29:12

CURRENT_TEST: innodb.recovery_memory
mysqltest: At line 42: query 'CREATE TABLE t1(f1 INT NOT NULL)ENGINE=InnoDB' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'

The server error log says:

10.9 44c9008ba65686abf1c82c9166255a8c52d61f74
2023-05-26 11:29:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
2023-05-26 11:29:11 0 [Note] InnoDB: Multi-batch recovery needed at LSN 293582048
2023-05-26 11:29:11 0 [Note] InnoDB: End of log at LSN=294552683
2023-05-26 11:29:11 0 [Note] InnoDB: To recover: LSN 293785020/294552683; 347 pages
2023-05-26 11:29:11 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
2023-05-26 11:29:11 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]

When I tried start recover a copy of the data directory (in the attached data.tar.xz) using a larger buffer pool (and a single recovery batch), I got a different error instead:

10.9 44c9008ba65686abf1c82c9166255a8c52d61f74
2023-05-26 11:40:49 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
2023-05-26 11:40:49 0 [Note] InnoDB: End of log at LSN=294552683
2023-05-26 11:40:49 0 [Note] InnoDB: To recover: 151 pages
2023-05-26 11:40:49 0 [ERROR] InnoDB: Not applying INSERT_HEAP_DYNAMIC due to corruption on [page id: space=216, page number=4]
2023-05-26 11:40:49 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
2023-05-26 11:40:49 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
2023-05-26 11:40:49 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]

The server would subsequently hang in the main recovery thread:

10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

#8  os_aio_wait_until_no_pending_reads (declare=<optimized out>) at /mariadb/11/storage/innobase/os/os0file.cc:3695

#9  0x00005620d562ba59 in recv_sys_t::wait_for_pool (this=this@entry=0x5620d594e500 <recv_sys>, pages=pages@entry=0) at /mariadb/11/storage/innobase/log/log0recv.cc:2047

I suppose that the real reason for the hang is the infinite loop in the thread that reported the corruption:

10.9 44c9008ba65686abf1c82c9166255a8c52d61f74
#1 LF_BACKOFF () at /mariadb/11/include/my_cpu.h:125
#2 buf_pool_t::corrupted_evict (...) at /mariadb/11/storage/innobase/buf/buf0lru.cc:1247
#3 0x00005620d562662d in recv_recover_page (...) at /mariadb/11/storage/innobase/log/log0recv.cc:3290
#4 0x00005620d56267f4 in IORequest::fake_read_complete (this=0x7f6f4c331308, offset=94699053504392) at /mariadb/11/storage/innobase/log/log0recv.cc:3467
#5 0x00005620d5640a97 in fake_io_callback (c=0x7f6f4c331290) at /mariadb/11/storage/innobase/os/os0file.cc:3434

In the data directory, page 4 of the file data/test/t1.ibd is filled with NUL bytes, so the page should have been recoverable by applying the log records. In the ib_logfile0 there are two checkpoint blocks present: at offset 0x2000 for the latest checkpoint LSN 293063978, and at 0x1000 for an earlier checkpoint LSN 289089329. If I corrupt the latest checkpoint:

dd if=/dev/zero bs=4096 seek=2 count=1 conv=notrunc of=data/ib_logfile0

sql/mariadbd --datadir "$(pwd)/data"

the recovery will start from the earlier checkpoint LSN and fail in the same way:

10.9 44c9008ba65686abf1c82c9166255a8c52d61f74
2023-05-26 12:07:18 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=289089329
2023-05-26 12:07:18 0 [Note] InnoDB: End of log at LSN=294552683
2023-05-26 12:07:18 0 [Note] InnoDB: To recover: 217 pages
2023-05-26 12:07:18 0 [ERROR] InnoDB: Not applying INSERT_HEAP_DYNAMIC due to corruption on [page id: space=216, page number=4]
2023-05-26 12:07:18 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
2023-05-26 12:07:18 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
2023-05-26 12:07:18 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]

This would seem to suggest that the problem is not a race condition around writing log checkpoints but instead something that ~~MDEV-18976~~ should be able to catch. It could be worthwhile to enable that logic in a non-debug build:

diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc

index fb08511126a..de74c5658f3 100644

--- a/storage/innobase/mtr/mtr0mtr.cc

+++ b/storage/innobase/mtr/mtr0mtr.cc

@@ -1020,7 +1020,7 @@ std::pair<lsn_t,mtr_t::page_flush_ahead> mtr_t::do_write()

   ut_ad(!m_latch_ex || log_sys.latch.is_write_locked());

 #endif

-#ifndef DBUG_OFF

+#if 1

do

     if (m_log_mode != MTR_LOG_ALL)

What needs to be fixed in this ticket is the InnoDB hang after corruption was reported.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

data.tar.xz
3.70 MB
2023-05-26 08:55

Issue Links

relates to

MDEV-31354 SIGSEGV in log_sort_flush_list() in InnoDB crash recovery

Closed

MDEV-13542 Crashing on a corrupted page is unhelpful

Closed

MDEV-29911 InnoDB recovery and mariadb-backup --prepare fail to report detailed progress

Closed

MDEV-31350 test innodb.recovery_memory failed on '21 failed attempts to flush a page'

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2023-05-26 10:24

My attempt with the patched mtr0mtr.cc and the ~~MDEV-31350~~ one-liner fix reproduced nothing like ~~MDEV-30479~~, but instead a new crash ~~MDEV-31354~~ near the end of the 88×100-test run:

Spent 149414.713 of 1779 seconds executing testcases

Failure: Failed 1/7600 tests, 99.99% were successful.

Marko Mäkelä added a comment - 2023-05-26 10:24 My attempt with the patched mtr0mtr.cc and the MDEV-31350 one-liner fix reproduced nothing like MDEV-30479 , but instead a new crash MDEV-31354 near the end of the 88×100-test run: Spent 149414.713 of 1779 seconds executing testcases Failure: Failed 1/7600 tests, 99.99% were successful.

Marko Mäkelä added a comment - 2023-05-26 13:05

The following change fixes the hang after recovery reports corruption (but will have to be tested extensively if it would break anything else):

diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc

index 4619786ee8d..bf7de0d7935 100644

--- a/storage/innobase/log/log0recv.cc

+++ b/storage/innobase/log/log0recv.cc

@@ -3279,9 +3279,6 @@ static buf_block_t *recv_recover_page(buf_block_t *block, mtr_t &mtr,

 		     || recv_sys.is_corrupt_log()) && !srv_force_recovery) {

 			if (init) {

 				init->created = false;

-				if (space || block->page.id().page_no()) {

-					block->page.lock.x_lock_recursive();

-				}

 			mtr.discard_modifications();

The current thread should already hold an exclusive latch on the block upon entering this function. This code was added in ~~MDEV-13542~~.

Marko Mäkelä added a comment - 2023-05-26 13:05 The following change fixes the hang after recovery reports corruption (but will have to be tested extensively if it would break anything else): diff --git a/storage/innobase/log/log0recv.cc b/storage/innobase/log/log0recv.cc index 4619786ee8d..bf7de0d7935 100644 --- a/storage/innobase/log/log0recv.cc +++ b/storage/innobase/log/log0recv.cc @@ -3279,9 +3279,6 @@ static buf_block_t *recv_recover_page(buf_block_t *block, mtr_t &mtr, || recv_sys.is_corrupt_log()) && !srv_force_recovery) { if (init) { init->created = false; - if (space || block->page.id().page_no()) { - block->page.lock.x_lock_recursive(); - } } mtr.discard_modifications(); The current thread should already hold an exclusive latch on the block upon entering this function. This code was added in MDEV-13542 .

People

Assignee:: Marko Mäkelä

Reporter:: Marko Mäkelä

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 2023-05-26 09:12

Updated:: 2023-06-02 07:16

Resolved:: 2023-06-02 07:16

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration