  MariaDB Server / MDEV-31350

test innodb.recovery_memory failed on '21 failed attempts to flush a page'

Details

    Description

      After uploading MariaDB 10.11.3 to the Launchpad builders, several of them failed on the test innodb.recovery_memory:

      innodb.recovery_memory 'innodb,release'  w3 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2023-05-11 22:44:39
      line
      2023-05-11 22:44:34 0 [Warning] InnoDB: Difficult to find free blocks in the buffer pool (21 search iterations)! 21 failed attempts to flush a page! Consider increasing innodb_buffer_pool_size. Pending flushes (fsync): 0. 166 OS file reads, 66 OS file writes, 0 OS fsyncs.
      

      Examples from amd64 and arm64 builds:
      https://launchpadlibrarian.net/665736950/buildlog_ubuntu-mantic-arm64.mariadb_1%3A10.11.3-1~ubuntu23.10.1~1683836249.0a0f09bbe32.dev.otto_BUILDING.txt.gz
      https://launchpadlibrarian.net/665721775/buildlog_ubuntu-mantic-amd64.mariadb_1%3A10.11.3-1~ubuntu23.10.1~1683836249.0a0f09bbe32.dev.otto_BUILDING.txt.gz

      I see that thiru worked on this test in February/March 2023; perhaps he would know best what might have regressed here?

    Activity

            This patch looks promising so far. I will let the test run for 88×100 runs (currently about 88×41 passed):

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index 91c8de3191b..90263757c19 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -2438,6 +2438,7 @@ static void buf_flush_page_cleaner()
                 else if (buf_pool.ran_out())
                 {
                   buf_pool.page_cleaner_set_idle(false);
            +      buf_pool.get_oldest_modification(0);
                   mysql_mutex_unlock(&buf_pool.flush_list_mutex);
                   n= srv_max_io_capacity;
                   mysql_mutex_lock(&buf_pool.mutex);
            

            This is a regression due to MDEV-26827. Between MDEV-23855 and MDEV-26827, LRU eviction flushing was only initiated by user threads.

            I believe that MDEV-29911 fixed the original reported problem, because crash recovery will use a different way of allocating data pages from the buffer pool. The only additional thing that needs to be fixed is this hang due to MDEV-26827.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment (edited)

            My test was interrupted by a different failure, which I reported as MDEV-31353. wc -l mysql-test/var/*/log/mysqld.1.err reported between 16413 and 16844 lines of output for each server instance, so there was no hang.

            10.9 44c9008ba65686abf1c82c9166255a8c52d61f74 with patch

            innodb.recovery_memory 'innodb,release'  w49 [ 54 fail ]
                    Test ended at 2023-05-26 11:29:12
             
            CURRENT_TEST: innodb.recovery_memory
            mysqltest: At line 42: query 'CREATE TABLE t1(f1 INT NOT NULL)ENGINE=InnoDB' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
            

            The server error log says:

            10.9 44c9008ba65686abf1c82c9166255a8c52d61f74 with patch

            2023-05-26 11:29:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
            2023-05-26 11:29:11 0 [Note] InnoDB: Multi-batch recovery needed at LSN 293582048
            2023-05-26 11:29:11 0 [Note] InnoDB: End of log at LSN=294552683
            2023-05-26 11:29:11 0 [Note] InnoDB: To recover: LSN 293785020/294552683; 347 pages
            2023-05-26 11:29:11 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
            2023-05-26 11:29:11 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
            

            The data file is indeed test/t1.ibd, which carries the tablespace ID 216, and innochecksum does not report it as corrupted. However, an attempted single-batch recovery of the saved data directory fails as well; in fact, it hangs after reporting the failure, which is to be fixed in MDEV-31353.


            I reproduced the same hang on 10.6, with the page cleaner constantly executing this call stack:

            #3  buf_flush_LRU_list_batch (max=2000, evict=false, n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1275
            #4  buf_do_LRU_batch (max=2000, evict=false, n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1382
            #5  buf_flush_LRU (max_n=max_n@entry=2000, evict=false) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1728
            #6  0x0000556f50449cc4 in buf_flush_page_cleaner () at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:2337
            

            Again, buf_pool.free was empty, and all 210 entries of buf_pool.LRU or buf_pool.flush_list were actually clean (oldest_modification_ was 1).

            marko Marko Mäkelä added a comment

            A test with the one-liner fix applied on 10.6 RelWithDebInfo passed uneventfully:

            innodb.recovery_memory 'innodb,release'  w42 [ 100 pass ]  14777
            --------------------------------------------------------------------------
            The servers were restarted 0 times
            Spent 143884.706 of 1701 seconds executing testcases
             
            Completed: All 8800 tests were successful.
            

            marko Marko Mäkelä added a comment

            Note: the unscheduled releases 10.6.14, 10.9.7, and so on only fix the regression due to MDEV-26827, but they will not include MDEV-29911. Therefore, in those releases the test may fail as described.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: otto Otto Kekäläinen
              Votes: 0
              Watchers: 4
