  MariaDB Server / MDEV-31350

test innodb.recovery_memory failed on '21 failed attempts to flush a page'

Details

    Description

      After uploading MariaDB 10.11.3 to the Launchpad builders, several of them failed on the test innodb.recovery_memory:

      innodb.recovery_memory 'innodb,release'  w3 [ fail ]  Found warnings/errors in server log file!
              Test ended at 2023-05-11 22:44:39
      line
      2023-05-11 22:44:34 0 [Warning] InnoDB: Difficult to find free blocks in the buffer pool (21 search iterations)! 21 failed attempts to flush a page! Consider increasing innodb_buffer_pool_size. Pending flushes (fsync): 0. 166 OS file reads, 66 OS file writes, 0 OS fsyncs.
      

      Examples from amd64 and arm64 builds:
      https://launchpadlibrarian.net/665736950/buildlog_ubuntu-mantic-arm64.mariadb_1%3A10.11.3-1~ubuntu23.10.1~1683836249.0a0f09bbe32.dev.otto_BUILDING.txt.gz
      https://launchpadlibrarian.net/665721775/buildlog_ubuntu-mantic-amd64.mariadb_1%3A10.11.3-1~ubuntu23.10.1~1683836249.0a0f09bbe32.dev.otto_BUILDING.txt.gz

      I see that thiru worked on this test in February/March 2023; perhaps he would know best what might have regressed here?

    Activity

            This patch looks promising so far. I will let the test run for 88×100 runs (currently about 88×41 passed):

            diff --git a/storage/innobase/buf/buf0flu.cc b/storage/innobase/buf/buf0flu.cc
            index 91c8de3191b..90263757c19 100644
            --- a/storage/innobase/buf/buf0flu.cc
            +++ b/storage/innobase/buf/buf0flu.cc
            @@ -2438,6 +2438,7 @@ static void buf_flush_page_cleaner()
                 else if (buf_pool.ran_out())
                 {
                   buf_pool.page_cleaner_set_idle(false);
            +      buf_pool.get_oldest_modification(0);
                   mysql_mutex_unlock(&buf_pool.flush_list_mutex);
                   n= srv_max_io_capacity;
                   mysql_mutex_lock(&buf_pool.mutex);
            

            This is a regression due to MDEV-26827. Between MDEV-23855 and MDEV-26827, LRU eviction flushing was only initiated by user threads.

            I believe that MDEV-29911 fixed the original reported problem, because crash recovery will use a different way of allocating data pages from the buffer pool. The only additional thing that needs to be fixed is this hang due to MDEV-26827.

            marko Marko Mäkelä added a comment
            marko Marko Mäkelä added a comment (edited)

            My test was interrupted by a different failure, which I reported as MDEV-31353. wc -l mysql-test/var/*/log/mysqld.1.err reported between 16413 and 16844 lines of output for each server instance, so there was no hang.

            10.9 44c9008ba65686abf1c82c9166255a8c52d61f74 with patch

            innodb.recovery_memory 'innodb,release'  w49 [ 54 fail ]
                    Test ended at 2023-05-26 11:29:12
             
            CURRENT_TEST: innodb.recovery_memory
            mysqltest: At line 42: query 'CREATE TABLE t1(f1 INT NOT NULL)ENGINE=InnoDB' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
            

            The server error log says:

            10.9 44c9008ba65686abf1c82c9166255a8c52d61f74 with patch

            2023-05-26 11:29:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
            2023-05-26 11:29:11 0 [Note] InnoDB: Multi-batch recovery needed at LSN 293582048
            2023-05-26 11:29:11 0 [Note] InnoDB: End of log at LSN=294552683
            2023-05-26 11:29:11 0 [Note] InnoDB: To recover: LSN 293785020/294552683; 347 pages
            2023-05-26 11:29:11 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
            2023-05-26 11:29:11 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
            

            The data file is indeed test/t1.ibd, which carries the tablespace ID 216, and innochecksum does not report it as corrupted. However, an attempted single-batch recovery of the saved data directory fails as well; in fact, it hangs after reporting the failure, which is to be fixed in MDEV-31353.


            I reproduced the same hang on 10.6, with the page cleaner constantly executing this call stack:

            #3  buf_flush_LRU_list_batch (max=2000, evict=false, n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1275
            #4  buf_do_LRU_batch (max=2000, evict=false, n=<optimized out>) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1382
            #5  buf_flush_LRU (max_n=max_n@entry=2000, evict=false) at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:1728
            #6  0x0000556f50449cc4 in buf_flush_page_cleaner () at /mariadb/10.6/storage/innobase/buf/buf0flu.cc:2337
            

            Again, buf_pool.free was empty, and all 210 entries of buf_pool.LRU or buf_pool.flush_list were actually clean (oldest_modification_ was 1).

            marko Marko Mäkelä added a comment

            A test with the one-liner fix applied on 10.6 RelWithDebInfo passed uneventfully:

            innodb.recovery_memory 'innodb,release'  w42 [ 100 pass ]  14777
            --------------------------------------------------------------------------
            The servers were restarted 0 times
            Spent 143884.706 of 1701 seconds executing testcases
             
            Completed: All 8800 tests were successful.
            

            marko Marko Mäkelä added a comment

            Note: the unscheduled releases 10.6.14, 10.9.7, and so on only fix the regression due to MDEV-26827, but they will not include MDEV-29911. Therefore, in those releases the test may fail as described.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: otto Otto Kekäläinen
              Votes: 0
              Watchers: 4
