Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31353

InnoDB recovery hangs after reporting corruption

    XMLWordPrintable

Details

    Description

      This failure, which I got while testing MDEV-31350, might be possible in 10.6 as well; I did not try it yet. I do not think that the one-line patch of MDEV-31350 plays any role here. The original error messages from the test run were:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      innodb.recovery_memory 'innodb,release'  w49 [ 54 fail ]
              Test ended at 2023-05-26 11:29:12
       
      CURRENT_TEST: innodb.recovery_memory
      mysqltest: At line 42: query 'CREATE TABLE t1(f1 INT NOT NULL)ENGINE=InnoDB' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
      

      The server error log says:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      2023-05-26 11:29:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
      2023-05-26 11:29:11 0 [Note] InnoDB: Multi-batch recovery needed at LSN 293582048
      2023-05-26 11:29:11 0 [Note] InnoDB: End of log at LSN=294552683
      2023-05-26 11:29:11 0 [Note] InnoDB: To recover: LSN 293785020/294552683; 347 pages
      2023-05-26 11:29:11 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
      2023-05-26 11:29:11 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
      

      When I tried start recover a copy of the data directory (in the attached data.tar.xz) using a larger buffer pool (and a single recovery batch), I got a different error instead:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      2023-05-26 11:40:49 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
      2023-05-26 11:40:49 0 [Note] InnoDB: End of log at LSN=294552683
      2023-05-26 11:40:49 0 [Note] InnoDB: To recover: 151 pages
      2023-05-26 11:40:49 0 [ERROR] InnoDB: Not applying INSERT_HEAP_DYNAMIC due to corruption on [page id: space=216, page number=4]
      2023-05-26 11:40:49 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      2023-05-26 11:40:49 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
      2023-05-26 11:40:49 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
      

      The server would subsequently hang in the main recovery thread:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      #8  os_aio_wait_until_no_pending_reads (declare=<optimized out>) at /mariadb/11/storage/innobase/os/os0file.cc:3695
      #9  0x00005620d562ba59 in recv_sys_t::wait_for_pool (this=this@entry=0x5620d594e500 <recv_sys>, pages=pages@entry=0) at /mariadb/11/storage/innobase/log/log0recv.cc:2047
      

      I suppose that the real reason for the hang is the infinite loop in the thread that reported the corruption:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      #1  LF_BACKOFF () at /mariadb/11/include/my_cpu.h:125
      #2  buf_pool_t::corrupted_evict (...) at /mariadb/11/storage/innobase/buf/buf0lru.cc:1247
      #3  0x00005620d562662d in recv_recover_page (...) at /mariadb/11/storage/innobase/log/log0recv.cc:3290
      #4  0x00005620d56267f4 in IORequest::fake_read_complete (this=0x7f6f4c331308, offset=94699053504392) at /mariadb/11/storage/innobase/log/log0recv.cc:3467
      #5  0x00005620d5640a97 in fake_io_callback (c=0x7f6f4c331290) at /mariadb/11/storage/innobase/os/os0file.cc:3434
      

      In the data directory, page 4 of the file data/test/t1.ibd is filled with NUL bytes, so the page should have been recoverable by applying the log records. In the ib_logfile0 there are two checkpoint blocks present: at offset 0x2000 for the latest checkpoint LSN 293063978, and at 0x1000 for an earlier checkpoint LSN 289089329. If I corrupt the latest checkpoint:

      dd if=/dev/zero bs=4096 seek=2 count=1 conv=notrunc of=data/ib_logfile0
      sql/mariadbd --datadir "$(pwd)/data"
      

      the recovery will start from the earlier checkpoint LSN and fail in the same way:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      2023-05-26 12:07:18 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=289089329
      2023-05-26 12:07:18 0 [Note] InnoDB: End of log at LSN=294552683
      2023-05-26 12:07:18 0 [Note] InnoDB: To recover: 217 pages
      2023-05-26 12:07:18 0 [ERROR] InnoDB: Not applying INSERT_HEAP_DYNAMIC due to corruption on [page id: space=216, page number=4]
      2023-05-26 12:07:18 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      2023-05-26 12:07:18 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
      2023-05-26 12:07:18 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
      

      This would seem to suggest that the problem is not a race condition around writing log checkpoints but instead something that MDEV-18976 should be able to catch. It could be worthwhile to enable that logic in a non-debug build:

      diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc
      index fb08511126a..de74c5658f3 100644
      --- a/storage/innobase/mtr/mtr0mtr.cc
      +++ b/storage/innobase/mtr/mtr0mtr.cc
      @@ -1020,7 +1020,7 @@ std::pair<lsn_t,mtr_t::page_flush_ahead> mtr_t::do_write()
         ut_ad(!m_latch_ex || log_sys.latch.is_write_locked());
       #endif
       
      -#ifndef DBUG_OFF
      +#if 1
         do
         {
           if (m_log_mode != MTR_LOG_ALL)
      

      What needs to be fixed in this ticket is the InnoDB hang after corruption was reported.

      Attachments

        Issue Links

          Activity

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.