Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31353

InnoDB recovery hangs after reporting corruption

    XMLWordPrintable

Details

    Description

      This failure, which I got while testing MDEV-31350, might be possible in 10.6 as well; I did not try it yet. I do not think that the one-line patch of MDEV-31350 plays any role here. The original error messages from the test run were:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      innodb.recovery_memory 'innodb,release'  w49 [ 54 fail ]
              Test ended at 2023-05-26 11:29:12
       
      CURRENT_TEST: innodb.recovery_memory
      mysqltest: At line 42: query 'CREATE TABLE t1(f1 INT NOT NULL)ENGINE=InnoDB' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
      

      The server error log says:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      2023-05-26 11:29:11 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
      2023-05-26 11:29:11 0 [Note] InnoDB: Multi-batch recovery needed at LSN 293582048
      2023-05-26 11:29:11 0 [Note] InnoDB: End of log at LSN=294552683
      2023-05-26 11:29:11 0 [Note] InnoDB: To recover: LSN 293785020/294552683; 347 pages
      2023-05-26 11:29:11 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
      2023-05-26 11:29:11 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
      

      When I tried start recover a copy of the data directory (in the attached data.tar.xz) using a larger buffer pool (and a single recovery batch), I got a different error instead:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      2023-05-26 11:40:49 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=293063978
      2023-05-26 11:40:49 0 [Note] InnoDB: End of log at LSN=294552683
      2023-05-26 11:40:49 0 [Note] InnoDB: To recover: 151 pages
      2023-05-26 11:40:49 0 [ERROR] InnoDB: Not applying INSERT_HEAP_DYNAMIC due to corruption on [page id: space=216, page number=4]
      2023-05-26 11:40:49 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      2023-05-26 11:40:49 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
      2023-05-26 11:40:49 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
      

      The server would subsequently hang in the main recovery thread:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      #8  os_aio_wait_until_no_pending_reads (declare=<optimized out>) at /mariadb/11/storage/innobase/os/os0file.cc:3695
      #9  0x00005620d562ba59 in recv_sys_t::wait_for_pool (this=this@entry=0x5620d594e500 <recv_sys>, pages=pages@entry=0) at /mariadb/11/storage/innobase/log/log0recv.cc:2047
      

      I suppose that the real reason for the hang is the infinite loop in the thread that reported the corruption:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      #1  LF_BACKOFF () at /mariadb/11/include/my_cpu.h:125
      #2  buf_pool_t::corrupted_evict (...) at /mariadb/11/storage/innobase/buf/buf0lru.cc:1247
      #3  0x00005620d562662d in recv_recover_page (...) at /mariadb/11/storage/innobase/log/log0recv.cc:3290
      #4  0x00005620d56267f4 in IORequest::fake_read_complete (this=0x7f6f4c331308, offset=94699053504392) at /mariadb/11/storage/innobase/log/log0recv.cc:3467
      #5  0x00005620d5640a97 in fake_io_callback (c=0x7f6f4c331290) at /mariadb/11/storage/innobase/os/os0file.cc:3434
      

      In the data directory, page 4 of the file data/test/t1.ibd is filled with NUL bytes, so the page should have been recoverable by applying the log records. In the ib_logfile0 there are two checkpoint blocks present: at offset 0x2000 for the latest checkpoint LSN 293063978, and at 0x1000 for an earlier checkpoint LSN 289089329. If I corrupt the latest checkpoint:

      dd if=/dev/zero bs=4096 seek=2 count=1 conv=notrunc of=data/ib_logfile0
      sql/mariadbd --datadir "$(pwd)/data"
      

      the recovery will start from the earlier checkpoint LSN and fail in the same way:

      10.9 44c9008ba65686abf1c82c9166255a8c52d61f74

      2023-05-26 12:07:18 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=289089329
      2023-05-26 12:07:18 0 [Note] InnoDB: End of log at LSN=294552683
      2023-05-26 12:07:18 0 [Note] InnoDB: To recover: 217 pages
      2023-05-26 12:07:18 0 [ERROR] InnoDB: Not applying INSERT_HEAP_DYNAMIC due to corruption on [page id: space=216, page number=4]
      2023-05-26 12:07:18 0 [ERROR] InnoDB: Set innodb_force_recovery=1 to ignore corruption.
      2023-05-26 12:07:18 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
      2023-05-26 12:07:18 0 [ERROR] InnoDB: Unable to apply log to corrupted page [page id: space=216, page number=4]
      

      This would seem to suggest that the problem is not a race condition around writing log checkpoints but instead something that MDEV-18976 should be able to catch. It could be worthwhile to enable that logic in a non-debug build:

      diff --git a/storage/innobase/mtr/mtr0mtr.cc b/storage/innobase/mtr/mtr0mtr.cc
      index fb08511126a..de74c5658f3 100644
      --- a/storage/innobase/mtr/mtr0mtr.cc
      +++ b/storage/innobase/mtr/mtr0mtr.cc
      @@ -1020,7 +1020,7 @@ std::pair<lsn_t,mtr_t::page_flush_ahead> mtr_t::do_write()
         ut_ad(!m_latch_ex || log_sys.latch.is_write_locked());
       #endif
       
      -#ifndef DBUG_OFF
      +#if 1
         do
         {
           if (m_log_mode != MTR_LOG_ALL)
      

      What needs to be fixed in this ticket is the InnoDB hang after corruption was reported.

      Attachments

        1. data.tar.xz
          3.70 MB
          Marko Mäkelä

        Issue Links

          Activity

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.