Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35699

Multi-batch recovery occasionally fails

Details

    Description

      The test innodb.innodb-index started to fail recently with the 10.11 and later branches. The earliest such failure that I found in the cross-reference is the following:

      bb-10.11-MDEV-35072 3d70e432618f64231420a2c109bb3fb3687aae08

      innodb.innodb-index                      w8 [ fail ]
              Test ended at 2024-10-09 15:57:37
       
      CURRENT_TEST: innodb.innodb-index
      mysqltest: At line 1156: query 'SELECT * FROM t1' failed: ER_UNKNOWN_STORAGE_ENGINE (1286): Unknown storage engine 'InnoDB'
      

      This is 1 commit ahead of 10.11 8a6a4c947a0ca3d2fdca752d7440bdc5c6c83e37. The regression should have been introduced or exposed in some earlier change. Unfortunately, no server error logs are available for that failure. The oldest failure with a server error log that I could find was this one:
      https://buildbot.mariadb.net/buildbot/builders/kvm-rpm-fedora38-aarch64/builds/1747/steps/mtr/logs/mysqld.1.err.3

      10.11 cca85eb16e052f6ffc1d091cfba759fc2ba92ebf

      2024-11-01 19:37:08 0 [Note] InnoDB: Using ARMv8 crc32 + pmull instructions
      2024-11-01 19:37:08 0 [Note] InnoDB: Using liburing
      2024-11-01 19:37:08 0 [Note] InnoDB: Initializing buffer pool, total size = 8.000MiB, chunk size = 1.000MiB
      2024-11-01 19:37:08 0 [Note] InnoDB: Completed initialization of buffer pool
      2024-11-01 19:37:08 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
      2024-11-01 19:37:08 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=26115605
      2024-11-01 19:37:08 0 [Note] InnoDB: Multi-batch recovery needed at LSN 28893886
      2024-11-01 19:37:08 0 [Note] InnoDB: End of log at LSN=29667122
      2024-11-01 19:37:08 0 [ERROR] InnoDB: Tablespace 231 was not found at ./test/t2.ibd.
      2024-11-01 19:37:08 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace.
      2024-11-01 19:37:08 0 [ERROR] InnoDB: Plugin initialization aborted with error Tablespace not found
      

      Another failure: https://ci.mariadb.org/52081/logs/amd64-sles-1505/nm/mysqld.1.err.11 for https://buildbot.mariadb.org/#/builders/841/builds/44

      10.11 a54692a4b5bab8dc3aaac039bf9f9cb84bb2c9bf

      2024-11-21 10:52:18 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
      2024-11-21 10:52:18 0 [Note] InnoDB: Initializing buffer pool, total size = 8.000MiB, chunk size = 1.000MiB
      2024-11-21 10:52:18 0 [Note] InnoDB: Completed initialization of buffer pool
      2024-11-21 10:52:18 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
      2024-11-21 10:52:18 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=23961029
      2024-11-21 10:52:18 0 [Note] InnoDB: Multi-batch recovery needed at LSN 26833191
      2024-11-21 10:52:18 0 [Note] InnoDB: End of log at LSN=29586784
      2024-11-21 10:52:18 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore corrupted pages.
      2024-11-21 10:52:18 0 [ERROR] InnoDB: Unable to apply log to corrupted page 245 in file ./ibdata1
      2024-11-21 10:52:18 0 [Note] InnoDB: Retry with innodb_force_recovery=5
      2024-11-21 10:52:18 0 [ERROR] InnoDB: Plugin initialization aborted with error Page read from tablespace is corrupted.
      

      Attachments

        Issue Links

          Activity

            This failure https://ci.mariadb.org/52862/logs/amd64-debian-11-aocc/nm/mysqld.1.err.13 looks a bit curious, because the need for multi-batch recovery is noted twice:

            10.11 refs/pull/3692/merge

            2024-12-18  6:44:19 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=23960586
            2024-12-18  6:44:19 0 [Note] InnoDB: Multi-batch recovery needed at LSN 26739484
            2024-12-18  6:44:19 0 [Note] InnoDB: End of log at LSN=29586893
            2024-12-18  6:44:19 0 [Note] InnoDB: Multi-batch recovery needed at LSN 29347932
            2024-12-18  6:44:19 0 [ERROR] InnoDB: Tablespace 231 was not found at ./test/t2.ibd.
            2024-12-18  6:44:19 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace.
            

            The improved logic in MDEV-29911 was implemented differently for 10.6 and 10.11.

            marko Marko Mäkelä added a comment - This failure https://ci.mariadb.org/52862/logs/amd64-debian-11-aocc/nm/mysqld.1.err.13 looks a bit curious, because the need for multi-batch recovery is noted twice: 10.11 refs/pull/3692/merge 2024-12-18 6:44:19 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=23960586 2024-12-18 6:44:19 0 [Note] InnoDB: Multi-batch recovery needed at LSN 26739484 2024-12-18 6:44:19 0 [Note] InnoDB: End of log at LSN=29586893 2024-12-18 6:44:19 0 [Note] InnoDB: Multi-batch recovery needed at LSN 29347932 2024-12-18 6:44:19 0 [ERROR] InnoDB: Tablespace 231 was not found at ./test/t2.ibd. 2024-12-18 6:44:19 0 [Note] InnoDB: Set innodb_force_recovery=1 to ignore this and to permanently lose all changes to the tablespace. The improved logic in MDEV-29911 was implemented differently for 10.6 and 10.11.

            I had the uncommitted test files m-MDEV-34156-bis.opt and m-MDEV-34156-bis.test lying around on one system. They reproduce a recovery failure on my system with 10.6 but not with 10.11, and not with 10.11 if I start with a larger innodb_buffer_pool_size so that the recovery will happen in one batch. This seems to be directly related to MDEV-29911, which was implemented differently between 10.6 and 10.11. In 10.11, we seem to fail to parse the entire log before starting to apply changes. The failure of this test looks like the following:

            10.11 48b724047e4638e4795019e69d90fb26010d924a

            2025-01-07 14:36:51 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=61715
            2025-01-07 14:36:51 0 [Note] InnoDB: Multi-batch recovery needed at LSN 2091682
            2025-01-07 14:36:51 0 [Note] InnoDB: End of log at LSN=21139186
            2025-01-07 14:36:51 0 [Note] InnoDB: To recover: LSN 2409196/21139186; 11 pages
            2025-01-07 14:36:51 0 [ERROR] [FATAL] InnoDB: Trying to write 16384 bytes at 10485760 outside the bounds of the file: ./test/t1.ibd
            

            Yes, the symptoms are different from the originally reported failure in this ticket, but the root cause could be the same: we seem to fail to parse the relevant parts of the entire log before starting the first batch of multi-batch recovery.

            marko Marko Mäkelä added a comment - I had the uncommitted test files m-MDEV-34156-bis.opt and m-MDEV-34156-bis.test lying around on one system. They reproduce a recovery failure on my system with 10.6 but not with 10.11, and not with 10.11 if I start with a larger innodb_buffer_pool_size so that the recovery will happen in one batch. This seems to be directly related to MDEV-29911 , which was implemented differently between 10.6 and 10.11. In 10.11, we seem to fail to parse the entire log before starting to apply changes. The failure of this test looks like the following: 10.11 48b724047e4638e4795019e69d90fb26010d924a 2025-01-07 14:36:51 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=61715 2025-01-07 14:36:51 0 [Note] InnoDB: Multi-batch recovery needed at LSN 2091682 2025-01-07 14:36:51 0 [Note] InnoDB: End of log at LSN=21139186 2025-01-07 14:36:51 0 [Note] InnoDB: To recover: LSN 2409196/21139186; 11 pages 2025-01-07 14:36:51 0 [ERROR] [FATAL] InnoDB: Trying to write 16384 bytes at 10485760 outside the bounds of the file: ./test/t1.ibd Yes, the symptoms are different from the originally reported failure in this ticket, but the root cause could be the same: we seem to fail to parse the relevant parts of the entire log before starting the first batch of multi-batch recovery.

            MDEV-29911 seems to be innocent to this after all.

            MDEV-34850 was closed a few weeks before the failures started to occur. It omitted some necessary code from recv_sys_t::parse() for the template parameter storing=NO that is needed for correctly recovering the size of a data file.

            marko Marko Mäkelä added a comment - MDEV-29911 seems to be innocent to this after all. MDEV-34850 was closed a few weeks before the failures started to occur. It omitted some necessary code from recv_sys_t::parse() for the template parameter storing=NO that is needed for correctly recovering the size of a data file.

            I see the recovery path which we made specific for (storing == YES) in MDEV-34850 is needed also for storing == NO. Before MDEV-34850 this path was unconditional but I think you are right that is is not required for storing == BACKUP. So the condition "storing != BACKUP" seems right.

            I still find some path which is partly enabled for storing == NO and added comments. Unless you think that is addressing any functional issue, it is perhaps safer to follow pre-MDEV-34850 for storing=NO path. Please check.

            debarun Debarun Banerjee added a comment - I see the recovery path which we made specific for (storing == YES) in MDEV-34850 is needed also for storing == NO. Before MDEV-34850 this path was unconditional but I think you are right that is is not required for storing == BACKUP. So the condition "storing != BACKUP" seems right. I still find some path which is partly enabled for storing == NO and added comments. Unless you think that is addressing any functional issue, it is perhaps safer to follow pre- MDEV-34850 for storing=NO path. Please check.

            The question was about the OPTION records, which currently are only used by MDEV-18976. I ended up filing MDEV-35796 and including that fix in the branch.

            marko Marko Mäkelä added a comment - The question was about the OPTION records, which currently are only used by MDEV-18976 . I ended up filing MDEV-35796 and including that fix in the branch.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.