Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33295

The test innodb.doublewrite occasionally fails

Details

    Description

      I am not yet sure what is to blame here. The following test invocation would occasionally fail:

      ./mtr --parallel=80 --repeat=10 innodb.doublewrite{,,,,,,,} 
      

      This would fail due to corruption of the tables mysql.innodb_index_stats or mysql.innodb_table_stats, like this:

      10.11 35cc4b6c058aeb905e8d802569444bdfce02193a

      innodb.doublewrite '64k,innodb,strict_crc32' w30 [ 2 fail ]  Found warnings/errors in server log file!
              Test ended at 2024-01-22 16:32:44
      line
      2024-01-22 16:32:32 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/innodb_table_stats.ibd' page [page id: space=1, page number=0]. You may have to recover from a backup.
      2024-01-22 16:32:32 0 [ERROR] InnoDB: File './mysql/innodb_table_stats.ibd' is corrupted
      2024-01-22 16:32:32 0 [ERROR] InnoDB: Failed to read page 0 from file './mysql/innodb_table_stats.ibd': Page read from tablespace is corrupted.
      2024-01-22 16:32:37 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/innodb_table_stats.ibd' page [page id: space=1, page number=0]. You may have to recover from a backup.
      2024-01-22 16:32:37 0 [ERROR] InnoDB: File './mysql/innodb_table_stats.ibd' is corrupted
      2024-01-22 16:32:37 0 [ERROR] InnoDB: Failed to read page 0 from file './mysql/innodb_table_stats.ibd': Page read from tablespace is corrupted.
      2024-01-22 16:32:42 0 [ERROR] InnoDB: Database page corruption on disk or a failed read of file './mysql/innodb_table_stats.ibd' page [page id: space=1, page number=0]. You may have to recover from a backup.
      2024-01-22 16:32:42 0 [ERROR] InnoDB: File './mysql/innodb_table_stats.ibd' is corrupted
      2024-01-22 16:32:42 0 [ERROR] InnoDB: Failed to read page 0 from file './mysql/innodb_table_stats.ibd': Page read from tablespace is corrupted.
      ^ Found warnings in /dev/shm/10.11/mysql-test/var/30/log/mysqld.1.err
      

      The first page of the data file would look as follows:

      od -Ax -t x1 -N 65536 var/log/innodb.doublewrite-64k,innodb,strict_crc32/mysqld.1/data/mysql/innodb_table_stats.ibd 
      000000 01 00 00 00 00 00 00 00 ff ff ff ff ff ff ff ff
      000010 00 00 00 00 00 00 b3 d3 00 08 00 00 00 00 00 00
      000020 00 00 00 00 00 01 00 00 00 01 00 00 00 00 00 00
      000030 00 04 00 00 00 40 00 00 01 e1 00 00 00 04 00 00
      000040 00 00 ff ff ff ff 00 00 ff ff ff ff 00 00 00 00
      000050 00 01 00 00 00 00 00 9e 00 00 00 00 00 9e 00 00
      000060 00 00 ff ff ff ff 00 00 ff ff ff ff 00 00 00 00
      000070 00 00 00 00 00 03 00 00 00 00 ff ff ff ff 00 00
      000080 ff ff ff ff 00 00 00 00 00 01 00 00 00 02 00 26
      000090 00 00 00 02 00 26 00 00 00 00 00 00 00 00 ff ff
      0000a0 ff ff 00 00 ff ff ff ff 00 00 00 00 00 02 aa ff
      0000b0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 00 00
      0000c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
      *
      00fff0 00 00 00 00 00 00 00 00 c5 21 ff 07 00 00 b3 d3
      

      The error is that the first 4 bytes are 0x1000000 big-endian instead of a copy of the checksum bytes that are at 0xfff8 near the end of the page.

      It is notable that during the test, this table should not be modified. The FIL_PAGE_LSN (in the above case 0xb3d3) is smaller than the LSN after the end of the bootstrap.

      I was able to reproduce the failure also by running exclusively the strict_crc32 combination, and always for the 64k page size. But I can’t seem to be able to reproduce it when invoking the following:

      while nice ./mtr --repeat=2 --parallel=80 innodb.doublewrite,strict_full_crc32{,,,}{,}{,}; do :; done
      

      The test is running on /dev/shm, and we could get close to exceeding the available RAM. There are a couple of kernel OOM kill messages in the sudo dmesg output, but nothing recently.

      Attachments

        Issue Links

          Activity

            The following just completed successfully on the same 10.11 revision:

            ./mtr --repeat=30 --parallel=80 innodb.doublewrite,strict_full_crc32{,,,}{,}{,}
            

            marko Marko Mäkelä added a comment - The following just completed successfully on the same 10.11 revision: . /mtr --repeat=30 --parallel=80 innodb.doublewrite,strict_full_crc32{,,,}{,}{,}

            For the record, the following did not fail either:

            ./mtr --repeat=30 --parallel=80 innodb.doublewrite,strict_crc32,64k{,,,}{,}{,}{,,,,}
            

            One hypothetical explanation of the failure would be some race condition around the serve bootstrap in mtr, when the test is being prepared for different innodb_page_size values.

            marko Marko Mäkelä added a comment - For the record, the following did not fail either: . /mtr --repeat=30 --parallel=80 innodb.doublewrite,strict_crc32,64k{,,,}{,}{,}{,,,,} One hypothetical explanation of the failure would be some race condition around the serve bootstrap in mtr , when the test is being prepared for different innodb_page_size values.

            I was not able to reproduce any failure with the 3 smaller page sizes (excluding 32k and 64k):

            diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test
            index d73009908db..8c5ed5d0a49 100644
            --- a/mysql-test/suite/innodb/t/doublewrite.test
            +++ b/mysql-test/suite/innodb/t/doublewrite.test
            @@ -3,7 +3,7 @@
             --echo # MDEV-32242 innodb.doublewrite test case always is skipped
             --echo #
             
            ---source include/innodb_page_size.inc
            +--source include/innodb_page_size_small.inc
             --source include/not_embedded.inc
             
             --disable_query_log
            

            ./mtr --repeat=30 --parallel=120 innodb.doublewrite{,,,,}{,}{,}
            

            marko Marko Mäkelä added a comment - I was not able to reproduce any failure with the 3 smaller page sizes (excluding 32k and 64k): diff --git a/mysql-test/suite/innodb/t/doublewrite.test b/mysql-test/suite/innodb/t/doublewrite.test index d73009908db..8c5ed5d0a49 100644 --- a/mysql-test/suite/innodb/t/doublewrite.test +++ b/mysql-test/suite/innodb/t/doublewrite.test @@ -3,7 +3,7 @@ --echo # MDEV-32242 innodb.doublewrite test case always is skipped --echo # ---source include/innodb_page_size.inc +--source include/innodb_page_size_small.inc --source include/not_embedded.inc --disable_query_log . /mtr --repeat=30 --parallel=120 innodb.doublewrite{,,,,}{,}{,}

            I am not able to reproduce this today, neither with the commit mentioned in the Description nor with the current 10.11.

            The difference is that today, I built the executable with GCC 13.2.0-13 (installed on January 30, about a week after I filed this bug) and I am running on Linux 6.6.13-amd64. Each time, I was running the test on /dev/shm.

            marko Marko Mäkelä added a comment - I am not able to reproduce this today, neither with the commit mentioned in the Description nor with the current 10.11. The difference is that today, I built the executable with GCC 13.2.0-13 (installed on January 30, about a week after I filed this bug) and I am running on Linux 6.6.13-amd64. Each time, I was running the test on /dev/shm .

            This test started to fail very reliably with the MDEV-29445 changes. I believe that this test is badly written. For example, it is assuming that during a failed recovery attempt, the server would not write anything via the doublewrite buffer.

            The original reported failure may have occurred due to a code bug that has been fixed during 2024.

            marko Marko Mäkelä added a comment - This test started to fail very reliably with the MDEV-29445 changes. I believe that this test is badly written. For example, it is assuming that during a failed recovery attempt, the server would not write anything via the doublewrite buffer. The original reported failure may have occurred due to a code bug that has been fixed during 2024.

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.