MDEV-35334 (MariaDB Server)

Incorrect page checksum at the start of an .ibd file

Details

    Description

      We recently upgraded MariaDB from 10.11.7 to 10.11.9 in all environments and have encountered some corruption errors in tablespaces and indexes.

      Attachments

        1. error_25602_redacted_full.log (113 kB)
        2. error_25602.log (7 kB)
        3. error_25612_full.log (276 kB)
        4. non-compressed-page0.bin (16 kB)
        5. screenshot-1.png (11 kB)
        6. screenshot-2.png (11 kB)
        7. space109953page0.bin (16 kB)

          Activity

            thetaphi Uwe Schindler added a comment -

            Hi Marko,
            I may be able to recover the original file from the last snapshot of the VM, but the affected table was matomo_log_visits, which contains the data most relevant to data protection.

            In any case, the corruption may already have happened before the checksums at the beginning of the files went wrong.

            The databases are now clean, and after shutdown I checked all files with: ls *.ibd | xargs -L1 -t innochecksum (this originally reported corruption for 3 tables, all very large and all privacy-sensitive).
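The same per-file check can be scripted so that only the failing files are collected. The following is a minimal sketch, not part of the original report: the function name `check_tablespaces` and its `tool` parameter are hypothetical, and it assumes that innochecksum signals a problem with a non-zero exit status. As in the command above, the server should be shut down first.

```python
import glob
import subprocess


def check_tablespaces(pattern="*.ibd", tool=("innochecksum",)):
    """Run the checker on every file matching `pattern`.

    Returns the list of paths for which the checker exited non-zero.
    Assumption: a non-zero exit status means the file failed the check.
    """
    failed = []
    for path in sorted(glob.glob(pattern)):
        # One invocation per file, like `xargs -L1` in the shell pipeline.
        result = subprocess.run([*tool, path], capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(path)
            print(f"{path}: {result.stderr.strip()}")
    return failed
```

Calling `check_tablespaces("/var/lib/mysql/mydb/*.ibd")` (the path is only an example) would return the files that need a closer look.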

            The MediaWiki on the other server was clean, so it looks like the issues may have had other causes.

            I had an SQL-dump of the whole database created previously. And as this worked without errors, the optimize should not have caused issues.

            "I see that you are using MariaDB 10.11.8. I must warn you that the bug MDEV-34453 can cause corruption in any InnoDB data pages, but I would expect the page checksum to appear correct in those cases. Do you have a copy of the corrupted data file anywhere, such as a backup? I’d like to understand what the corruption would look like."

            Yes that's the Ubuntu LTS version of MariaDB. The problem with Ubuntu is that for LTS releases they never ever change the version number of any package and just patch bugs. The Debian changelog shows some bugfixes, but there were no recent updates of the package. So the bug could be there. I will maybe change to an official PPA. Thanks for the warning!

            thetaphi Uwe Schindler added a comment -

            FYI, I updated to the official MariaDB package repository as described in https://mariadb.com/kb/en/mariadb-package-repository-setup-and-usage/ ; the installed version is now 11.7.2-MariaDB-ubu2404-log.

            Many thanks for the help here. I was not able to help more on the bug report, but my observations may be helpful for others. Fixing the first byte alone does not help; my observation was that the checksum also has to be recalculated.


            For the tables that report checksum failures, it would be interesting to see the first 4 bytes of each corrupted page, to see if they match the pattern that had been observed on the first page of some corrupted files.
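One way to collect those bytes is sketched below. This is an illustration under stated assumptions, not a tool from this report: the function name `page_prefixes` is hypothetical, and the page size is assumed to be the default 16 KiB (it differs for other innodb_page_size settings and for compressed tables).

```python
PAGE_SIZE = 16384  # assumption: default InnoDB page size of 16 KiB


def page_prefixes(path, page_size=PAGE_SIZE, length=4):
    """Yield (page_number, first `length` bytes) for each page in the file."""
    with open(path, "rb") as f:
        page_no = 0
        while True:
            page = f.read(page_size)
            if not page:
                break  # end of file
            yield page_no, page[:length]
            page_no += 1
```

Printing `prefix.hex()` for each `(page_no, prefix)` pair from a copy of the data file gives a listing that can be compared against the pattern seen on page 0 of the attached files.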

            I have one more hypothesis regarding what could cause a corruption. In MDEV-24854 (MariaDB Server 10.6) we enabled the use of O_DIRECT access to InnoDB data files by default. In Linux, man 2 open mentions the following:

            O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes.

            The InnoDB buffer pool is a MAP_PRIVATE mapping. The built-in crash handler of MariaDB Server, which is enabled by default, attempts to create a stack trace of the current thread. As the first step, it would invoke fork(2), without waiting for any pending O_DIRECT writes to complete. I tracked down the history of this change to the following commit in https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/:

            commit 1847167b8b7d85a4d52acb86f4cb3755a4abcebd
            Author:     Nick Piggin 
            AuthorDate: Wed May 9 17:50:54 2012 +1200
            Commit:     Michael Kerrisk
            CommitDate: Wed May 9 19:18:43 2012 +1200
             
                open.2: Describe race of direct I/O and fork()
                
                Rework 04cd7f64, which didn't capture the details correctly.
                See the April/May 2012 linux-man@ mail thread "[PATCH]
                Describe race of direct read and fork for unaligned buffers"
                http://thread.gmane.org/gmane.linux.kernel.mm/77571
                
                Acked-by: KOSAKI Motohiro
                Cowritten-by: Jan Kara
                Cowritten-by: Hugh Dickins
                Signed-off-by: Michael Kerrisk
            

            I don’t know if there is an archive of that mailing list available. The scenario that was described in this change was an O_DIRECT read that would run concurrently with a fork(). It was claimed that the result of the read could be split between the parent and child processes. I would imagine that under this kind of a scenario, InnoDB would "do the right thing" and refuse access to a corrupted page.

            In MDEV-35886, stephen.hames reported that a hang of the server (due to a bug in a Debian maintained version of the Linux kernel) would lead to data corruption like this. xan@biblionix.com did a great job of tracking down that hang. I’m not at all familiar with the kernel internals, but I got concerned that we could get data corruption due to a race between an O_DIRECT asynchronous write and fork(2).

            The way I read the current Linux man 2 open, it would seem to be unsafe to invoke fork(2) in any multi-threaded program that may access files that have been opened with O_DIRECT. InnoDB is opening such files with the O_CLOEXEC flag, so one might assume that any race condition between O_DIRECT file access and fork(2) would be limited to the point of time where the execve(2) system call has not been invoked yet (and the memory mappings of the parent process have not been destroyed). The fork(2) call was introduced in MariaDB Server in March 2012 (2 months before the above mentioned documentation change) and revised in 2018. In any case, even if the built-in stack trace reporter weren’t behind this corruption, it is known to hang depending on when it is being triggered (MDEV-21010).

            marko Marko Mäkelä added the comment above.
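The point about O_CLOEXEC can be illustrated with a small sketch: a descriptor opened with that flag is still usable in a forked child until execve(2) runs, which is the window in which a race with fork(2) could matter. This is only a demonstration of the descriptor semantics, written for this thread; it does not use O_DIRECT, does not reproduce any corruption, and the function names are hypothetical.

```python
import os
import tempfile


def fd_visible_in_forked_child(fd):
    """Fork and report whether the child, before any exec, can still use fd."""
    pid = os.fork()
    if pid == 0:
        # Child: execve() has not been called, so close-on-exec has not fired.
        try:
            os.fstat(fd)
            os._exit(0)
        except OSError:
            os._exit(1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def demo():
    """Return (close_on_exec_is_set, fd_usable_in_child_before_exec)."""
    with tempfile.NamedTemporaryFile() as tmp:
        fd = os.open(tmp.name, os.O_RDONLY | os.O_CLOEXEC)
        try:
            close_on_exec = not os.get_inheritable(fd)
            return close_on_exec, fd_visible_in_forked_child(fd)
        finally:
            os.close(fd)
```

On Linux, `demo()` is expected to report that the flag is set and that the child can nevertheless still access the descriptor between fork and exec.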
            thetaphi Uwe Schindler added a comment -

            "For the tables that report checksum failures, it would be interesting to see the first 4 bytes of each corrupted page, to see if they match the pattern that had been observed on the first page of some corrupted files."

            That was not the case here, which is why it was not relevant.


            xan@biblionix.com Xan Charbonnet added a comment -

            I'm kind of surprised that I have never seen this issue at all. I'm running on amd64 bare-metal. Is having a virtualization layer a common element here?

            People

              marko Marko Mäkelä
              prosenjit.banerjee@cloudpay.net Prosenjit Banerjee