Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35334

Incorrect page checksum at the start of an .ibd file

Details

    Description

      We have recently upgraded MariaDB from 10.11.7 to 10.11.9 in all environments and have recently encountered some corruption errors in the tablespaces and indexes.

      Attachments

        1. corrupttable.png
          corrupttable.png
          653 kB
        2. data_flow.png
          data_flow.png
          136 kB
        3. error_25602_redacted_full.log
          113 kB
        4. error_25602.log
          7 kB
        5. error_25612_full.log
          276 kB
        6. non-compressed-page0.bin
          16 kB
        7. screenshot-1.png
          screenshot-1.png
          11 kB
        8. screenshot-2.png
          screenshot-2.png
          11 kB
        9. space109953page0.bin
          16 kB

        Issue Links

          Activity

            stephen.hames Stephen Hames added a comment - - edited

            Just an update on my side. After the kernel update (that resolved the hanging issue) things seem to get better for awhile, and we got through two weeks without any more of these issues.. We were monitoring. Now we've hit this issue again, four times in three days on two separate hosts. Interestingly, two incidents were copies, from approximately same point in time, affecting the same table at about the same time. Neither host have any errors logged. The symptom showed up when maria backup failed on the checksum check. Same 01 00 00 00 for the first four bytes of page 0.

            I recently modified the workflow to restart mariadb with fast shutdown disabled to reduce mariabackup prepare time.

            Prior to the service restart, we do a lot of bulk deletes. The victim table in this instance is one that receives a cascade delete.

            xan.charbonnet, all of our hosts are AWS ec2, both x86_64 and arm64 have shown this issue.

            Interesting observation: We have not seen this issue on our legacy CentOS 7 instances. Only on Debian 12 instances.

            Interesting observation #2: One of the victim tables on one of the replicas is a backup table that has not had any reads or writes via query in over two years. So am unsure why an otherwise unmodified table would also suffer page0 corruption.

            stephen.hames Stephen Hames added a comment - - edited Just an update on my side. After the kernel update (that resolved the hanging issue) things seem to get better for awhile, and we got through two weeks without any more of these issues.. We were monitoring. Now we've hit this issue again, four times in three days on two separate hosts. Interestingly, two incidents were copies, from approximately same point in time, affecting the same table at about the same time. Neither host have any errors logged. The symptom showed up when maria backup failed on the checksum check. Same 01 00 00 00 for the first four bytes of page 0. I recently modified the workflow to restart mariadb with fast shutdown disabled to reduce mariabackup prepare time. Prior to the service restart, we do a lot of bulk deletes. The victim table in this instance is one that receives a cascade delete. xan.charbonnet , all of our hosts are AWS ec2, both x86_64 and arm64 have shown this issue. Interesting observation: We have not seen this issue on our legacy CentOS 7 instances. Only on Debian 12 instances. Interesting observation #2: One of the victim tables on one of the replicas is a backup table that has not had any reads or writes via query in over two years. So am unsure why an otherwise unmodified table would also suffer page0 corruption.
            Spuij Jan-Willem added a comment - - edited

            @Stephen Hames. We have a modular application. When modules aren't used, tables don't have any rows. Yet they still experience corruption. We've had corruption on empty tables, that never contained any rows, and we are sure of it (because the module was never activated and authorised).

            Edit:

            I've included this screenshot that indicates a table that was last modified in 2020. This was the most recently corrupted table (2 days ago). Pretty sure that this table was never written to in more than four years. Only the first byte again was corrupted (01 00 00 00).

            Spuij Jan-Willem added a comment - - edited @Stephen Hames. We have a modular application. When modules aren't used, tables don't have any rows. Yet they still experience corruption. We've had corruption on empty tables, that never contained any rows, and we are sure of it (because the module was never activated and authorised). Edit: I've included this screenshot that indicates a table that was last modified in 2020. This was the most recently corrupted table (2 days ago). Pretty sure that this table was never written to in more than four years. Only the first byte again was corrupted (01 00 00 00).
            stephen.hames Stephen Hames added a comment -

            jan-willem, we see the same behaviour. Of interest, I added a service restart in one of my processes, and with that, I see corruption on a much more frequent basis. This makes me strongly suspicious that the corruption is happening at server shutdown.

            stephen.hames Stephen Hames added a comment - jan-willem , we see the same behaviour. Of interest, I added a service restart in one of my processes, and with that, I see corruption on a much more frequent basis. This makes me strongly suspicious that the corruption is happening at server shutdown.

            stephen.hames, does the problem actually happen on shutdown, or is it only becoming prominent during shutdown? I mean, could the pages that are corrupted have been written in corrupted form earlier during the mariadbd process lifetime? Have you tried checking the contents of the data files with od -Ax -t x1 -N 4 file.ibd while the server is running? If you force more frequent writes, for example, by SET GLOBAL innodb_max_dirty_pages_pct_lwm=0.01, would you also observe this corruption more frequently before server shutdown?

            marko Marko Mäkelä added a comment - stephen.hames , does the problem actually happen on shutdown, or is it only becoming prominent during shutdown? I mean, could the pages that are corrupted have been written in corrupted form earlier during the mariadbd process lifetime? Have you tried checking the contents of the data files with od -Ax -t x1 -N 4 file.ibd while the server is running? If you force more frequent writes, for example, by SET GLOBAL innodb_max_dirty_pages_pct_lwm=0.01 , would you also observe this corruption more frequently before server shutdown?
            stephen.hames Stephen Hames added a comment - - edited

            HI marko,

            I actually added that check into the workflow where this is most obvious, and I can see it consistently showing up as soon as the server is restarted...

            There is no delay in the steps below:
            check table files (no output means no corruption)
            restart mysql
            check table files again

            checking tables BEFORE restart of mysql
            executing: systemctl restart mysql
            checking tables AFTER restart of mysql
            /var/lib/mysql/database_name/victim_table.ibd: 01 00 00 00
            

            stephen.hames Stephen Hames added a comment - - edited HI marko , I actually added that check into the workflow where this is most obvious, and I can see it consistently showing up as soon as the server is restarted... There is no delay in the steps below: check table files (no output means no corruption) restart mysql check table files again checking tables BEFORE restart of mysql executing: systemctl restart mysql checking tables AFTER restart of mysql /var/lib/mysql/database_name/victim_table.ibd: 01 00 00 00

            People

              marko Marko Mäkelä
              prosenjit.banerjee@cloudpay.net Prosenjit Banerjee
              Votes:
              2 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.