MariaDB Server / MDEV-24260

mariabackup and innochecksum detect page faults but all OK in application

Details

    Description

      Hi,

      mariabackup fails on multiple tables with: Error: failed to read page after 10 retries. File <fname>.ibd seems to be corrupted.
      innochecksum finds errors on the same page, in most cases page 5.
      I managed to make the problem disappear (using OPTIMIZE TABLE for each affected table), but to verify I then stopped the instance and ran innochecksum on all tables, which again found corrupted pages. After restarting the instance, mariabackup fails again.
      No problem selecting data from any of the tables concerned.
      CHECK TABLE .. EXTENDED: no problem.
      The problem started on 10.5.5 (binary installation). There are two instances on the server.
      Exported the database and imported it on the other instance: the import reported no issue, but mariabackup and innochecksum still fail.
      Imported on an instance on another server (same OS): still the same issue.
      Upgraded MariaDB to 10.5.8 on the other server: mysql_upgrade ran OK and found no errors.

      What did work was to import the dump on my Windows 10 Portable computer on 10.5.8.

      UPDATE: imported on a CentOS 7 VM with the same redhat-release/kernel without any corruption, so I guess this is not a bug but rather something bad in the configuration of the other two VMs. The main difference is that those VMs run two instances. Or maybe some specific configuration makes it fail.

      Tried different settings for innodb_checksum_algorithm, to no avail.
      Tried increasing nofile to 100000 soft and 200000 hard, and to 200000 in mariadb@server multi.conf.

      Did not see any message regarding corruption in error log.

      This seems quite similar to MDEV-21109, the major difference being that both innochecksum and mariabackup agree on the corruption. That bug mentions "ALTER TABLE <tname> FORCE", which I also tried and which worked in some cases but not all (with or without export/import of the table). But when I restart the instance, the corruption reappears.

      Happens even on empty tables.

      UPDATES:

      • Imported the database on another VM with the same Linux version and a default installation. No corruption.
      • On the original "corrupt" database I had to copy the instance to another file system, drop the database, and import.

      Best regards

          Activity

            vlad.lesin Vladislav Lesin added a comment -

            The distinguishing characteristic of MDEV-21109 is that the corrupted pages are not allocated in the tablespace, which is why "CHECK TABLE" can't find the corruption. In the current issue "CHECK TABLE" does not see the problem either, while innochecksum and mariabackup can detect it. So it looks very similar to MDEV-21109. To be sure, we need to analyse the .ibd file. The reporter said that in most cases the corrupted page number is 5. 6tasticMDB, could you please provide us with the first 5 (or n, where n is the number of the corrupted page) pages of the .ibd file? You can use the "dd" command in Linux to cut off the rest of the table.
            6tasticMDB Frank Olsen added a comment -

            Hi,
            Thanks for your investigation.
            Cutting out the first 5 pages removed the corruption:
            dd if=BAD.ibd of=BAD_first5.ibd bs=16384 count=5
            innochecksum: no error (-i shows pages 0 through 4)
            File content (strings):
            Dinfimum
            supremum
            ENGLAND
            FRANCE
            infimum
            supremum
            ENGLAND
            FRANCE

            If I add another page I also get:
            $infimum
            supremum
            <valid data for some other table in the database>

            And now innochecksum complains about page 5 again.

            Best regards,
            Frank
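
For readers who want to repeat this inspection without relying on `strings`, the following Python sketch (not part of the original report) lists the page number, space id, and page type stored in the header of each page, so a page that carries another table's ids stands out. It assumes an uncompressed, unencrypted file-per-table tablespace with the default 16 KiB page size (matching bs=16384 above); the function name `read_page_headers` is illustrative only.

```python
import struct

PAGE_SIZE = 16384  # assumption: default innodb_page_size, same as bs=16384 above


def read_page_headers(path, page_size=PAGE_SIZE):
    """Yield (position, stored_page_no, space_id, page_type) for each full page."""
    with open(path, "rb") as f:
        position = 0
        while True:
            page = f.read(page_size)
            if len(page) < page_size:
                break  # end of file (or a trailing partial page)
            # InnoDB FIL header fields are big-endian:
            # page number at byte 4, page type at byte 24, space id at byte 34.
            (page_no,) = struct.unpack_from(">I", page, 4)
            (page_type,) = struct.unpack_from(">H", page, 24)
            (space_id,) = struct.unpack_from(">I", page, 34)
            yield position, page_no, space_id, page_type
            position += 1
```

Calling `list(read_page_headers("BAD.ibd"))` on a file like the one above would be expected to show a differing space id or page number at position 5, if the page really holds another table's data.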

            6tasticMDB Frank Olsen added a comment -

            I just tried importing the now-OK database (the one that was originally corrupted) on another server.
            After the import, lots of .ibd files are corrupted, again with correct data at the start of the file followed by data from some other table.
            I had already tried importing the dump from the corrupted database on the same server/instance and saw the same symptom.
            => Again, what worked was to move it to another file system and do the import.
            => For the same table innochecksum now reports 6 pages, but the strings output for the file is the same as for BAD_first5.ibd (there is no $infimum, supremum, etc.)

            6tasticMDB Frank Olsen added a comment -

            Hi,

            Another day another test.
            Just to be sure, on a different server from the original one, I redid the import of the now-GOOD database: again lots of corrupted files.
            So next I did:

            • save datadir in a TAR.GZ
            • umount FS, mkfs, mount
            • restore TAR.GZ

            => Redid the import: this time no corruption.

            Best regards,
            Frank

            6tasticMDB Frank Olsen added a comment -

            Next test:
            On the other server, where I did the test an hour ago, I restored the originally corrupted database.
            (I recreated the file system before the restore.)
            Yes, innochecksum still fails on the same 5 tables (that do contain data).
            However, the sequence from MDEV-21109 made it possible to repair the tables without any need for export/import, just:
            SET OLD_ALTER_TABLE=1;
            ALTER TABLE table_name ENGINE=InnoDB;
            ALTER TABLE table_name FORCE;
            After that, mariabackup and innochecksum report no more errors.
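
When several tables are affected, the three statements above have to be repeated per table. A minimal Python sketch of generating them is shown here; it only builds the SQL text from the sequence quoted above and does not connect to a server. The function name `repair_statements` and the table names in the usage note are hypothetical.

```python
def repair_statements(table_names):
    """Return the rebuild sequence quoted above for each given table name."""
    statements = ["SET OLD_ALTER_TABLE=1;"]
    for name in table_names:
        # Quote the identifier with backticks, doubling any embedded backtick.
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append(f"ALTER TABLE {quoted} ENGINE=InnoDB;")
        statements.append(f"ALTER TABLE {quoted} FORCE;")
    return statements
```

For example, `print("\n".join(repair_statements(["country", "city"])))` prints one SET statement followed by two ALTER statements per table, which can then be reviewed before being run in a client session.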

            There were no errors in the mysql_error.log on the original instance to explain any corruption.
            From the application's point of view there were no errors (I don't manage the application, but when I asked they did say there had been issues).

            A comparison of one of the corrupted .ibd files showed that after the FORCE rebuild the following two lines went away at the end of the file:
            infimum
            supremum

            Meaning after FORCE I have:
            infimum
            supremum
            <table data>
            infimum
            supremum5
            ?<index data (I guess)>

            So to summarize/conclude at the moment:

            • How did the 5 tables get corrupted to start with?
            • Were they really corrupted? select * no problem. Exported without any issue. Nothing in MySQL logs. No application error.
            • Import on another instance on the same server and also on another server: corruptions, but not on the same tables. Corruptions even on empty tables in some cases.
              Repeating the import after cleaning the file system and running mkfs on it: no more corruptions after import.

            It seems that some underlying system issue at either the Linux (CentOS 7) or the ESX level explains the corruption. I am not sure what, though.

            Best regards,
            Frank


            vlad.lesin Vladislav Lesin added a comment -

            6tasticMDB, you wrote:

            > If I add another page I also get:
            > $infimum
            > supremum
            > <valid data for some other table in the database>
            >
            > And now innochecksum complains about page 5 again.

            Yes, this looks very similar to MDEV-21109, we had the same symptom there.

            > Again what worked was to move it to another file system and do the import.

            What does "to move it to another file system" mean? As I understand it, you stopped the server, copied the data directory to another file system, started the server, and then imported data from some mysqldump file. Is that correct?

            > There were no errors in the mysql_error.log on the original instance to explain any corruption.

            That means the corrupted page is not reachable from the root of the B-tree, the same thing we saw in MDEV-21109.

            > Where they really corrupted? select * no problem. Exported without any issue. Nothing in MySQL logs. No application error.

            For MDEV-21109 the "corrupted" pages were not allocated in the tablespace. Such pages must be zero-filled, but for some unknown reason (we have not found the root cause yet) such pages contain data from other tables. When a page is read, it should pass the corruption test. During this test the page and tablespace ids are also checked (if the page is not zero-filled). For such "corrupted" pages those ids are incorrect, which is why innochecksum and mariabackup complain. When data is exported with mysqldump, such pages are not read, as they are not allocated and not reachable from the root of the B-tree, which is why such corruption stays hidden from the server (and from CHECK TABLE). But innochecksum and mariabackup read all pages sequentially, which is why they can detect the corruption.

            As we don't understand the root cause, we decided to add the ability to continue the backup if a corrupted page is reached (see MDEV-22929). We can't detect during backup whether a corrupted page is allocated in the tablespace or not, but we can do this during prepare. So if there were corrupted pages during backup and those pages are not allocated in the tablespace, they will be healed.

            > Import on another instance on same server and also on another server: corruptions but not on the same tables. Corruptions even on empty tables in some cases.

            This could help us to find the root cause. We would very much appreciate it if you would agree to run the import under rr, and then let us analyse the rr traces on your environment (unfortunately, rr traces in most cases cannot be replayed on another environment, and besides, rr does not work well on virtual machines). This would let us debug the process of page corruption during import.
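
The id check described in this comment can be approximated outside the server. The Python sketch below is only an illustration under stated assumptions, not the code used by innochecksum or mariabackup: it reads the file sequentially, skips zero-filled pages, and flags any page whose stored page number does not match its position or whose space id differs from that of the first non-zero page (normally page 0). It assumes an uncompressed, unencrypted file-per-table tablespace with 16 KiB pages, verifies no checksums, and cannot tell whether a page is allocated. The name `find_suspect_pages` is hypothetical.

```python
import struct

PAGE_SIZE = 16384  # assumption: default innodb_page_size


def find_suspect_pages(path, page_size=PAGE_SIZE):
    """Return [(position, stored_page_no, stored_space_id)] for mismatching pages."""
    zero_page = bytes(page_size)
    expected_space_id = None
    suspects = []
    with open(path, "rb") as f:
        position = 0
        while True:
            page = f.read(page_size)
            if len(page) < page_size:
                break
            if page != zero_page:  # zero-filled pages are skipped
                # Big-endian FIL header fields: page number at 4, space id at 34.
                (page_no,) = struct.unpack_from(">I", page, 4)
                (space_id,) = struct.unpack_from(">I", page, 34)
                if expected_space_id is None:
                    # Assumption: the first non-zero page carries the right id.
                    expected_space_id = space_id
                if page_no != position or space_id != expected_space_id:
                    suspects.append((position, page_no, space_id))
            position += 1
    return suspects
```

A sequential scan like this sees every page in the file, including pages that are not reachable from any B-tree root, which is the reason given above for why the tools notice what SELECT and CHECK TABLE do not.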

            julien.fritsch Julien Fritsch added a comment -

            6tasticMDB, we currently consider this bug more as a duplicate of MDEV-21109 than as blocking it. If you think this is blocking, please let us know why.
            As vlad.lesin asked in his latest comment, your help would be really appreciated if you could provide us with the rr results.

            People

              vlad.lesin Vladislav Lesin
              6tasticMDB Frank Olsen