[MDEV-24260] mariabackup and innochecksum detects page faults but all ok in application Created: 2020-11-20  Updated: 2021-02-08  Resolved: 2021-02-08

Status: Closed
Project: MariaDB Server
Component/s: Admin statements, Backup, mariabackup, Platform RedHat, Server, Storage Engine - InnoDB
Affects Version/s: 10.5.5, 10.5.8
Fix Version/s: N/A

Type: Bug Priority: Minor
Reporter: Frank Olsen Assignee: Vladislav Lesin
Resolution: Incomplete Votes: 0
Labels: need_feedback, need_rr
Environment:

CentOS 7.8.2003, kernel 3.10.0-1127.19.1.el7.x86_64


Issue Links:
Duplicate
duplicates MDEV-21109 Table corruption not detected with CH... Closed

 Description   

Hi,

mariabackup fails on multiple tables with: Error: failed to read page after 10 retries. File <fname>.ibd seems to be corrupted.
innochecksum finds errors on the same page, in most cases page 5.
I managed to make the problem disappear (using OPTIMIZE TABLE for each affected table), but to check I then stopped the instance and ran innochecksum for all tables, which again found corrupted pages. After restarting the instance, mariabackup fails again.
No problem selecting data for all of the tables concerned.
CHECK TABLE .. EXTENDED: no problem.
Problem started on 10.5.5 (binary installation). Two instances on the server.
Exported database and imported on the other instance: no issue, but still mariabackup fails and innochecksum as well.
Imported on an instance on another server (same OS): still same issue.
Upgraded MariaDB to 10.5.8 on the other server: mysql_upgrade ran OK finding no errors.

What did work was to import the dump on my Windows 10 Portable computer on 10.5.8.

UPDATE: imported on a Centos 7 VM with same redhat-release/kernel: so, I guess this is not a bug but rather something bad in the configuration of the other two VMs. Main difference is that there are two instances. Or maybe some specific configuration that makes it fail.

Tried different settings for innodb_checksum_algorithm to no avail.
Tried increasing nofile to 100000 soft and 200000 hard and 200000 in mariadb@server multi.conf.

Did not see any message regarding corruption in error log.

Seems quite similar to MDEV-21109, with the major difference being that both innochecksum and mariabackup agree on the corruption. That bug mentions "ALTER TABLE <tname> FORCE", which I also tried and which worked in some cases but not all (with or without export/import of the table). But when I restart the instance the corruption reappears.

Happens even on empty tables.

UPDATES:

  • import database on another VM same Linux version, installed default. No corruption.
  • on the original "corrupt" database I had to copy the instance to another file system, drop the database, and import

Best regards



 Comments   
Comment by Vladislav Lesin [ 2020-11-30 ]

The distinguishing characteristic of MDEV-21109 is that the corrupted pages are not allocated in the tablespace, which is why "CHECK TABLE" can't find the corruption. In the current issue "CHECK TABLE" does not see the problem while innochecksum and mariabackup can detect it. So it looks very similar to MDEV-21109. To be sure, we need to analyse the .ibd file. The reporter said that in most cases the corrupted page number is 5. 6tasticMDB, could you please provide us with the first 5 (or n, where n is the number of the corrupted page) pages of the .ibd file? You can use the "dd" command in Linux to cut off the rest of the table.
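
For reference, the truncation requested above can also be sketched in Python instead of "dd". This is only an illustration, assuming the default 16 KiB page size; the function name copy_first_pages is hypothetical. Note that InnoDB page numbers start at 0, so a copy that should include page 5 needs 6 pages, not 5.

```python
PAGE_SIZE = 16384  # assumption: default innodb_page_size (16 KiB)


def copy_first_pages(src_path, dst_path, n_pages, page_size=PAGE_SIZE):
    """Copy the first n_pages pages of src_path to dst_path.

    Returns the number of bytes written. Pages are numbered from 0,
    so n_pages=6 covers pages 0 through 5.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        data = src.read(n_pages * page_size)
        dst.write(data)
    return len(data)
```

With hypothetical file names, copy_first_pages("BAD.ibd", "BAD_first6.ibd", 6) would keep pages 0 through 5. The copy should be taken while the server is stopped.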

Comment by Frank Olsen [ 2020-11-30 ]

Hi,
Thanks for your investigation.
Cutting out the first 5 pages removed the corruption:
dd if=BAD.ibd of=BAD_first5.ibd bs=16384 count=5
innochecksum: no error (-i show pages 0 through 4)
file content (strings):
Dinfimum
supremum
ENGLAND
FRANCE
infimum
supremum
ENGLAND
FRANCE

If I add another page I also get:
$infimum
supremum
<valid data for some other table in the database>

And now innochecksum complains about page 5 again.

Best regards,
Frank

Comment by Frank Olsen [ 2020-11-30 ]

Just tried importing the now OK database that was originally corrupted, on another server.
After import lots of .ibd files are corrupted. Again with correct data at the start of the file and data from some other table.
I already tried on the same server/instance to import the dump from the corrupted database and had the same symptom.
=> Again what worked was to move it to another file system and do the import.
=> innochecksum says for the same table 6 pages but strings on the file is the same as for the BAD_first5.ibd (there is no $infimum
supremum, etc.)

Comment by Frank Olsen [ 2020-12-02 ]

Hi,

Another day another test.
Just to be sure on a different server from the original one I redid the import of the now GOOD database: again lots of corrupted files.
So next did:

  • save datadir in a TAR.GZ
  • umount FS, mkfs, mount
  • restore TAR.GZ
    => redid import: this time no corruption

Best regards,
Frank

Comment by Frank Olsen [ 2020-12-02 ]

Next test:
On the other server where I did the test an hour ago I restored the originally corrupted database.
(Recreated the file system before the restore.)
Yes innochecksum still fails on the same 5 tables (that do contain data).
However, the sequence from MDEV-21109 made it possible to repair the tables without any need for export/import, just:
set OLD_ALTER_TABLE=1
Alter table table_name engine=InnoDB
Alter table table_name FORCE
mariabackup and innochecksum no more errors.

There were no errors in the mysql_error.log on the original instance to explain any corruption.
From application point of view no errors (I don't manage the application but when I asked they did says there had been issues).

A comparison of one of the corrupted .ibd files showed that after the FORCE rebuild the following two lines went away at the end of the file:
infimum
supremum

Meaning after FORCE I have:
infimum
supremum
<table data>
infimum
supremum5
?<index data (I guess)>

So to summarize/conclude at the moment:

  • How did the 5 tables get corrupted to start with?
  • Were they really corrupted? select * no problem. Exported without any issue. Nothing in MySQL logs. No application error.
  • Import on another instance on same server and also on another server: corruptions but not on the same tables. Corruptions even on empty tables in some cases.
    Repeat import after clean and mkfs the file system: no more corruptions after import

It seems that there is some underlying systems issue at either Linux (Centos 7) or ESX level to explain the corruption. Not sure what though.

Best regards,
Frank

Comment by Vladislav Lesin [ 2020-12-09 ]

6tasticMDB, you wrote:

> If I add another page I also get:
> $infimum
> supremum
> <valid data for some other table in the database>
>
> And now innochecksum complains about page 5 again.

Yes, this looks very similar to MDEV-21109, we had the same symptom there.

> Again what worked was to move it to another file system and do the import.

What does "to move it to another file system" mean? As I understand it, you stopped the server, copied the data directory to another file system, started the server, and then imported data from some mysqldump file? Is that correct?

> There were no errors in the mysql_error.log on the original instance to explain any corruption.

That means the corrupted page is not reachable from the root of the B-tree, the same thing we saw in MDEV-21109.

> Were they really corrupted? select * no problem. Exported without any issue. Nothing in MySQL logs. No application error.

For MDEV-21109 the "corrupted" pages were not allocated in the tablespace. Such pages must be zero-filled, but for some unknown reason (we have not found the root cause yet) such pages contain data from other tables. When a page is read, it should pass the corruption test. During this test the page and tablespace ids are also checked (if the page is not zero-filled). For such "corrupted" pages those ids are incorrect, which is why innochecksum and mariabackup complain. When data is exported with mysqldump, such pages are not read, as they are not allocated and not reachable from the root of the B-tree, which is why such corruption stays hidden from the server (and CHECK TABLE). But innochecksum and mariabackup read all pages sequentially, which is why they can detect the corruption.
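
The id check described above can be illustrated with a small Python sketch. It is not what innochecksum or mariabackup actually run: it verifies no checksums, does not know which pages are allocated, and assumes an uncompressed, unencrypted tablespace with the default 16 KiB page size. The function name find_suspect_pages is hypothetical; the header offsets (page number at byte 4, tablespace id at byte 34, both big-endian) follow the standard InnoDB page header layout.

```python
import struct

PAGE_SIZE = 16384        # assumption: default innodb_page_size
FIL_PAGE_OFFSET = 4      # page number stored in the page header (4 bytes)
FIL_PAGE_SPACE_ID = 34   # tablespace id stored in the page header (4 bytes)


def find_suspect_pages(path, page_size=PAGE_SIZE):
    """Read the file page by page and return a list of
    (position, stored_page_no, stored_space_id) for every non-zero page
    whose stored ids disagree with its position in the file or with the
    tablespace id of the first non-empty page (normally page 0)."""
    suspects = []
    expected_space = None
    with open(path, "rb") as f:
        position = 0
        while True:
            page = f.read(page_size)
            if len(page) < page_size:
                break  # end of file; a trailing partial page is ignored
            if any(page):  # zero-filled pages are skipped
                (page_no,) = struct.unpack_from(">I", page, FIL_PAGE_OFFSET)
                (space_id,) = struct.unpack_from(">I", page, FIL_PAGE_SPACE_ID)
                if expected_space is None:
                    expected_space = space_id
                if page_no != position or space_id != expected_space:
                    suspects.append((position, page_no, space_id))
            position += 1
    return suspects
```

A page holding data from another table would normally show up here with a foreign tablespace id, a page number that does not match its position, or both.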

As we don't understand the root cause, we decided to add the ability to continue the backup if a corrupted page is reached (see MDEV-22929). We can't detect whether a corrupted page is allocated in the tablespace during backup, but we can do this during prepare. So if there were corrupted pages during backup and those pages are not allocated in the tablespace, they will be healed.

> Import on another instance on same server and also on another server: corruptions but not on the same tables. Corruptions even on empty tables in some cases.

This could help us to find the root cause. We would very much appreciate it if you would agree to run the import under rr and then let us analyse the rr traces in your environment (unfortunately, rr traces in most cases cannot be replayed in another environment; besides, rr does not work well on virtual machines). This would let us debug the process of page corruption during import.

Comment by Julien Fritsch [ 2021-01-07 ]

6tasticMDB, we currently consider this bug more a duplicate of MDEV-21109 than blocking it. If you think this is blocking, please let us know why.
As vlad.lesin asked in his latest comment, your help would be really appreciated if you could provide us with the rr results.

Generated at Thu Feb 08 09:28:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.