[MDEV-31502] mariabackup: Failed to validate first page of the file error 39 Created: 2023-06-19 Updated: 2023-12-07 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera, Storage Engine - InnoDB |
| Affects Version/s: | 10.4.27 |
| Fix Version/s: | 10.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Yakov Kushnirsky | Assignee: | Julius Goryavsky |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Description |
|
In a Galera environment with wsrep_sst_method=mariabackup, the backup failed with [00] FATAL ERROR: 2023-06-07 22:01:34 Failed to validate first page of the file dbname/tablename, error 39 on all 3 nodes of the cluster, after the database was upgraded from 10.4.24 to 10.4.27. |
| Comments |
| Comment by Marko Mäkelä [ 2023-06-26 ] |
|
Error 39 is DB_CORRUPTION. The message could simply mean that the checksum of the first page of the file is incorrect. What would innochecksum report on the file? I would expect mariadb-backup to fail if the checksum of the first page of any data file is incorrect. Could it be that the SST script ignored the error? |
| Comment by Marko Mäkelä [ 2023-06-26 ] |
|
We know that the file is corrupted, but we would like to know how it became corrupted in the first place. Was this the first time that SST was being attempted? Could the file have been produced by prior use of wsrep_sst_method=rsync? |
| Comment by Marko Mäkelä [ 2023-06-26 ] |
|
One more message that had been displayed was this one in Datafile::validate_first_page():
The first page of the 128KiB .ibd file, which contains some metadata and a page allocation bitmap for the first innodb_page_size pages, differs between two copies of the file as follows:
For some reason, the first 32 bytes of the file were overwritten by something that looks like garbage. The 32-bit page number 0 would be stored at offset 4. The tablespace identifier at offsets 34 and 38 is 0x023a in both files. At offset FIL_PAGE_LSN (0x10) we have the 64-bit log sequence number of the page. It is 0x07a3e684b9 in the correct file, and some garbage in the corrupted file.

At the end of the correct file, before the 32-bit checksum, we have the 32 least significant bits of FIL_PAGE_LSN, that is, 0xa3e684b9. In the corrupted file, those bytes are 0xab7f1c69. Assuming that the corrupted file is newer, its correct LSN must be 0x07ab7f1c69 or more. In any case, the FIL_PAGE_LSN at the start of the corrupted file, 0x74c6f67951cab3f8a7, does not match the LSN at the end of the page.

Something has corrupted the file. Theoretically, it could be anything that has write access to the file system or the block device. I think it is unlikely that the page was corrupted in RAM while it was in the buffer pool of an InnoDB server or mariadb-backup, because right before writing a page to disk, InnoDB copies the least significant bits of the LSN to the end of the page and computes the page checksum. I would tend to blame the hardware for this. But which hardware would use a 32-byte buffer size? In many processor caches, and I suppose in SDRAM transactions, the block size is 64 bytes. |
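The consistency check described in this comment can be sketched in a few lines of Python. This is only an illustration, not InnoDB code: the header offsets (4, 16, 34, 38) are the ones quoted above, the trailer layout (low 32 bits of the LSN at 8 bytes from the end, checksum in the last 4 bytes) is assumed from the comment's description, and `inspect_first_page` and `make_synthetic_page` are hypothetical helper names. The checksum itself is not verified here.

```python
import struct

# Offsets quoted in the comment above (all fields are big-endian).
FIL_PAGE_OFFSET = 4       # 32-bit page number
FIL_PAGE_LSN = 16         # 64-bit log sequence number (0x10)
FIL_PAGE_SPACE_ID = 34    # 32-bit tablespace ID in the page header
FSP_SPACE_ID = 38         # second copy of the tablespace ID on page 0


def inspect_first_page(page: bytes) -> dict:
    """Decode the fields discussed above from the first page of an .ibd file."""
    (page_no,) = struct.unpack_from(">I", page, FIL_PAGE_OFFSET)
    (lsn,) = struct.unpack_from(">Q", page, FIL_PAGE_LSN)
    (space_hdr,) = struct.unpack_from(">I", page, FIL_PAGE_SPACE_ID)
    (space_fsp,) = struct.unpack_from(">I", page, FSP_SPACE_ID)
    # Assumption: the low 32 bits of the LSN sit just before a 4-byte checksum.
    (lsn_low,) = struct.unpack_from(">I", page, len(page) - 8)
    return {
        "page_no": page_no,
        "lsn": lsn,
        "space_ids_match": space_hdr == space_fsp,
        "lsn_matches_trailer": (lsn & 0xFFFFFFFF) == lsn_low,
    }


def make_synthetic_page(page_size: int = 16384) -> bytearray:
    """Build a zero-filled demo page using the values quoted for the good file."""
    page = bytearray(page_size)
    struct.pack_into(">Q", page, FIL_PAGE_LSN, 0x07A3E684B9)
    struct.pack_into(">I", page, FIL_PAGE_SPACE_ID, 0x023A)
    struct.pack_into(">I", page, FSP_SPACE_ID, 0x023A)
    struct.pack_into(">I", page, page_size - 8, 0xA3E684B9)
    return page


good = make_synthetic_page()
bad = make_synthetic_page()
bad[0:32] = bytes(range(1, 33))  # simulate the clobbered first 32 bytes

print(inspect_first_page(good))
print(inspect_first_page(bad))
```

On a real file one would read the first innodb_page_size bytes and pass them to `inspect_first_page`; a mismatch between the header LSN and the trailer is the inconsistency described above. A clobbered 32-byte prefix leaves the tablespace IDs at offsets 34 and 38 intact, which agrees with the observation that they are identical in both files.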
| Comment by Andrew Garner [ 2023-12-07 ] |
|
We observed a similarly weird data corruption bug that manifested slightly differently in our environment: the first 24 bytes of an .ibd file were clobbered. We could reproduce this by running SST in a loop. It can be tricky to hit the right timing to expose this, and some environments would never fail this way. I think the hint here is the initial byte sequence "15 03 03 00 1a ... 26 more bytes ...". This looks like a TLS (v1.2) alert record, probably the "close notify" message for a connection. If TLSv1.3 were in use there, we might see "17 03 03 00 13 .. 19 more bytes .." (24 bytes) written instead, which is what we observed in our environment. This sequence of events may be occurring:
It seems this bug may have been around for a while, but this unfortunate sequence of events is rather rare. |
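The reading of those leading bytes as a TLS record header can be checked with a short Python sketch. It decodes only the standard 5-byte record header (content type, legacy version, payload length) and makes no attempt to decrypt anything; `parse_tls_record_header` is a hypothetical helper name, and the two inputs are the prefixes quoted in the comment above.

```python
import struct

# TLS record content types as defined in the TLS specifications.
CONTENT_TYPES = {
    20: "change_cipher_spec",
    21: "alert",
    22: "handshake",
    23: "application_data",
}


def parse_tls_record_header(prefix: bytes) -> dict:
    """Decode the 5-byte TLS record header found at the start of a buffer."""
    if len(prefix) < 5:
        raise ValueError("need at least 5 bytes for a TLS record header")
    # 1-byte content type, 2-byte version, 2-byte payload length (big-endian).
    content_type, version, length = struct.unpack_from(">BHH", prefix, 0)
    return {
        "content_type": CONTENT_TYPES.get(content_type, "unknown"),
        "legacy_version": hex(version),
        "payload_length": length,
        "record_length": 5 + length,  # header plus payload
    }


# The two prefixes quoted in the comment above.
print(parse_tls_record_header(bytes.fromhex("150303001a")))
print(parse_tls_record_header(bytes.fromhex("1703030013")))
```

The first prefix decodes as an alert record with a 26-byte payload (31 bytes in total), and the second as an application_data record with a 19-byte payload (24 bytes in total), matching the 24 clobbered bytes reported here. TLS 1.3 sends encrypted alerts under the outer type application_data, and assuming an AEAD cipher with a 16-byte tag, 19 bytes is what a 2-byte alert plus a 1-byte inner content type plus the tag would occupy.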