[MDEV-29364] CRITICAL - MariaDB 10.8.4 creating corrupted backups - "InnoDB: Failed to read page 4 from file 'database/table.ibd': Page read from tablespace is corrupted." Created: 2022-08-23 Updated: 2023-01-12 Resolved: 2022-10-07 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Backup |
| Affects Version/s: | None |
| Fix Version/s: | 10.10.1, 10.6.10, 10.7.6, 10.8.5, 10.9.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Nuno | Assignee: | Marko Mäkelä |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
| Description |
|
Hey, --prepare fails to read pretty much every (?) table, with many of these logs:
or
and eventually crash with:
- EDIT - Ignore the below bit. This isn't an issue with using HDD, as the same failures are happening on SSD. -
From original message (no longer relevant): I assume this might be related to this config I have in prod (which has NVMe): innodb_page_size = 16384
Once I have a page size like this, can I no longer go back to HDD? Also, is it dangerous to use HDD as the "backup storage"? Could storing the backups on these disks corrupt them, or is the backup good and correct as long as I don't prepare it there and move it back to an NVMe before I prepare?
| Comments |
| Comment by Nuno [ 2022-08-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm trying to understand this...
Prepare backup from the 13th (MariaDB 10.8.3, fsync) -> works
Prepare backup from the 16th (MariaDB 10.8.4, 2 hours after upgrade, had table renames in the meantime, fsync, innodb_log_file_buffering ON) -> works
Prepare backup from the 17th (MariaDB 10.8.4, fsync, innodb_log_file_buffering ON) -> works
Prepare backup from the 18th (MariaDB 10.8.4, O_DIRECT, innodb_log_file_buffering ON) -> works
19th - I don't have backup saved - I was trying manual experiments (
Prepare backup from the 20th (MariaDB 10.8.4, O_DIRECT, innodb_log_file_buffering OFF, I think) -> works, but fails later due to
Prepare backup from the 21st (MariaDB 10.8.4, O_DIRECT, innodb_log_file_buffering ON) -> fails/corrupted
Prepare backup from the 22nd (MariaDB 10.8.4, O_DIRECT, innodb_log_file_buffering ON) -> fails/corrupted
Prepare backup from the 23rd (MariaDB 10.8.4, O_DIRECT, innodb_log_file_buffering ON) -> fails/corrupted
Still not sure what it can be... It doesn't seem to be the combination of "O_DIRECT, innodb_log_file_buffering ON", because the backup of the 18th works. Unless the fact that MariaDB was restarted recently was what helped. Last restart was on the 17th.
| Comment by Nuno [ 2022-08-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I turned innodb_log_file_buffering = OFF (online), created a new backup, and it crashed again preparing, for the same corruption issue. — I sent the backup taken earlier today, to the SSD server (instead of HDD), and it also crashed for corruption.
So, basically the backups being produced by MariaDB are simply broken... and taking new backups isn't resolving the problem. I sent over the latest backup I created, to the SSD server, and tried again. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nuno [ 2022-08-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko - looks like MariaBackup is creating corrupted backups, an even worse issue than
Tomorrow I'll test restarting MariaDB and creating a fresh backup, to see if that gives me a valid backup. I wonder why others aren't complaining about these issues... not sure if it's because not many are using MariaDB 10.8 yet, or because they are just not testing their backups.
| Comment by Nuno [ 2022-08-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I went ahead, restarted the database, and prepared again (on the SSD server). Seems it went OK. However, I've just noticed that SSD server is using 10.8.3.
I then tried sending to the HDD server and it worked too:
Still very concerned that the rest of the days the backups were corrupt... | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
nunop, I’m sad to read that backups are not working reliably for you. We do test the backup component, but it is mostly with small data sets. Recently, I implemented a log record of additional page checksums (
Note: It will slow down both the server and anything that processes the log (especially crash recovery and preparing backups).
The only recovery bug that I am currently aware of as affecting the 10.8.4 release is
I am afraid that we will need a reproducible test case before we are able to do anything. I have the feeling that the inter-process communication via the file system is a fragile design, and maybe even more so when the file system cache is disabled. We really should implement backup in the server (MDEV-14992), not only because of reliability but also performance (
Which format were the affected data files created in? Were they created before innodb_checksum_algorithm=full_crc32 (
| Comment by Nuno [ 2022-08-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi marko Thank you for your reply.
MDEV-14992 - I was trying to "prepare" on the actual production server, but it fails right away with "srv_start()" or something similar (I don't have the log with me now, but it didn't give much detail about the failure, just "srv_start()"). I assume that is because "mariadb" is already running, and "prepare" likely starts a new process or so. I can consider running a Docker container that prepares the backup before sending it over to the separate server, though.
innodb_checksum_algorithm=full_crc32 (
| Comment by Nuno [ 2022-08-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Strange... Just ran "mariabackup --prepare" on the main production server and it prepared the backup without any problem, this time... so I'm not sure why it was crashing right away yesterday - but it wasn't giving me any explanation, so I assumed it was because "mariadb" was already running. Anyway - today's backup was successfully prepared on both servers, so that's good... Also, surprisingly, even though there were many table swaps too (Renaming ...new to ...ibd), | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
nunop, the code to "apply" (validate) the OPT_CHECKSUM record is always present. The code to write those records is only enabled in debug builds by default. The main purpose of these records is internal testing. Preparing a backup can consume quite a few resources. It could be a good idea to do that somewhere else than the host that runs the database server. The question regarding innodb_checksum_algorithm=full_crc32 is: When were your data files created or rebuilt (by something like OPTIMIZE TABLE)? It is possible that you still have a few files in an older format where the checksum is more sloppy, meaning that mariadb-backup is unable to detect some intermittent corruption when a data page is being read by backup and written by server at roughly the same time. You could try enabling the setting innodb_checksum_algorithm=strict_full_crc32 so that even for the old data files, only a crc32 checksum (actually an exclusive OR of two CRC-32C) will be accepted. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
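To illustrate the point about a page being read by backup while the server is writing it, here is a minimal Python sketch. It is a toy model only: the page layout (payload followed by a 4-byte trailer), the helper names `make_page`, `page_is_valid` and `torn_read`, and the use of `zlib.crc32` (plain CRC-32, as a stand-in for CRC-32C) are all assumptions made for illustration, not the real InnoDB page format or checksum code.

```python
import zlib

PAGE_SIZE = 16384     # matches innodb_page_size = 16384 from the report
CHECKSUM_LEN = 4      # toy trailer length; NOT the real InnoDB page layout


def make_page(payload_byte: int) -> bytes:
    """Build a toy page: a constant payload followed by a 4-byte checksum."""
    payload = bytes([payload_byte]) * (PAGE_SIZE - CHECKSUM_LEN)
    # zlib.crc32 is plain CRC-32, used here only as a stand-in for CRC-32C.
    return payload + zlib.crc32(payload).to_bytes(CHECKSUM_LEN, "big")


def page_is_valid(page: bytes) -> bool:
    """Recompute the checksum over the payload and compare it to the trailer."""
    payload, trailer = page[:-CHECKSUM_LEN], page[-CHECKSUM_LEN:]
    return zlib.crc32(payload).to_bytes(CHECKSUM_LEN, "big") == trailer


def torn_read(old_page: bytes, new_page: bytes, split: int) -> bytes:
    """Simulate a reader that sees the first `split` bytes of the new version
    of a page and the remaining bytes of the old version."""
    return new_page[:split] + old_page[split:]
```

In this toy model, `page_is_valid(torn_read(make_page(0x11), make_page(0x22), PAGE_SIZE // 2))` is false, because a checksum over the whole page catches the mixed read, so a reader could simply retry. A weaker check that covers only part of the page could let such a mixed page through, which is the kind of undetected intermittent corruption described above.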
| Comment by Nuno [ 2022-08-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko, thanks! Ah yeah - in that case I'll keep "prepare" happening only on the other server. That's what it's been doing so far. Thank you - I understand now your question about innodb_checksum_algorithm. I tried now to prepare one of those corrupt backups after changing innodb_checksum_algorithm=strict_full_crc32 in the my.cnf on that server, but I got the same issue. To do this test, is the change only needed on the server that runs "prepare", or do I need to change it on the production server too, before the backup?
Yeah, I've been using MariaDB since 10.0 I think, so very likely there are many tables that haven't been rebuilt since. Although, looking at the tables appearing in the errors, I'm seeing some tables that have been for sure created while I was using 10.5. So it might not be innodb_checksum_algorithm's issue. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This could potentially be explained by | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am now even more convinced that this can be explained by | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nuno [ 2022-08-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko No problem. I'm thankful that MariaDB exists and is being maintained, and I'm glad that at least I was able to get a valid backup since I raised this issue. I'm also glad you were able to find the cause for this! I thought it would be harder to figure out. Hopefully that's it. Once the fix gets tested, will it be released as soon as possible, or will it wait until the scheduled quarterly release? Thanks!
| Comment by Nuno [ 2022-08-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This started happening again. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-08-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
In addition to | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nuno [ 2022-08-31 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks marko | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-09-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
nunop, did you have a chance to upgrade to 10.8.5 yet? Does that work for you? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nuno [ 2022-09-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi marko I upgraded everything to 10.8.5 as soon as it came out! So far, so good. The backups and restores seem to be working fine (with DDL changes, crash recovery from checkpoint, segfault, SELECT *, etc). I have only tested the restore on the SSD server.
Have a great day. Thank you for all your support. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
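For anyone who wants to automate this kind of backup testing, the following is a minimal Python sketch that scans the captured output of a prepare run for the corruption message quoted in this issue's title. The function name `find_corrupt_pages` and the idea of saving the prepare output to a text file are assumptions for illustration; the pattern is modelled only on that single quoted message and may not match other variants of the error.

```python
import re
from typing import List, Tuple

# Pattern modelled on the message quoted in this issue's title:
#   InnoDB: Failed to read page 4 from file 'database/table.ibd':
#   Page read from tablespace is corrupted.
CORRUPT_PAGE_RE = re.compile(
    r"InnoDB: Failed to read page (\d+) from file '([^']+)': "
    r"Page read from tablespace is corrupted"
)


def find_corrupt_pages(log_text: str) -> List[Tuple[str, int]]:
    """Return a (file, page_number) pair for every corruption message found."""
    return [(path, int(page)) for page, path in CORRUPT_PAGE_RE.findall(log_text)]
```

An empty result only means that this particular message did not appear. It is not proof that the backup is good, so a test restore followed by reading the tables remains the stronger check.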
| Comment by Marko Mäkelä [ 2022-09-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
nunop, thank you, this is great news. To be on the safe side, I would wait for additional feedback from you in a week or two, before claiming that this was fixed in 10.8.5. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
nunop, are your backups still doing fine in 10.8.5? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nuno [ 2022-10-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
marko - working perfectly!! | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-10-07 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
nunop, thank you. This was fixed by |