[MDEV-29938] InnoDB: Assertion failure in btr0pcur.cc line 532 Created: 2022-11-02 Updated: 2023-02-19 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Backup, Storage Engine - InnoDB |
| Affects Version/s: | 10.4.25, 10.4.26 |
| Fix Version/s: | 10.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Eugene | Assignee: | Marko Mäkelä |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | corruption | ||
| Environment: |
Gentoo linux 5.18 and newer, AMD EPYC 7451 CPU |
||
| Description |
|
The issue appeared after updating to 10.4.25. While running mysqldump for backup (it recurses over all the databases present in the dataset), mysqld (a member of a galera-26.4.12 cluster) started to crash with symptoms like this:
with the following
and exiting with signal 6. This doesn't happen on every backup execution: after dumping the whole dataset successfully 4-5 times, a crash like this happens. This looks like a serious bug. The whole log is attached. Is there any possible workaround for this issue? |
| Comments |
| Comment by Marko Mäkelä [ 2022-11-03 ] | ||||
|
The crash should have been fixed in The cause of the corruption is somewhat of a mystery. It appears that a data page from the wrong tablespace has been written to a file. We added some debug assertion to catch an incorrect write, but that assertion has never fired in our internal testing. Up to MariaDB Server 10.4, the check in fil_io() looks like this:
In Are you using innodb_encrypt_tables or scrubbing? The latter was broken until | ||||
| Comment by Eugene [ 2022-11-03 ] | ||||
|
Hello Marko, Neither innodb_encrypt_tables nor scrubbing is used. This is not the underlying storage: the same hardware is used on all the servers, and none of them has issues until you try to run a backup with mysqldump. I moved the backup role across the servers, and the behavior is the same: mariadb is stable as long as you are not performing backups, but when backups are performed, the node crashes within a few days, always while dumping the data. I'll try to replace xfs with ext4 and see whether the behavior changes, just in case. The same behavior was also seen on a node running kernel 5.10, so this is not the kernel. And it never happened before mariadb-10.4.25 | ||||
| Comment by Marko Mäkelä [ 2022-11-15 ] | ||||
|
euglorg, if you are using mariadb-backup or file system snapshots for backups, then I believe that there are bugs in it, at least MDEV-29943 and MDEV-21403. MariaDB 10.5 and later releases are not affected by those, thanks to the new redo log record format ( You should be aware that once something is corrupted, the corruption does not usually heal by itself. Physical backups (as opposed to logical SQL dumps) will propagate such corruption. Sometimes, corruption can be healed by rebuilding the table, for example, by OPTIMIZE TABLE. I would recommend that you rebuild all your data from a SQL dump, to get into a known good state. Since you mention Galera, there are some known problems with its snapshot transfer, both with wsrep_sst_method=mariabackup (see the above recovery bugs) and with wsrep_sst_method=rsync. If you ever initiated Galera SST with wsrep_sst_method=rsync (the default setting) from a source node that is older than 10.4.25, a likely explanation of the corruption is that writes were actually not blocked during the snapshot transfer. Even after the snapshot transfer was refactored in | ||||
| Comment by Eugene [ 2022-11-15 ] | ||||
|
Hello Marko. Thank you for the advice.
|