[MDEV-23464] mariabackup 10.3.21 fails every time since upgrading from 10.2.14 Created: 2020-08-12  Updated: 2023-04-16  Resolved: 2023-04-16

Status: Closed
Project: MariaDB Server
Component/s: Backup, mariabackup
Affects Version/s: 10.3.24
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Will Mayall Assignee: Marko Mäkelä
Resolution: Cannot Reproduce Votes: 0
Labels: crash, upgrade
Environment:

10.3.16, 10.3.21, 10.4.13, Centos 6, ext4


Attachments: Text File kochava-attribution-table-11aug2020.txt     File kochava-db111-mariabackup-corruption-11aug2020.txt.gz     File kochava-db111-pt-stalk-11aug2020.tar.gz     Text File kochava-db111-version-11aug2020.txt    

 Description   

Since upgrading to 10.3, I have not been able to get my production databases backed up using mariabackup. I have 40 servers, 1TB to 30TB; most servers have thousands of identical databases. Here is a snippet of one failure; mariabackup has also failed on other tables, and I am able to do a SELECT * from the affected table without a crash.

The backup fails about one to two hours in.

One thing all the tables reporting corruption have in common is that they contain TEXT or LONGTEXT columns. I also lftp'ed one image to a test server and upgraded it to 10.4.13.

[03] 2020-08-07 22:57:50 Database page corruption detected at page 32425, retrying...
[00] 2020-08-07 22:57:50 >> log scanned up to (291691141754836)
[03] 2020-08-07 22:57:50 Error: failed to read page after 10 retries. File ./app_15303/postmeta.ibd seems to be corrupted.
2020-08-07 22:57:50 0 [Note] InnoDB: Page dump in ascii and hex (8192 bytes):
InnoDB: End of page dump
2020-08-07 22:57:50 0 [Note] InnoDB: Compressed page type (11); stored checksum in field1 4040847758; calculated checksums for field1: crc32 2237276139, innodb 4040847758, none 3735928559; page LSN 129654058625760; page number (if stored to page already) 32425; space id (if stored to page already) 587011

The command I used:
/usr/bin/mariabackup --defaults-file=/etc/my.cnf --socket=/var/lib/mysql/mysql.sock --use-memory=64G --user=USER --password=PASSWORD --parallel=12 --no-lock --kill-long-queries-timeout=60 --tmpdir=/database/tmp --stream=xbstream --backup --target-dir /database/production/ | nc XXX.XXX.XXX.XXXX 9999



 Comments   
Comment by Will Mayall [ 2020-09-12 ]

I found a viable but ugly workaround.

Will Mayall

Comment by Vladislav Vaintroub [ 2021-05-10 ]

What's your workaround?
Why are you using --use-memory? mariabackup --backup does not have a buffer pool. Also, combining xbstream with --target-dir does not make sense.

Comment by Will Mayall [ 2021-05-10 ]

I am running mariabackup on a production server. I don't mind if mariabackup takes more time to complete as long as it doesn't cause issues (increased load) on my server, so I limit the resources mariabackup can use.

I have been working with MySQL for around 20 years and have never had an issue using --use-memory with xtrabackup; the error doesn't indicate that --use-memory is the problem.

I have tried many combinations with xbstream and mbstream, with the same results. I have to uncompress the tables that fail with corruption, and then the backup completes.

Again, I have around 100 servers, each with about 2K databases sharing the same schemas but holding different data. mariabackup has failed on every server I have tried, but the corruption doesn't happen on every table (maybe 10 to 20 tables per server), and it is only ONE table definition that mariabackup fails on.

CREATE TABLE `attribution` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`device_id` bigint(20) unsigned NOT NULL DEFAULT 0,
`install_id` bigint(20) NOT NULL,
`attribution_prompt` varchar(32) DEFAULT NULL,
`attribution_action` varchar(32) DEFAULT NULL,
`prompt_date` datetime DEFAULT NULL,
`attribution_date` datetime DEFAULT NULL,
`network_id` int(11) DEFAULT NULL,
`campaign_id` bigint(20) DEFAULT NULL,
`site_id` varchar(512) DEFAULT NULL,
`creative_id` varchar(512) DEFAULT NULL,
`identifiers` text DEFAULT NULL,
`ad_information` text DEFAULT NULL,
`device_information` text DEFAULT NULL,
`country_code2` char(2) DEFAULT NULL,
`country_code3` char(3) DEFAULT NULL,
`date_last_updated` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_id` (`device_id`),
KEY `network_id` (`network_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED

But each time it fails, I have to uncompress that table and start the backup over.

Comment by Vladislav Vaintroub [ 2021-05-11 ]

Looks like a page validation (page_is_corrupted) bug, which is for either marko or thiru.

Comment by Vladislav Vaintroub [ 2021-05-11 ]

Adding the most probable cause, MDEV-18025.

Comment by Marko Mäkelä [ 2023-04-14 ]

Backups are supposed to be taken and prepared (logs applied) with the same major version of mariadb-backup as the server.

That said, there should not have been any compatibility-breaking changes other than MDEV-12353 (10.5) and MDEV-14425 (10.8). Therefore, this could affect an upgrade to 10.4 as well. An upgrade to 10.5 or later will be refused with a message like the one in MDEV-24412.

Comment by Marko Mäkelä [ 2023-04-14 ]

Is this reproducible with newer versions of MariaDB?

Comment by Will Mayall [ 2023-04-14 ]

I no longer work for the company where I experienced the issue.  
Will

Comment by Marko Mäkelä [ 2023-04-16 ]

Thank you.

In InnoDB itself, the fix of MDEV-18025 only affected the treatment of encrypted page_compressed=1 tables, while this error is about ROW_FORMAT=COMPRESSED tables.

For mariadb-backup, MDEV-18025 changed some things in xb_fil_cur_read(). The intention was to change the treatment of page_compressed=1 tables only. It seems possible that one cursor->page_size should actually have been cursor->zip_size:

+		const ulint* const end = reinterpret_cast<ulint*>(
+			page + cursor->page_size);

Already in MariaDB 10.3.21 the code looks different:

static bool page_is_corrupted(const byte *page, ulint page_no,
                              const xb_fil_cur_t *cursor,
                              const fil_space_t *space)
{
        byte tmp_frame[UNIV_PAGE_SIZE_MAX];
        byte tmp_page[UNIV_PAGE_SIZE_MAX];
        const ulint page_size = cursor->page_size.physical();
        ulint page_type = mach_read_from_2(page + FIL_PAGE_TYPE);
                const ulint* const end = reinterpret_cast<const ulint*>(
                        page + page_size);

This definitely looks correct to me: the bound is already computed from the physical page size. I do not believe that this is a regression due to MDEV-18025.

To analyze the bug, we would have needed a copy of the data file page that was claimed to be corrupted.

Generated at Thu Feb 08 09:22:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.