MariaDB Server / MDEV-23464

mariabackup 10.3.21 fails every time since upgrading from 10.2.14

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 10.3.24
    • Fix Version/s: N/A
    • Component/s: Backup, mariabackup
    • Environment: 10.3.16, 10.3.21, 10.4.13, CentOS 6, ext4

    Description

      Since upgrading to 10.3, I have not been able to get my production databases backed up using mariabackup. I have 40 servers, 1TB to 30TB, and most servers have thousands of identical databases. Here is a snippet of one failure; mariabackup has also failed on other tables, and I am able to do a SELECT * from the affected table without a crash.

      The backup fails about one to two hours into the run.

      One thing all the tables reported as corrupted have in common is that they contain TEXT or LONGTEXT columns. I also lftp'ed one image to a test server and upgraded it to 10.4.13.

      [03] 2020-08-07 22:57:50 Database page corruption detected at page 32425, retrying...
      [00] 2020-08-07 22:57:50 >> log scanned up to (291691141754836)
      [03] 2020-08-07 22:57:50 Error: failed to read page after 10 retries. File ./app_15303/postmeta.ibd seems to be corrupted.
      2020-08-07 22:57:50 0 [Note] InnoDB: Page dump in ascii and hex (8192 bytes):
      InnoDB: End of page dump
      2020-08-07 22:57:50 0 [Note] InnoDB: Compressed page type (11); stored checksum in field1 4040847758; calculated checksums for field1: crc32 2237276139, innodb 4040847758, none 3735928559; page LSN 129654058625760; page number (if stored to page already) 32425; space id (if stored to page already) 587011

      The command I used:
      /usr/bin/mariabackup --defaults-file=/etc/my.cnf --socket=/var/lib/mysql/mysql.sock --use-memory=64G --user=USER --password=PASSWORD --parallel=12 --no-lock --kill-long-queries-timeout=60 --tmpdir=/database/tmp --stream=xbstream --backup --target-dir /database/production/ | nc XXX.XXX.XXX.XXXX 9999


        Activity

          wamayall Will Mayall created issue -
          elenst Elena Stepanova made changes -
          Component/s: mariabackup [ 14500 ]
          Fix Version/s: 10.3 [ 22126 ]
          Assignee: Vladislav Lesin [ vlad.lesin ]
          wamayall Will Mayall added a comment -

          I found a viable but ugly workaround.

          Will Mayall

          wlad Vladislav Vaintroub made changes -
          Assignee: Vladislav Lesin [ vlad.lesin ] → Vladislav Vaintroub [ wlad ]
          wlad Vladislav Vaintroub added a comment (edited) -

          What's your workaround?
          Why are you using --use-memory? mariabackup --backup does not have a buffer pool. Also, combining --stream=xbstream with --target-dir does not make sense.
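
          For illustration only, a minimal sketch of a streaming invocation along these lines; USER, PASSWORD, and backup-host:9999 are placeholders, and whether --target-dir is still required when streaming may vary by version:

          # Sketch: stream the backup over the network without --use-memory
          # (which is not used by --backup). USER, PASSWORD and backup-host:9999
          # are placeholders for this example.
          mariabackup --backup \
              --user=USER --password=PASSWORD \
              --parallel=12 \
              --stream=xbstream | nc backup-host 9999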

          wamayall Will Mayall added a comment -

          I am running mariabackup on a production server. I would rather mariabackup take more time to complete than cause problems (increased load) on my server, so I limit the resources mariabackup can use.

          I have been working with MySQL for around 20 years and have never had an issue using --use-memory with xtrabackup; the error does not indicate that --use-memory is the problem.

          I have tried many combinations with xbstream and mbstream, with the same results. I have to uncompress the tables that fail with corruption, and then the backup completes.

          Again, I have around 100 servers, each with about 2K databases sharing the same schemas but different data. mariabackup has failed on every server I have tried, but the corruption is not reported for every table, maybe 10 to 20 tables, and it is only ONE table definition that mariabackup fails on.

          CREATE TABLE `attribution` (
          `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
          `device_id` bigint(20) unsigned NOT NULL DEFAULT 0,
          `install_id` bigint(20) NOT NULL,
          `attribution_prompt` varchar(32) DEFAULT NULL,
          `attribution_action` varchar(32) DEFAULT NULL,
          `prompt_date` datetime DEFAULT NULL,
          `attribution_date` datetime DEFAULT NULL,
          `network_id` int(11) DEFAULT NULL,
          `campaign_id` bigint(20) DEFAULT NULL,
          `site_id` varchar(512) DEFAULT NULL,
          `creative_id` varchar(512) DEFAULT NULL,
          `identifiers` text DEFAULT NULL,
          `ad_information` text DEFAULT NULL,
          `device_information` text DEFAULT NULL,
          `country_code2` char(2) DEFAULT NULL,
          `country_code3` char(3) DEFAULT NULL,
          `date_last_updated` datetime DEFAULT NULL,
          PRIMARY KEY (`id`),
          KEY `device_id` (`device_id`),
          KEY `network_id` (`network_id`)
          ) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED

          But each time it fails, I have to uncompress that table and start the backup over.
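
          For illustration, the "uncompress" step described above might look like the following sketch; the database name is a placeholder and DYNAMIC is assumed as the target row format:

          # Sketch of the workaround described above: rebuild the affected table
          # without ROW_FORMAT=COMPRESSED before re-running the backup.
          # The database name (some_db) is a placeholder; DYNAMIC is an assumed target.
          mysql -e "ALTER TABLE some_db.attribution ROW_FORMAT=DYNAMIC;"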

          wlad Vladislav Vaintroub made changes -
          Assignee: Vladislav Vaintroub [ wlad ] → Marko Mäkelä [ marko ]

          wlad Vladislav Vaintroub added a comment -

          Looks like a page validation (page_is_corrupted) bug, which is either marko's or thiru's area.


          wlad Vladislav Vaintroub added a comment -

          Adding the most probable cause: MDEV-18025.

          wlad Vladislav Vaintroub made changes -
          serg Sergei Golubchik made changes -
          Workflow: MariaDB v3 [ 112357 ] → MariaDB v4 [ 142183 ]

          marko Marko Mäkelä added a comment -

          Backups are supposed to be taken and prepared (logs applied) with the same major version of mariadb-backup as the server.

          That said, there should not have been any compatibility-breaking changes other than MDEV-12353 (10.5) and MDEV-14425 (10.8). Therefore, this could affect an upgrade to 10.4 as well. An upgrade to 10.5 or later will be refused with a message like the one in MDEV-24412.
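
          For illustration, a minimal sketch of "taken and prepared by the same major version" in practice; /backups/full, USER and PASSWORD are placeholders:

          # Sketch: take the backup and prepare (apply logs) with the same major
          # version of mariabackup as the server (e.g. a 10.3.x mariabackup for a
          # 10.3 server). /backups/full, USER and PASSWORD are placeholders.
          mariabackup --backup --target-dir=/backups/full --user=USER --password=PASSWORD
          mariabackup --prepare --target-dir=/backups/full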

          marko Marko Mäkelä made changes -
          Fix Version/s: 10.4 [ 22408 ]
          Labels: crash → crash upgrade

          marko Marko Mäkelä added a comment -

          Is this reproducible with newer versions of MariaDB?

          marko Marko Mäkelä made changes -
          Status: Open [ 1 ] → Needs Feedback [ 10501 ]
          wamayall Will Mayall added a comment -

          I no longer work for the company where I experienced the issue.  
          Will


          marko Marko Mäkelä made changes -
          Status: Needs Feedback [ 10501 ] → Open [ 1 ]
          marko Marko Mäkelä made changes -

          marko Marko Mäkelä added a comment -

          Thank you.

          In InnoDB itself, the fix of MDEV-18025 only affected the treatment of encrypted page_compressed=1 tables, while this error is about ROW_FORMAT=COMPRESSED tables.

          For mariadb-backup, MDEV-18025 changed some things in xb_fil_cur_read(). The intention was to change the treatment of page_compressed=1 tables only. It seems possible that one cursor->page_size should actually have been cursor->zip_size:

          +		const ulint* const end = reinterpret_cast<ulint*>(
          +			page + cursor->page_size);
          

          Already in MariaDB 10.3.21 the code looks different:

          static bool page_is_corrupted(const byte *page, ulint page_no,
                                        const xb_fil_cur_t *cursor,
                                        const fil_space_t *space)
          {
                  byte tmp_frame[UNIV_PAGE_SIZE_MAX];
                  byte tmp_page[UNIV_PAGE_SIZE_MAX];
                  const ulint page_size = cursor->page_size.physical();
                  ulint page_type = mach_read_from_2(page + FIL_PAGE_TYPE);
          …
                          const ulint* const end = reinterpret_cast<const ulint*>(
                                  page + page_size);
          

          This looks definitely correct to me. I do not believe that this is a regression due to MDEV-18025.

          To analyze the bug, we would have needed a copy of the data file page that was claimed to be corrupted.
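
          For future reference, a single page like the one flagged in the description could be captured with something along these lines; the file name and page number come from the log snippet above, and the 8192-byte physical page size is assumed from the page dump:

          # Sketch: extract the one 8 KiB page reported as corrupted so it can be
          # attached to the report. The file name and page number are taken from the
          # log above; the 8192-byte physical page size is assumed from the page dump.
          dd if=./app_15303/postmeta.ibd of=page_32425.bin bs=8192 skip=32425 count=1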

          marko Marko Mäkelä made changes -
          Resolution date: 2023-04-16 07:58:34
          marko Marko Mäkelä made changes -
          Fix Version/s: N/A [ 14700 ]
          Fix Version/s: 10.3 [ 22126 ]
          Fix Version/s: 10.4 [ 22408 ]
          Resolution: Cannot Reproduce [ 5 ]
          Status: Open [ 1 ] → Closed [ 6 ]

          People

            Assignee: Marko Mäkelä (marko)
            Reporter: Will Mayall (wamayall)
            Votes: 0
            Watchers: 5

