MariaDB Server / MDEV-23464

mariabackup 10.3.21 fails every time since upgrading from 10.2.14

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 10.3.24
    • Fix Version/s: N/A
    • Component/s: Backup, mariabackup
    • Environment: 10.3.16, 10.3.21, 10.4.13, CentOS 6, ext4

    Description

      Since upgrading to 10.3, I have not been able to get my production databases backed up using mariabackup. I have 40 servers, 1TB to 30TB, and most servers have thousands of identical databases. Here is a snippet of one failure; mariabackup has also failed on other tables, and I am able to do a SELECT * from the affected table without a crash.

      The backup fails about one to two hours into the run.

      One thing all the tables reported as corrupted have in common is that they contain TEXT or LONGTEXT columns. I also lftp'ed one image to a test server and upgraded it to 10.4.13.

      [03] 2020-08-07 22:57:50 Database page corruption detected at page 32425, retrying...
      [00] 2020-08-07 22:57:50 >> log scanned up to (291691141754836)
      [03] 2020-08-07 22:57:50 Error: failed to read page after 10 retries. File ./app_15303/postmeta.ibd seems to be corrupted.
      2020-08-07 22:57:50 0 [Note] InnoDB: Page dump in ascii and hex (8192 bytes):
      InnoDB: End of page dump
      2020-08-07 22:57:50 0 [Note] InnoDB: Compressed page type (11); stored checksum in field1 4040847758; calculated checksums for field1: crc32 2237276139, innodb 4040847758, none 3735928559; page LSN 129654058625760; page number (if stored to page already) 32425; space id (if stored to page already) 587011

      The command I used:
      /usr/bin/mariabackup --defaults-file=/etc/my.cnf --socket=/var/lib/mysql/mysql.sock --use-memory=64G --user=USER --password=PASSWORD --parallel=12 --no-lock --kill-long-queries-timeout=60 --tmpdir=/database/tmp --stream=xbstream --backup --target-dir /database/production/ | nc XXX.XXX.XXX.XXXX 9999


        Activity

          wamayall Will Mayall created issue -
          elenst Elena Stepanova made changes -
          Component/s: mariabackup [ 14500 ]
          Fix Version/s: 10.3 [ 22126 ]
          Assignee: Vladislav Lesin [ vlad.lesin ]
          wamayall Will Mayall added a comment -

          I found a viable but ugly workaround.

          Will Mayall

          wlad Vladislav Vaintroub made changes -
          Assignee: Vladislav Lesin [ vlad.lesin ] → Vladislav Vaintroub [ wlad ]
          wlad Vladislav Vaintroub added a comment (edited) -

          What's your workaround?
          Why are you using --use-memory? mariabackup --backup does not have a buffer pool. Also, combining --stream=xbstream with --target-dir does not make sense.
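
          For illustration only, a minimal sketch of a streaming invocation along these lines; USER, PASSWORD, and backup-host:9999 are placeholders, and whether --target-dir is still required when streaming may vary by version:

          # Sketch: stream the backup over the network without --use-memory
          # (which is not used by --backup). USER, PASSWORD and backup-host:9999
          # are placeholders for this example.
          mariabackup --backup \
              --user=USER --password=PASSWORD \
              --parallel=12 \
              --stream=xbstream | nc backup-host 9999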

          wamayall Will Mayall added a comment -

          I am running mariabackup on a production server. I would rather mariabackup take more time to complete than cause problems (increased load) on my server, so I limit the resources mariabackup can use.

          I have been working with MySQL for around 20 years and have never had an issue using --use-memory with xtrabackup; the error does not indicate that --use-memory is the problem.

          I have tried many combinations with xbstream and mbstream, with the same results. I have to uncompress the tables that fail with corruption, and then the backup completes.

          Again, I have around 100 servers, each with about 2K databases sharing the same schemas but different data. mariabackup has failed on every server I have tried, but the corruption is not reported for every table, maybe 10 to 20 tables, and it is only ONE table definition that mariabackup fails on.

          CREATE TABLE `attribution` (
          `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
          `device_id` bigint(20) unsigned NOT NULL DEFAULT 0,
          `install_id` bigint(20) NOT NULL,
          `attribution_prompt` varchar(32) DEFAULT NULL,
          `attribution_action` varchar(32) DEFAULT NULL,
          `prompt_date` datetime DEFAULT NULL,
          `attribution_date` datetime DEFAULT NULL,
          `network_id` int(11) DEFAULT NULL,
          `campaign_id` bigint(20) DEFAULT NULL,
          `site_id` varchar(512) DEFAULT NULL,
          `creative_id` varchar(512) DEFAULT NULL,
          `identifiers` text DEFAULT NULL,
          `ad_information` text DEFAULT NULL,
          `device_information` text DEFAULT NULL,
          `country_code2` char(2) DEFAULT NULL,
          `country_code3` char(3) DEFAULT NULL,
          `date_last_updated` datetime DEFAULT NULL,
          PRIMARY KEY (`id`),
          KEY `device_id` (`device_id`),
          KEY `network_id` (`network_id`)
          ) ENGINE=InnoDB DEFAULT CHARSET=utf8 ROW_FORMAT=COMPRESSED

          But each time it fails, I have to uncompress that table and start the backup over.
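
          For illustration, the "uncompress" step described above might look like the following sketch; the database name is a placeholder and DYNAMIC is assumed as the target row format:

          # Sketch of the workaround described above: rebuild the affected table
          # without ROW_FORMAT=COMPRESSED before re-running the backup.
          # The database name (some_db) is a placeholder; DYNAMIC is an assumed target.
          mysql -e "ALTER TABLE some_db.attribution ROW_FORMAT=DYNAMIC;"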

          wlad Vladislav Vaintroub made changes -
          Assignee: Vladislav Vaintroub [ wlad ] → Marko Mäkelä [ marko ]

          wlad Vladislav Vaintroub added a comment -

          Looks like a page validation (page_is_corrupted) bug, which is either marko's or thiru's area.


          wlad Vladislav Vaintroub added a comment -

          Adding the most probable cause: MDEV-18025.

          wlad Vladislav Vaintroub made changes -
          serg Sergei Golubchik made changes -
          Workflow: MariaDB v3 [ 112357 ] → MariaDB v4 [ 142183 ]

          marko Marko Mäkelä added a comment -

          Backups are supposed to be taken and prepared (logs applied) with the same major version of mariadb-backup as the server.

          That said, there should not have been any compatibility-breaking changes other than MDEV-12353 (10.5) and MDEV-14425 (10.8). Therefore, this could affect an upgrade to 10.4 as well. An upgrade to 10.5 or later will be refused with a message like the one in MDEV-24412.
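
          For illustration, a minimal sketch of "taken and prepared by the same major version" in practice; /backups/full, USER and PASSWORD are placeholders:

          # Sketch: take the backup and prepare (apply logs) with the same major
          # version of mariabackup as the server (e.g. a 10.3.x mariabackup for a
          # 10.3 server). /backups/full, USER and PASSWORD are placeholders.
          mariabackup --backup --target-dir=/backups/full --user=USER --password=PASSWORD
          mariabackup --prepare --target-dir=/backups/full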

          marko Marko Mäkelä made changes -
          Fix Version/s: 10.4 [ 22408 ]
          Labels: crash → crash upgrade

          marko Marko Mäkelä added a comment -

          Is this reproducible with newer versions of MariaDB?

          marko Marko Mäkelä made changes -
          Status: Open [ 1 ] → Needs Feedback [ 10501 ]
          wamayall Will Mayall added a comment -

          I no longer work for the company where I experienced the issue.  
          Will


          marko Marko Mäkelä made changes -
          Status: Needs Feedback [ 10501 ] → Open [ 1 ]
          marko Marko Mäkelä made changes -

          marko Marko Mäkelä added a comment -

          Thank you.

          In InnoDB itself, the fix of MDEV-18025 only affected the treatment of encrypted page_compressed=1 tables, while this error is about ROW_FORMAT=COMPRESSED tables.

          For mariadb-backup, MDEV-18025 changed some things in xb_fil_cur_read(). The intention was to change the treatment of page_compressed=1 tables only. It seems possible that one cursor->page_size should actually have been cursor->zip_size:

          +		const ulint* const end = reinterpret_cast<ulint*>(
          +			page + cursor->page_size);
          

          Already in MariaDB 10.3.21 the code looks different:

          static bool page_is_corrupted(const byte *page, ulint page_no,
                                        const xb_fil_cur_t *cursor,
                                        const fil_space_t *space)
          {
                  byte tmp_frame[UNIV_PAGE_SIZE_MAX];
                  byte tmp_page[UNIV_PAGE_SIZE_MAX];
                  const ulint page_size = cursor->page_size.physical();
                  ulint page_type = mach_read_from_2(page + FIL_PAGE_TYPE);
          …
                          const ulint* const end = reinterpret_cast<const ulint*>(
                                  page + page_size);
          

          This looks definitely correct to me. I do not believe that this is a regression due to MDEV-18025.

          To analyze the bug, we would have needed a copy of the data file page that was claimed to be corrupted.
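
          For future reference, a single page like the one flagged in the description could be captured with something along these lines; the file name and page number come from the log snippet above, and the 8192-byte physical page size is assumed from the page dump:

          # Sketch: extract the one 8 KiB page reported as corrupted so it can be
          # attached to the report. The file name and page number are taken from the
          # log above; the 8192-byte physical page size is assumed from the page dump.
          dd if=./app_15303/postmeta.ibd of=page_32425.bin bs=8192 skip=32425 count=1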

          marko Marko Mäkelä made changes -
          Resolution date: 2023-04-16 07:58:34
          marko Marko Mäkelä made changes -
          Fix Version/s: N/A [ 14700 ]
          Fix Version/s: 10.3 [ 22126 ]
          Fix Version/s: 10.4 [ 22408 ]
          Resolution: Cannot Reproduce [ 5 ]
          Status: Open [ 1 ] → Closed [ 6 ]

          People

            Assignee: Marko Mäkelä (marko)
            Reporter: Will Mayall (wamayall)
            Votes: 0
            Watchers: 5

