MariaDB Server / MDEV-22929

MariaBackup option to report and/or continue when corruption is encountered

Details

    Description

      Currently Mariabackup aborts when it detects any InnoDB corruption. Needs an option to complete the backup and flag or log the corruption rather than leaving the entire server with no backup.

      In situations where Mariabackup detects corruption while taking a backup, it currently aborts where InnoDB would assert, making backing up a corrupted server impossible.

      This is obviously not practical when corruption in one table prevents making backups of the entire server.

      Would it be possible to address this need by adding a force option like innodb_force_recovery=1 to mariabackup, for instance?
      -----------------
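      From Julien - Here is an additional explanation (https://jira.mariadb.org/browse/MDEV-21109?focusedCommentId=160067&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-160067) of why this would be important to do in 10.6, ralf.gebhardt@mariadb.com.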

      -------------------
      From Vlad Lesin - Here is a detailed description of the feature from the commit message:

      The new option --log-innodb-page-corruption is introduced.

      When this option is set, the backup is not interrupted if a corrupted
      InnoDB page is detected. Instead, it logs every corrupted page it finds to
      the innodb_corrupted_pages file in the backup directory and finishes with
      an error.
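As a rough illustration of how a wrapper script might react to this behaviour, the sketch below checks a backup directory for the innodb_corrupted_pages file named above. The helper name is hypothetical, and testing only for the file's presence is an assumption: the file's internal format is not described in this ticket, so the sketch does not try to parse it.

```python
from pathlib import Path

# File name taken from the commit message above.
CORRUPTED_PAGES_FILE = "innodb_corrupted_pages"


def has_corruption_report(backup_dir: str) -> bool:
    """Return True if the backup directory contains a corrupted-pages list.

    Hypothetical helper: it only checks that the file exists, because the
    file format is not specified in this ticket.
    """
    return (Path(backup_dir) / CORRUPTED_PAGES_FILE).is_file()
```

A script could call this after a backup that exited with an error, to tell a run that logged corrupted pages apart from one that failed for another reason.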

      For an incremental backup, corrupted pages are also copied to the .delta
      file, because we can't do an LSN check for such pages during backup.
      innodb_corrupted_pages will also be created in the incremental backup
      directory.

      During --prepare, the list of corrupted pages is read from the file just
      after the redo log is applied, and each page on the list is checked to see
      whether it is allocated in its tablespace. If it is not allocated, it is
      zeroed out, flushed to the tablespace, and removed from the list. If all
      pages are removed from the list, --prepare finishes successfully and the
      innodb_corrupted_pages file is removed from the backup directory. Otherwise
      --prepare finishes with an error message, and innodb_corrupted_pages lists
      the pages that were detected as corrupted during backup and are allocated
      in their tablespaces. This means the backup directory contains corrupted
      InnoDB pages, and the backup cannot be considered consistent.
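The decision made for each listed page during --prepare can be modelled in a few lines. This is only a sketch of the logic described above, not mariabackup's code: the (tablespace name, page number) representation of a page and the is_allocated and zero_out callbacks are assumptions introduced for illustration.

```python
from typing import Callable, Iterable, List, Tuple

# A page is identified here by (tablespace name, page number); this
# representation is an assumption made for the sketch.
PageId = Tuple[str, int]


def resolve_corrupted_pages(
    corrupted: Iterable[PageId],
    is_allocated: Callable[[PageId], bool],
    zero_out: Callable[[PageId], None],
) -> List[PageId]:
    """Model the --prepare handling of the corrupted-pages list.

    Pages that are not allocated are zeroed out and dropped from the list;
    allocated pages stay on it. An empty result corresponds to a successful
    --prepare, a non-empty one to --prepare finishing with an error.
    """
    remaining = []
    for page in corrupted:
        if is_allocated(page):
            remaining.append(page)  # genuinely corrupted data: keep it listed
        else:
            zero_out(page)  # unused page: safe to overwrite with zeroes
    return remaining
```

For example, with pages ("test/t1", 3) and ("test/t2", 7) on the list and only the first one allocated, the second is zeroed out and the first remains, so the modelled --prepare would end with an error.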

      For incremental --prepare, corrupted pages from .delta files are applied
      to the base backup, innodb_corrupted_pages is read from both the base and
      incremental directories, and the same action is performed on the corrupted
      pages list as for a full --prepare. The innodb_corrupted_pages file is
      modified or removed only in the base directory.
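The incremental case can be modelled the same way. In the sketch below, combining the two lists as a set union is an assumption (the text only says both files are read), the page representation and the is_allocated callback are hypothetical, and the zeroing of unallocated pages is left out for brevity.

```python
from typing import Callable, Set, Tuple

PageId = Tuple[str, int]  # (tablespace name, page number), assumed shape


def prepare_incremental(
    base_pages: Set[PageId],
    incremental_pages: Set[PageId],
    is_allocated: Callable[[PageId], bool],
) -> Set[PageId]:
    """Return the corrupted-pages list to keep in the base directory.

    Both lists are combined, then unallocated pages are dropped, as in a
    full --prepare. An empty result means the file would be removed from
    the base directory.
    """
    combined = base_pages | incremental_pages
    return {page for page in combined if is_allocated(page)}
```

Only the base directory's list is updated from the result, matching the statement that the file is modified or removed only there.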

      If DDL happens during backup, it is also processed at the end of backup
      to have correct tablespace names in innodb_corrupted_pages.

          Activity

            juan.vera Juan created issue -
            juan.vera Juan made changes -
            Field Original Value New Value
            julien.fritsch Julien Fritsch made changes -
            Assignee Ralf Gebhardt [ ralf.gebhardt@mariadb.com ]
            nicklamb Nick (Inactive) made changes -
            Summary Currently Mariabackup aborts when it detects any InnoDB corruption. Needs an option to complete the backup and flag or log the corruption rather than leaving the entire server with no backup. MariaBackup option to report and/or continue when corruption is encountered
            nicklamb Nick (Inactive) made changes -
            From Nick, Additional Thoughts: I believe we could achieve this simply by reverting or modifying https://jira.mariadb.org/browse/MDEV-20607. That change made backup crash when an error occurs, instead of reporting errors in the log and then printing *Completed: Ok*. My thought is that it would be better to continue the backup (or have a flag to allow this) and report *Completed - Errors encountered, check log* instead. This would allow customers to use backup as a corruption detection tool and allow for partial backups in the case of corrupted tables.

            vlad.lesin Vladislav Lesin added a comment - MDEV-20607 concerns only InnoDB initialization. It does not touch the page consistency verification code, so no, this issue cannot be implemented just by reverting MDEV-20607.

            vlad.lesin Vladislav Lesin added a comment - I would not use mariabackup as a corruption detection tool. A backup tool should not be used for unintended purposes. We have other tools for that, such as CHECK TABLE and innochecksum. It would be better to add this functionality to one of those tools. See also this comment: https://jira.mariadb.org/browse/MDEV-21109?focusedCommentId=160912&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-160912

            vlad.lesin Vladislav Lesin added a comment - I have some doubts about the necessity of implementing such an option. It can potentially lead to inconsistency in restored data. We already have a dangerous option of this kind, --no-lock, which I would like to remove: our customers use it and then ask us to find the problem in their backup, even though the documentation says the option does not guarantee data consistency. I think that if we implement this, it will add work for support and engineering, as issues will be harder to diagnose.

            serg Sergei Golubchik added a comment - Not that it matters much, but the "corruption_found" flag seems a bit redundant. The sheer presence of the "backup_corrupted" file with the list of corrupted tables already means that the corruption was found, doesn't it?
            vlad.lesin Vladislav Lesin added a comment (edited) - If we decide to implement it, then during --prepare we could also check whether each corrupted page is allocated in its tablespace; if it is not allocated, zero it out and do not treat it as a corrupted page. In MDEV-21109 there are non-allocated pages in the tablespace that do not pass validation during backup because they contain a wrong page id and/or page number, but there must not be non-zeroed, non-allocated pages in tablespaces.
            vlad.lesin Vladislav Lesin made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Assignee Ralf Gebhardt [ ralf.gebhardt@mariadb.com ] Vladislav Lesin [ vlad.lesin ]

            vlad.lesin Vladislav Lesin added a comment - According to our discussion in Slack, this and MDEV-23971 should be joined, as they have the same source and solve the same issue.

            So we introduce a new option, --log-innodb-pages-corruption. When this option is used, mariabackup does not stop the backup process if InnoDB page corruption is detected; it continues the backup and logs corrupted pages in the "backup_corrupted" file in the backup destination directory. After the backup is taken, mariabackup finishes execution with an error and an error message in the backup log.

            In the --prepare phase, mariabackup checks each page from the list in the "backup_corrupted" file: if the page is not allocated in the tablespace, it is zeroed out, flushed to the data file, and removed from the corrupted pages list, and a corresponding message is logged to the backup log (stdout). If all pages from the list are restored successfully in this manner, the "backup_corrupted" file is deleted and "mariabackup --prepare" returns success. Otherwise the "backup_corrupted" file will contain the list of pages that were not restored, and "mariabackup --prepare" will finish with an error and an error message in the backup log.
            vlad.lesin Vladislav Lesin made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            vlad.lesin Vladislav Lesin made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            vlad.lesin Vladislav Lesin added a comment (edited) - I pushed the bb-10.2-MDEV-22929-log_corrupted_pages branch for testing. There will be conflicts on merging it to 10.[2345]. The conflicts are resolved in branches 10.[345]-MDEV-22929-log_corrupted_pages. wlad, could you please review it?
            vlad.lesin Vladislav Lesin made changes -
            Assignee Vladislav Lesin [ vlad.lesin ] Vladislav Vaintroub [ wlad ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            vlad.lesin Vladislav Lesin added a comment - Testing looks good to me: https://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.2-MDEV-22929-log_corrupted_pages
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Vladislav Lesin [ vlad.lesin ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            wlad Vladislav Vaintroub added a comment - Looks fine.
            vlad.lesin Vladislav Lesin made changes -
            Fix Version/s 10.2.37 [ 25112 ]
            Fix Version/s 10.3.28 [ 25111 ]
            Fix Version/s 10.4.18 [ 25110 ]
            Fix Version/s 10.5.9 [ 25109 ]
            Fix Version/s 10.6.0 [ 24431 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            vlad.lesin Vladislav Lesin made changes -
            greenman Ian Gilfillan made changes -
            greenman Ian Gilfillan added a comment - This needs to be documented - created MDEV-24479
            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 110171 ] MariaDB v4 [ 134292 ]
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 110821 185032 126984

            People

              vlad.lesin Vladislav Lesin
              juan.vera Juan
              Votes: 2
              Watchers: 12
