MariaDB Server / MDEV-29911

InnoDB recovery and mariadb-backup --prepare fail to report detailed progress

Details

    Description

      A user reports that mariabackup --prepare has been running for more than 12 hours. It reports page numbers to recover, then goes through multiple iterations of recovery with no indication of how much work or time remains before the process completes.

      It would be helpful if `--prepare` reported the percentage of page recovery completed per iteration, plus a global completion percentage consolidated across all batches.
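A minimal sketch of the two requested figures, per-batch and global completion. Everything here is hypothetical and not part of mariadb-backup: the class name, its fields, and the assumption that global progress can be measured as the fraction of the LSN range between the checkpoint and the end of the log that has already been applied.

```python
from dataclasses import dataclass


@dataclass
class RecoveryProgress:
    """Hypothetical progress tracker; not an InnoDB or mariadb-backup API."""
    start_lsn: int  # assumed: LSN of the checkpoint where recovery starts
    end_lsn: int    # assumed: LSN of the end of the log, known after a scan

    def batch_percent(self, pages_done: int, pages_total: int) -> float:
        """Percentage of pages already recovered within the current batch."""
        if pages_total <= 0:
            return 100.0
        return 100.0 * pages_done / pages_total

    def global_percent(self, applied_up_to_lsn: int) -> float:
        """Percentage of the whole log range that has been applied so far."""
        span = self.end_lsn - self.start_lsn
        if span <= 0:
            return 100.0
        # Clamp so that out-of-range LSNs never yield <0% or >100%.
        done = min(max(applied_up_to_lsn - self.start_lsn, 0), span)
        return 100.0 * done / span


progress = RecoveryProgress(start_lsn=1_000, end_lsn=5_000)
print(f"batch:  {progress.batch_percent(250, 1_000):.1f}%")  # batch:  25.0%
print(f"global: {progress.global_percent(2_000):.1f}%")      # global: 25.0%
```

Measuring global progress by LSN rather than by page count is one possible choice: the total number of pages across all batches may not be known up front, while the LSN range is fixed once the end of the log has been found.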

          Activity

            Marko Mäkelä added a comment:

            The out-of-memory handling in the 10.6 (pre MDEV-14425) version of this will be inferior to the later version in two respects. If recovery runs out of memory in the 10.6 version:

            • It will restart parsing and storing the log from the checkpoint. It can never resume from the "OOM LSN".
            • Each batch will scan the log to the end. I did not get my optimization of this to work; some DDL crash recovery tests would occasionally fail.

            Marko Mäkelä added a comment:

            Unfortunately, both the 10.8 and 10.6 versions of this are acting up. One test that occasionally fails is innodb.alter_copy. I have not yet tested whether that test is stable in the original development version. In 11.0 the buffer pool and recovery were simplified thanks to MDEV-29694 (removing the InnoDB change buffer). That could play a role here.
            Marko Mäkelä added a comment (edited):

            I got an rr replay trace of the test innodb.alter_copy from the 10.8 version where fil_validate() fails during IORequest::write_complete(), on a tablespace that is in the process of being created in recv_sys_t::recover_deferred(). I think that we must protect that better by holding fil_system.mutex until the tablespace has been fully created. This was fixed in MDEV-31080.

            Marko Mäkelä added a comment:

            I was asked to improve the wording of the "Recovery ran out of memory at LSN" message. I think that the following should be more informative:

            [Note] InnoDB: Multi-batch recovery needed at LSN 4189599815
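For anyone watching the error log to see when recovery switches to multiple batches, the proposed message could be picked out as sketched below. This assumes the wording quoted in the comment is what ends up in the log; the function and pattern names are made up for illustration and are not a documented interface.

```python
import re
from typing import Optional

# Pattern for the message wording proposed in the comment. Treat the exact
# text as an assumption; it may differ in a released server.
MULTI_BATCH_RE = re.compile(r"InnoDB: Multi-batch recovery needed at LSN (\d+)")


def multi_batch_lsn(line: str) -> Optional[int]:
    """Return the LSN from a multi-batch recovery note, or None if absent."""
    match = MULTI_BATCH_RE.search(line)
    return int(match.group(1)) if match else None


line = "[Note] InnoDB: Multi-batch recovery needed at LSN 4189599815"
print(multi_batch_lsn(line))  # 4189599815
```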

            Marko Mäkelä added a comment:

            There is not enough time to test this thoroughly before the upcoming quarterly releases.

            With the current fix that I have for 10.6 (and which might be feasible to port to 10.5 as well), some excessive or unnecessary log scanning will take place. When the buffer pool is small, the log reads will basically be O(n²) instead of O(n). This will be better in the version that uses the MDEV-14425 log format.
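The quadratic cost can be illustrated with a toy model. It is a simplification under stated assumptions, not a measurement of InnoDB: the log is counted in abstract "records", the buffer pool holds a fixed number of records per batch, and the two functions are hypothetical stand-ins for "each batch scans the log to the end" and "each batch resumes from the OOM LSN".

```python
import math


def log_reads_rescan(log_records: int, batch_capacity: int) -> int:
    """Model: every batch scans the whole log, from checkpoint to end."""
    batches = math.ceil(log_records / batch_capacity)
    return batches * log_records


def log_reads_resume(log_records: int, batch_capacity: int) -> int:
    """Model: each batch resumes where the previous one ran out of memory,
    so every record is read once regardless of batch_capacity."""
    return log_records


# With a fixed, small batch capacity, doubling the log size quadruples the
# reads in the rescan model but only doubles them in the resume model.
for n in (1_000, 2_000, 4_000):
    print(n, log_reads_rescan(n, 100), log_reads_resume(n, 100))
# 1000 10000 1000
# 2000 40000 2000
# 4000 160000 4000
```

In this model the rescan cost is roughly n²/k for a log of n records and a batch capacity of k, which is why a small buffer pool makes the effect worse.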

            People

              marko Marko Mäkelä
              Faisal Faisal Saeed (Inactive)