Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-24449

Corruption of system tablespace or last recovered page

Details

    Description

      On Buildbot, we have a few cases of assertion failures in mariabackup --prepare, because buf_pool.n_pend_reads is not zero. mleich was able to reproduce the error under rr record, and I analyzed a case which is reported in MDEV-24448.

      In the analyzed trace, we are recovering 4 page reads. The read callback for 3 of the pages was finished (with no log to be applied, because the pages were up to date), but for the last page, the read callback was still in progress. Yet, srv_start() would concurrently initiate shutdown. We failed to wait for all asynchronous reads to complete. This omission could cause the failure to apply log to some data pages.

      While this error started to occur on Buildbot after we replaced the InnoDB mutexes with native ones (MDEV-21452), I think that it is possible in 10.5, and probably in older versions as well, even though the code was heavily refactored in 10.5, starting with MDEV-16264. The earliest known failure for the 10.6 main branch was as follows:

      10.6 cf2480dd77a45cf56389015a1eabf567

      mariabackup.xb_fulltext_encrypted        w2 [ fail ]
              Test ended at 2020-12-16 04:19:07
       
      CURRENT_TEST: mariabackup.xb_fulltext_encrypted
      mysqltest: At line 19: exec of '/mnt/buildbot/build/mariadb-10.6.0/extra/mariabackup/mariabackup --innobackupex --apply-log /mnt/buildbot/build/mariadb-10.6.0/mysql-test/var/2/tmp/backup 2>&1' failed, error: 34304, status: 134, errno: 11
      …
      2020-12-16  4:19:05 0 [Note] InnoDB: Starting final batch to recover 7 pages from redo log.
      mariabackup: /home/buildbot/buildbot/build/mariadb-10.6.0/storage/innobase/srv/srv0start.cc:1582: dberr_t srv_start(bool): Assertion `!buf_pool.any_io_pending()' failed.
      

      I think that there are two causes to this bug in 10.5 and 10.6:

      • io_callback() was invoking read_slots->release() before invoking the callback function (for write callbacks the existing order is OK)
      • recv_sys_t::apply() is missing a call to read_slots->wait() before flushing the buffer pool. The recv_sys.apply(true) was called by srv_start(), so this should fix the observed race condition (MDEV-9663).

      Attachments

        Issue Links

          Activity

            marko Marko Mäkelä created issue -
            marko Marko Mäkelä made changes -
            Field Original Value New Value
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Component/s Storage Engine - InnoDB [ 10129 ]
            Summary mariabackup --prepare may fails to apply some log Crash recovery or mariabackup --prepare may fail to apply some log
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Labels corruption rr-profile-analyzed rr-profile-analyzed
            marko Marko Mäkelä made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Affects Version/s 5.5 [ 15800 ]
            Affects Version/s 10.1 [ 16100 ]
            Affects Version/s 10.2 [ 14601 ]
            Affects Version/s 10.3 [ 22126 ]
            Affects Version/s 10.4 [ 22408 ]
            Affects Version/s 10.0 [ 16000 ]
            Affects Version/s 10.5 [ 23123 ]
            Affects Version/s 10.6 [ 24028 ]
            Assignee Marko Mäkelä [ marko ] Thirunarayanan Balathandayuthapani [ thiru ]
            Description Note: I did not analyze this in older versions than 10.5. On Buildbot, we have a few cases of assertion failures in {{mariabackup --prepare}}, because {{buf_pool.n_pend_reads}} is not zero. [~mleich] was able to reproduce the error under {{rr record}}, and I analyzed a case.

            In the analyzed trace, we are recovering 4 page reads. The read callback for 3 of the pages was finished (with no log to be applied, because the pages were up to date), but for the last page, the read callback was still in progress. Yet, {{srv_start()}} would concurrently initiate shutdown. We failed to wait for all asynchronous reads to complete. This omission could cause the failure to apply log to some data pages.

            While this error started to occur on Buildbot after we replaced the InnoDB mutexes with native ones (MDEV-21452), I think that it is possible in 10.5, and probably in older versions as well, even though the code was heavily refactored in 10.5, starting with MDEV-16264. The [earliest known failure for the 10.6 main branch|http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/26092] was as follows:
            {noformat:title=10.6 cf2480dd77a45cf56389015a1eabf567}
            mariabackup.xb_fulltext_encrypted w2 [ fail ]
                    Test ended at 2020-12-16 04:19:07

            CURRENT_TEST: mariabackup.xb_fulltext_encrypted
            mysqltest: At line 19: exec of '/mnt/buildbot/build/mariadb-10.6.0/extra/mariabackup/mariabackup --innobackupex --apply-log /mnt/buildbot/build/mariadb-10.6.0/mysql-test/var/2/tmp/backup 2>&1' failed, error: 34304, status: 134, errno: 11
            …
            2020-12-16 4:19:05 0 [Note] InnoDB: Starting final batch to recover 7 pages from redo log.
            mariabackup: /home/buildbot/buildbot/build/mariadb-10.6.0/storage/innobase/srv/srv0start.cc:1582: dberr_t srv_start(bool): Assertion `!buf_pool.any_io_pending()' failed.
            {noformat}

            I think that there are two causes to this bug in 10.5 and 10.6:
            * {{io_callback()}} was invoking {{read_slots->release()}} before invoking the callback function (for write callbacks the existing order is OK)
            * {{recv_sys_t::apply()}} is missing a call to {{read_slots->wait()}} before flushing the buffer pool. The {{recv_sys.apply(true)}} was called by {{srv_start()}}, so this should fix the observed race condition.
            On Buildbot, we have a few cases of assertion failures in {{mariabackup --prepare}}, because {{buf_pool.n_pend_reads}} is not zero. [~mleich] was able to reproduce the error under {{rr record}}, and I analyzed a case which is reported in MDEV-24448.

            In the analyzed trace, we are recovering 4 page reads. The read callback for 3 of the pages was finished (with no log to be applied, because the pages were up to date), but for the last page, the read callback was still in progress. Yet, {{srv_start()}} would concurrently initiate shutdown. We failed to wait for all asynchronous reads to complete. This omission could cause the failure to apply log to some data pages.

            While this error started to occur on Buildbot after we replaced the InnoDB mutexes with native ones (MDEV-21452), I think that it is possible in 10.5, and probably in older versions as well, even though the code was heavily refactored in 10.5, starting with MDEV-16264. The [earliest known failure for the 10.6 main branch|http://buildbot.askmonty.org/buildbot/builders/kvm-fulltest2/builds/26092] was as follows:
            {noformat:title=10.6 cf2480dd77a45cf56389015a1eabf567}
            mariabackup.xb_fulltext_encrypted w2 [ fail ]
                    Test ended at 2020-12-16 04:19:07

            CURRENT_TEST: mariabackup.xb_fulltext_encrypted
            mysqltest: At line 19: exec of '/mnt/buildbot/build/mariadb-10.6.0/extra/mariabackup/mariabackup --innobackupex --apply-log /mnt/buildbot/build/mariadb-10.6.0/mysql-test/var/2/tmp/backup 2>&1' failed, error: 34304, status: 134, errno: 11
            …
            2020-12-16 4:19:05 0 [Note] InnoDB: Starting final batch to recover 7 pages from redo log.
            mariabackup: /home/buildbot/buildbot/build/mariadb-10.6.0/storage/innobase/srv/srv0start.cc:1582: dberr_t srv_start(bool): Assertion `!buf_pool.any_io_pending()' failed.
            {noformat}

            I think that there are two causes to this bug in 10.5 and 10.6:
            * {{io_callback()}} was invoking {{read_slots->release()}} before invoking the callback function (for write callbacks the existing order is OK)
            * {{recv_sys_t::apply()}} is missing a call to {{read_slots->wait()}} before flushing the buffer pool. The {{recv_sys.apply(true)}} was called by {{srv_start()}}, so this should fix the observed race condition (MDEV-9663).
            Labels rr-profile-analyzed corruption not-10.5
            Summary Crash recovery or mariabackup --prepare may fail to apply some log Potential corruption of last recovered page
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            Assignee Thirunarayanan Balathandayuthapani [ thiru ] Marko Mäkelä [ marko ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            Summary Potential corruption of last recovered page Corruption of system tablespace or last recovered page
            marko Marko Mäkelä made changes -
            issue.field.resolutiondate 2020-12-28 10:36:40.0 2020-12-28 10:36:40.204
            marko Marko Mäkelä made changes -
            Fix Version/s 10.2.37 [ 25112 ]
            Fix Version/s 10.3.28 [ 25111 ]
            Fix Version/s 10.4.18 [ 25110 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            ccalender Chris Calender (Inactive) made changes -
            Labels corruption not-10.5 CS0180575 corruption not-10.5
            julien.fritsch Julien Fritsch made changes -
            marko Marko Mäkelä made changes -
            mpflaum Maria M Pflaum (Inactive) made changes -
            Labels CS0180575 corruption not-10.5 CS0156770 CS0180575 corruption not-10.5
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels CS0156770 CS0180575 corruption not-10.5 CS0156770 CS0180575 ServiceNow corruption not-10.5
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            Labels CS0156770 CS0180575 ServiceNow corruption not-10.5 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z CS0156770 CS0180575 corruption not-10.5
            serg Sergei Golubchik made changes -
            Labels 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z CS0156770 CS0180575 corruption not-10.5 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z corruption not-10.5
            serg Sergei Golubchik made changes -
            Labels 76qDvLB8Gju6Hs7nk3VY3EX42G795W5z corruption not-10.5 corruption not-10.5
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 117227 ] MariaDB v4 [ 158711 ]
            marko Marko Mäkelä made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            marko Marko Mäkelä made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            marko Marko Mäkelä made changes -
            rob.schwyzer@mariadb.com Rob Schwyzer (Inactive) made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 201658 171635
            Zendesk active tickets 201658

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.