MariaDB Server / MDEV-27053

Crash on assertion failure in btr0cur.cc - apparent index corruption

Details

    Description

      Server version: 10.5.12-MariaDB-1:10.5.12+maria~buster-log

      At some point prior to 2021-10-28 09:13 UTC, the index on the serial column of the certificates table became corrupted. An attempt to insert a new certificate caused the following error:

      Oct 28 09:13:51 2021-10-28 09:13:51 0x7ff648064700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.5.12/storage/innobase/btr/btr0cur.cc line 336
      

      This type of entry is one of the most common inserts on this database. We were unable to identify anything unusual or problematic about this specific insert. This database exists in a Galera cluster and the service accesses the cluster via ProxySQL. Once the first node failed, ProxySQL shifted to the next node and the service resent the insert. This killed the next database node as well and so on until quorum was lost and nodes began to refuse queries.

      Our cluster nodes go through a regular wipe and rebuild process. That means that state transfers are relatively common in our cluster, and we expect that this is how the corruption spread to all nodes in the cluster equally.

      To diagnose, we shut off all Galera functionality and brought up a backup of the DB on a single node. We attempted the failing insert and were able to consistently reproduce the crash with the error listed above. We have retained this backup so that the failure can be reproduced as needed. We then ran CHECK TABLE certificates, and this also caused the database to crash, but with the following error:

      2021-10-28 20:36:12 4 [ERROR] InnoDB: In pages [page id: space=30, page number=924] and [page id: space=30, page number=3830] of index `serial` of table `boulder`.`certificates` /* Partition `p_start` */
      InnoDB: broken FIL_PAGE_NEXT or FIL_PAGE_PREV links
      2021-10-28 20:36:12 4 [ERROR] InnoDB: Corruption of an index tree: table `boulder`.`certificates` /* Partition `p_start` */ index `serial`, father ptr page no 7545, child page no 924
      

      We found that running OPTIMIZE TABLE was sufficient to repair the corrupted index, and we were then able to perform the insert that had been failing. Our review of the database and logs did not reveal any data loss.
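
      For reference, a minimal sketch of the check-and-repair sequence described above, using the table named in the log output (for an InnoDB table, OPTIMIZE TABLE performs a full table rebuild, which recreates the secondary indexes):

      -- Diagnose: CHECK TABLE walks the index pages and reports broken
      -- sibling-page links to the error log.
      CHECK TABLE boulder.certificates;
      -- Repair: for InnoDB, OPTIMIZE TABLE is executed as a table rebuild
      -- (equivalent to ALTER TABLE ... FORCE), recreating the corrupted index.
      OPTIMIZE TABLE boulder.certificates;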

      At this point we have no clear idea as to what caused the index corruption or when/if it will recur. In the hope that it may help others in the community, we're opening this bug report.

          Activity

            marko Marko Mäkelä added a comment -

            The assertion fails because corruption of the doubly linked list of sibling pages was detected. Apparently the same error is reported in a friendlier way by CHECK TABLE.

            We’d better propagate the error to the client connection, instead of crashing the server. That is the topic of MDEV-13542.

            Normally, such inconsistency between linked pages is prevented by atomic mini-transactions, which combine page locking and logging. Any operation that would modify multiple pages is logged as an atomic mini-transaction. If for some reason the ib_logfile0 is not being copied in a consistent way and some logged changes are lost, you can get such corruption. (Note: Also starting up the server with innodb_force_recovery=6 or without the ib_logfile0 file is a recipe for disaster.)
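
            For example, a quick way to verify that a server has not been started in such a degraded mode (a minimal sketch):

                -- Should normally be 0; values above 0 disable parts of normal
                -- startup and recovery, and 6 skips the redo log entirely.
                SHOW GLOBAL VARIABLES LIKE 'innodb_force_recovery';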

            Because Galera has been mentioned, I would primarily suspect an error in the snapshot transfer mechanisms that were used. Any "backup" or "restore" step could potentially cause this type of corruption. Can you trace the history of this data directory in terms of such corruption?

            lneva Lex Neva added a comment -

            Hi, I work with Dan and was involved with this incident. This is a really interesting theory! I guess I had just assumed that the corruption happened and then the state transfer spread it to the rest of the cluster. It hadn't occurred to me that a state transfer could be the source of the corruption.

            We can't precisely trace the origin of this data directory, but I can tell you a couple of things about it. This data directory has been passed around from node to node, because each of our nodes is wiped and re-installed from the ground up every two weeks. We have always used mariabackup-based state transfers. Over the life of this database, no node has been restored from an external backup and joined the cluster. It is easily possible that corruption was introduced during an SST and then passed to the rest of the nodes through subsequent SSTs.

            Based on your theory, we're wondering if we should switch to the mysqldump or rsync SST method. We care quite a bit about the integrity of the data in this cluster, and much less about the speed of SSTs. Do you feel that mariabackup is risky? Could the data itself be at risk and not just the indices?

            lneva Lex Neva added a comment -

            And if a mariabackup state transfer is risky, maybe Galera should run CHECK TABLE on all tables after the SST completes, just to be safe?
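
            For example (just a sketch, using the table that was affected here), such a post-SST sanity check could look like:

                -- Run on the joiner after the state transfer completes; for an
                -- InnoDB table this walks the index pages and flags broken
                -- FIL_PAGE_NEXT/FIL_PAGE_PREV links like the ones reported above.
                CHECK TABLE boulder.certificates;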


            marko Marko Mäkelä added a comment -

            I am not deeply familiar with the Galera cluster or its snapshot transfers (SST), but I know InnoDB.

            Theoretically speaking, the mysqldump method should be safest. But it would also likely be the slowest.

            The rsync method is inherently risky if the source can theoretically be modified during the transfer. I suppose that this method would be protected by the setting innodb_disallow_writes, which causes some problems on its own (MDEV-24845). I have not checked if all relevant writes are being blocked by that setting. (There are also writes outside InnoDB.)

            The mariabackup a.k.a. mariadb-backup method is what we would recommend for backups. It is possible that you are hitting some bug in it. It is also possible that the Galera SST scripting around it is ignoring some error, and letting corruption propagate. A repeatable test case that mimics your workload (similar table structures and DML operations) would be very much appreciated.
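
            For what it's worth, the SST method a node is currently configured to use can be checked at runtime; a minimal sketch:

                -- Shows which snapshot transfer method this node would use
                -- (e.g. rsync, mariabackup, or mysqldump).
                SHOW GLOBAL VARIABLES LIKE 'wsrep_sst_method';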

            lneva Lex Neva added a comment -

            Thanks, that agrees with my impression that mysqldump is the safest. We have been using mariabackup for backups as well as SSTs. One of our learnings from this incident is that we also need to do regular mysqldump backups, because this corruption was propagated into our backups.

            A repeatable test case is not feasible, I'm afraid. The exact circumstances that introduced the corruption are unknown, and we have no clear idea when or how it was introduced. The only thing we do know is that our query pattern did not change, nor did our cluster management / SST patterns.


            marko Marko Mäkelä added a comment -

            In MDEV-24845 I wrote some notes on how I suspect that the rsync snapshot method is currently broken except when innodb_use_native_aio=0 is used.

            I believe that with a correct implementation of necessary and sufficient locks, the rsync snapshot method should be the safest and fastest. But currently we do not seem to be there.
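
            A minimal sketch of how to check that setting; note that innodb_use_native_aio is not dynamic, so changing it requires a configuration edit and a server restart:

                -- 1 = native Linux AIO (the case suspected to be unsafe for the
                -- rsync SST in MDEV-24845); 0 = simulated AIO.
                SHOW GLOBAL VARIABLES LIKE 'innodb_use_native_aio';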


            marko Marko Mäkelä added a comment -

            The error message about corrupted FIL_PAGE_NEXT or FIL_PAGE_PREV links is written to the error log by CHECK TABLE.

            The assertion that failed and caused the server crash is in the function btr_cur_latch_leaves():

                                    ut_a(btr_page_get_prev(get_block->frame)
                                         == block->page.id().page_no());
            

            I have rewritten that function in MDEV-13542, which is currently being tested. This particular assertion has been removed; the condition is not checked even in the rewritten function btr_page_get(). I think that this is acceptable, because the purpose of btr_cur_latch_leaves() is only to acquire latches on the sibling pages if those pages can be loaded. The caller would check for errors elsewhere, when actually accessing those sibling pages. I realize that I still have some work to do to ensure that such checks are in place, and to replace similar assertions in btr_discard_page() and btr_attach_half_pages() and possibly elsewhere.

            When it comes to the root cause of this corruption, I believe that the fix of MDEV-25975 made the default wsrep_sst_method=rsync actually safe to use. I think that this ticket is about corruption caused by Galera snapshot transfer, while MDEV-13542 is about preventing crashes due to corruption.


            marko Marko Mäkelä added a comment -

            If wsrep_sst_method=mariabackup was used, this corruption might be explained by MDEV-28731. The crash on this corruption was fixed by MDEV-13542.


            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: sohelpful Daniel Jeffery
              Votes: 0
              Watchers: 5
