MariaDB Server / MDEV-23806

Undo page corruption on recovery

Details

    Description

      I analyzed an rr replay trace in which an undo log page was corrupted. The contents of the undo log page had been recovered entirely from redo log records (thanks to MDEV-12699).

      It turns out that when we removed the MLOG_UNDO_ERASE_END record in MariaDB 10.3.3 in an attempt to reduce redo log volume, we incurred technical debt that came due when MDEV-12353 optimized the redo log volume further. The MDEV-12353 replacement of mlog_write_ulint() avoids logging the first bytes that were not actually changed in the page. However, because trx_undo_report_row_operation() invokes memset() without writing a redo log record for it, the page image at the time the server was killed can differ from the page image after recovery.

      To avoid this corruption, we must write redo log for the memset() operation unless the entire undo log page will be freed in the mini-transaction.
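
      The mechanism and the proposed fix can be illustrated with a toy model. This is only a sketch under stated assumptions: Page, RedoRecord, logged_write(), logged_memset(), recover() and survives_crash() are hypothetical names invented for illustration and are not InnoDB code. The model assumes a 16-byte page whose on-disk image holds stale bytes, a writer that skips logging leading bytes that already match the page, and a recovery step that replays the log on top of the on-disk image.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy model only: every type and function here is hypothetical, not InnoDB's.
constexpr std::size_t PAGE_SIZE = 16;
using Page = std::array<std::uint8_t, PAGE_SIZE>;

struct RedoRecord {
  enum Kind { WRITE, MEMSET };
  Kind kind;
  std::size_t offset;
  std::vector<std::uint8_t> bytes;  // payload, used by WRITE
  std::size_t len;                  // length, used by MEMSET
  std::uint8_t fill;                // fill byte, used by MEMSET
};
using RedoLog = std::vector<RedoRecord>;

// Write data to the page, but log only from the first byte that differs
// from the current page contents (the log-volume optimization).
void logged_write(Page& page, RedoLog& log, std::size_t offset,
                  const std::vector<std::uint8_t>& data) {
  std::size_t skip = 0;
  while (skip < data.size() && page[offset + skip] == data[skip]) ++skip;
  if (skip == data.size()) return;  // nothing changed, nothing logged
  log.push_back(RedoRecord{
      RedoRecord::WRITE, offset + skip,
      std::vector<std::uint8_t>(
          data.begin() + static_cast<std::ptrdiff_t>(skip), data.end()),
      0, 0});
  std::memcpy(page.data() + offset + skip, data.data() + skip,
              data.size() - skip);
}

// Zero-fill part of the page and write a redo record for it (the fix).
void logged_memset(Page& page, RedoLog& log, std::size_t offset,
                   std::size_t len) {
  std::memset(page.data() + offset, 0, len);
  log.push_back(RedoRecord{RedoRecord::MEMSET, offset, {}, len, 0});
}

// Recovery: replay the log on top of the last on-disk page image.
Page recover(Page page, const RedoLog& log) {
  for (const RedoRecord& r : log) {
    if (r.kind == RedoRecord::MEMSET)
      std::memset(page.data() + r.offset, r.fill, r.len);
    else
      std::memcpy(page.data() + r.offset, r.bytes.data(), r.bytes.size());
  }
  return page;
}

// Returns true if the recovered page equals the in-memory page at the kill.
bool survives_crash(bool log_the_memset) {
  Page on_disk;
  on_disk.fill(0xAA);  // stale bytes left over from earlier use of the page
  Page in_memory = on_disk;
  RedoLog log;
  if (log_the_memset)
    logged_memset(in_memory, log, 8, 8);
  else
    std::memset(in_memory.data() + 8, 0, 8);  // unlogged, as in the bug
  // The two leading zero bytes match the zeroed page, so they are not logged.
  logged_write(in_memory, log, 8, {0x00, 0x00, 0x01, 0x02});
  return recover(on_disk, log) == in_memory;
}
```

      In this model, survives_crash(false) returns false: the two leading zero bytes were never logged, so the recovered page keeps the stale 0xAA bytes at offsets 8 and 9. survives_crash(true) returns true, because replaying the memset record first restores the state that the partial write was relying on.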

      This bug is apparently very hard to hit: although MDEV-12353 introduced it as early as 10.5.2, we first saw it a week ago while testing a development version of MDEV-23399 (which changes the page flushing algorithm and could therefore affect timing).

            People

              marko Marko Mäkelä