MariaDB Server — MDEV-18959

Engine transaction recovery through persistent binlog

Details

    Description

      The de-facto recovery requirement that the engine call fsync() twice per
      transaction, at prepare and at commit, can be relaxed by replacing the
      first fsync() with a group fsync of the binlog. When the binlog is turned
      ON, transactions are group-committed/prepared, so a single fsync() per
      group resolves optimization requests such as MDEV-11376.

      When a transaction is deposited into an fsynced binlog file, its image,
      consisting of the XID and the payload, suffices for its recovery.
      Specifically, the payload can be used to replay the transaction should it
      have missed the engine's write to disk.

      As long as the engine tracks its last durably committed transaction in
      binlog order, all transactions found in the binlog after that one upon a
      crash are regarded as lost and can be restored by re-applying their
      payload, that is, their binlogged replication events.

      The existing binlog checkpoint mechanism will continue to serve to
      limit the set of binlog files needed for recovery.

      In light of MDEV-16589 (sync_binlog = 1), performance becomes more of a
      concern. MDEV-24386 shows up to 3 times higher latency and halved
      throughput with the new default value combined with the unchanged default
      of innodb_flush_log_at_trx_commit = 1.

      At the same time, innodb_flush_log_at_trx_commit = 0 still allows for
      recovery (though it needs to be extended), and further benchmarking
      (sysbench4.pdf of MDEV-24386) suggests that the latency and throughput of
      (B = 1, I = 0) may be even better than those of (B = 0, I = 1), the
      current (10.5) default.
      Here B stands for sync_binlog, I for innodb_flush_log_at_trx_commit.

      The refined recovery needs to know which engines are involved in an
      in-doubt transaction; specifically, whether all of them maintain the last
      committed transaction's binlog offset in their persistent metadata.
      InnoDB, for instance, does so. This piece of information is crucial
      because at recovery the engine may have the transaction, or its branch,
      either a) already committed or b) not even prepared, and which of the two
      is the case can be resolved only with "external" help such as the
      tracking facility: when the transaction starts in the binlog at an offset
      greater than the one the engine remembers for its last commit, the
      transaction is obviously not yet committed.

      Unlike all other cases, for a transaction involving only the single
      InnoDB engine there is no need to specify the engine explicitly in the
      transaction's binlog events.

      The recovery procedure follows most of the conventional one's steps and
      adds the following rule, simplified here to a single engine:

      when a transaction updates an engine that tracks the binlog offset of its
       commits, and the transaction's binlog offset is greater than that of the
       last transaction committed in the engine, then the transaction is to be
       re-executed (unless it is already prepared, in which case it is to
       commit by the regular rules).
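The single-engine rule above can be sketched as follows. This is purely illustrative; all names (recover_single_engine, prepared_xids, etc.) are hypothetical and not the server's actual recovery code:

```python
# Sketch of the single-engine roll-forward rule: transactions at or below
# the engine's last committed binlog offset are already durable; later ones
# are either committed (if prepared) or replayed from their binlogged payload.

def recover_single_engine(binlog_trxs, engine_last_committed_offset,
                          prepared_xids, replay, commit):
    """binlog_trxs: iterable of (binlog_offset, xid, payload) in binlog order."""
    for offset, xid, payload in binlog_trxs:
        if offset <= engine_last_committed_offset:
            continue            # already durably committed in the engine
        if xid in prepared_xids:
            commit(xid)         # prepared in the engine: regular rules apply
        else:
            replay(payload)     # lost: re-execute the binlogged events
```

The binlog-checkpoint mechanism bounds how many files such a scan must cover.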
      
      

      For the multi-engine and non-InnoDB cases, the set of involved engines
      can be specified through an extended Gtid_log_event. Consider a bitmap
      with the bits mapped to the engines on the local server.
      The mapping is local to the server, so it merely has to be stable across
      crashes. Gtid_log_event remembers the engines involved (except when only
      InnoDB is involved), and at recovery the engines will be found and asked
      for the binlog offset of their last commit.
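One possible encoding of such a bitmap; the engine numbering and names here are assumptions for illustration, not the actual Gtid_log_event layout:

```python
# Hypothetical server-local engine numbering; per the text, it only needs
# to stay stable across crashes, not across servers.
ENGINE_BITS = {"innodb": 0, "rocksdb": 1, "aria": 2}

def engines_to_bitmap(engine_names):
    """Encode the set of engines a transaction touched into a bitmap."""
    bm = 0
    for name in engine_names:
        bm |= 1 << ENGINE_BITS[name]
    return bm

def bitmap_to_engines(bm):
    """Decode the bitmap back into engine names at recovery."""
    return {name for name, bit in ENGINE_BITS.items() if bm & (1 << bit)}
```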

      When an involved engine does not do this tracking, the transaction cannot
      be re-executed; otherwise the branches of the in-doubt multi-engine
      transaction are considered individually, taking into account what each
      engine branch remembers of its last commit and the transaction's binlog
      offset.
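The per-branch decision could then look like this sketch (function and outcome names are hypothetical):

```python
def branch_action(engine_tracks_offset, trx_offset, engine_last_offset,
                  prepared):
    """Decide what recovery does with one engine branch of an in-doubt trx."""
    if not engine_tracks_offset:
        return "cannot-re-execute"     # no tracking: replay would be unsafe
    if trx_offset <= engine_last_offset:
        return "already-committed"     # branch is already durable
    return "commit-prepared" if prepared else "re-execute"
```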

      For re-execution, consider MDEV-21469 as a template. The MIXED binlog
      format guarantees that re-execution repeats/reproduces the original
      changes.

      Activity


            Elkin Andrei Elkin added a comment -

            marko, as discussed on Slack, there have been two major issues raised in your comments. This ticket is about roll-forward recovery (subject #1), which implies (subject #2) finding the last stably committed transaction, so that the one following it would be the first to start the roll-forward "replay".
            By implementing the mechanisms responsible for #2, we would optimize away the complicated and troublesome binlog-background-thread and binlog-checkpoint.

            As to #1 and your 'misuse' qualification though, identification with XID at transaction prepare is still not a bad idea, as the prepared transactions might just need the commit decision/operation for their roll-forward (otherwise it would be a full transaction replay). The engine prepare does not require `fsync()` under #1.


            marko Marko Mäkelä added a comment -

            Somewhat related to this and https://smalldatum.blogspot.com/2022/10/early-lock-release-and-innodb.html, I dug up the commit that imported the InnoDB revision history from when it had been maintained separately from MySQL. The log can be viewed with the following command:

            git log --name-only 5f9ba24f91989d68ff90d453dbfbc189464b89b9^..5f9ba24f91989d68ff90d453dbfbc189464b89b9^2^
            

            The log includes an interesting change: Enable group commit functionality. I think that some cleanup (review and removal) of this kind of code needs to be part of this task.


            marko Marko Mäkelä added a comment -

            Last week, I discussed an alternative solution with knielsen: implement an API that allows a storage engine to durably write the binlog. For InnoDB, this would involve buffering a page-oriented binlog in the buffer pool and using the normal write-ahead logging mechanism.

            Over the weekend, I realized that in case the binlog is written strictly append-only, there is no need to introduce any additional page framing, checksums, or fields like FIL_PAGE_LSN. Not having a field like FIL_PAGE_LSN means that recovery will ‘blindly’ execute any recovered binlog writes to the file even though the data might already have been written.

            A further optimization might be that instead of writing the binlog via the InnoDB buffer pool, we could write it roughly in the current way, but with a few additions:

            • write the binlog also to the InnoDB redo log (in MDEV-12353 we reserved record type codes that can be used for this)
            • implement an InnoDB log_checkpoint() hook that would ensure that fdatasync() is called on the pending binlog writes that would be ‘discarded’ by the checkpoint
            • on recovery, recover the binlog to correspond to exactly the InnoDB redo log (rewrite what was missed, and truncate any extra writes)
            • use asynchronous writes rather than synchronous ones (this was found to help a lot in MDEV-23855 and MDEV-23399)

            A major benefit of this approach is that it is possible to get the binlog and the InnoDB transactions completely consistent with each other, even when there are no fdatasync() calls at all during normal operation. Around InnoDB log checkpoints they are unavoidable, to ensure the correct ordering of page and log writes.
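The recovery step in the third bullet (make the binlog correspond exactly to the redo log) could be sketched as follows, treating both as byte streams; this is purely illustrative and does not reflect InnoDB's actual redo log format or APIs:

```python
def reconcile_binlog(binlog_bytes: bytes, redo_binlog_bytes: bytes) -> bytes:
    """The redo log is authoritative: the recovered binlog is exactly
    what the redo log recorded (missed writes rewritten, extras dropped)."""
    return redo_binlog_bytes

def reconcile_tail(binlog_bytes: bytes, redo_binlog_bytes: bytes) -> bytes:
    """Variant that keeps the already-correct prefix and only patches the
    tail: rewrite what the binlog file missed, truncate any extra writes."""
    common = 0
    while (common < min(len(binlog_bytes), len(redo_binlog_bytes))
           and binlog_bytes[common] == redo_binlog_bytes[common]):
        common += 1
    return binlog_bytes[:common] + redo_binlog_bytes[common:]
```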


            bnestere Brandon Nesterenko added a comment -

            Thanks for the ideas marko! I have to review the existing patch more in depth (it was originally authored by Sachin and Sujatha), but from my understanding, it is similar to your suggestion of

            on recovery, recover the binlog to correspond to exactly the InnoDB redo log (rewrite what was missed, and truncate any extra writes)

            I'll review with a closer eye to that point.

            Perhaps we can create individual follow-up JIRAs for the other optimization suggestions.


            marko Marko Mäkelä added a comment -

            An alternative to this has been presented in MDEV-34705.


            People

              bnestere Brandon Nesterenko
              Elkin Andrei Elkin
              Votes: 4
              Watchers: 20
