[MDEV-18959] Engine transaction recovery through persistent binlog Created: 2019-03-18 Updated: 2024-01-18 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Replication, Server |
| Fix Version/s: | 11.6 |
| Type: | New Feature | Priority: | Critical |
| Reporter: | Andrei Elkin | Assignee: | Brandon Nesterenko |
| Resolution: | Unresolved | Votes: | 4 |
| Labels: | groupcommit, recovery | ||
| Issue Links: |
|
| Description |
|
A de-facto present recovery-related requirement of two calls of fsync() at
When a trx is deposited into an fsynced binlog file its image
As long as Engine maintains its last committed in binlog order durable
The existing binlog checkpoint mechanism will continue to serve to
In the light of _MDEV-16589 sync_binlog = 1_ performance becomes a more concern. At the same time innodb_flush_log_at_trx_commit = 0 still allows for recovery (though to be
To the refined recovery, it needs to know engines involved in a transaction in doubt. Unlike all other cases in case of the single Innodb engine transaction
The recovery procedure follows most of the conventional one's steps and adds up
For the multiple engine and not-Innodb cases the property of involved engines can be
When there's an engine that does not track this transaction can't be re-executed, otherwise
For re-execution consider MDEV-21469 as a template. MIXED binlog format guarantees re-execution |
| Comments |
| Comment by Marko Mäkelä [ 2019-03-18 ] | |||||||||||||||||||||||||||||||||||||||||||
|
As far as I understand, if sync_binlog=1, at transaction commit we could skip not only the fsync() call for the InnoDB redo log files, but also the call log_write_up_to(mtr.commit_lsn()). That is, we could group all writes from the log_sys buffer to the InnoDB redo log files in bigger batches. Furthermore, my understanding is that the internal use of 2-phase commit (XA distributed transactions) can be removed in this case. That mechanism would only be needed when XA START/END/PREPARE/COMMIT/ROLLBACK statements are being issued from SQL. The fsync() in InnoDB would still be needed for preventing harmful reordering of writes (to stick to write-ahead logging). The primary mechanisms for driving that should be redo log checkpoints and dirty page replacement in the buffer pool. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2019-04-02 ] | |||||||||||||||||||||||||||||||||||||||||||
|
This MDEV would implement a requirement of MDEV-16589. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-05-03 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I wonder whether we need innobase_flush_logs() or handlerton::flush_logs at all. In InnoDB, this function invokes log_buffer_flush_to_disk(), which in turn initiates a write of all buffered redo log to the log files, instead of merely flushing the log up to the state change of the current transaction (trx->commit_lsn), which is what trx_flush_log_if_needed() should have written already. All this code should be reviewed and cleaned up as part of this task. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Sujatha Sivakumar (Inactive) [ 2019-11-19 ] | |||||||||||||||||||||||||||||||||||||||||||
|
sysbench 1.1.0-1327e79 Sysbench commands:
(1,0), (1,1) and (0,1) correspond to 'innodb_flush_log_at_trx_commit' and 'sync_binlog' respectively. SYSBENCH Results
The higher the binlog group commit rate, the better the new proposal will fare. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-05-04 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I believe that the correct operation of this change depends on the ability of RESET MASTER to reset the binlog position that is persisted in InnoDB ( | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2020-05-04 ] | |||||||||||||||||||||||||||||||||||||||||||
|
That's correct, Marko. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-07-07 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Alibaba seems to have implemented something similar: 云数据库 RDS > AliSQL 内核 > Binlog in Redo
Google translation:
They introduced 2 parameters to control this:
The main difference from this task is that the binlog is (almost) guaranteed to lag behind the InnoDB redo log at all times. MDEV-18959 aims to guarantee that the redo log is never ahead of the binlog. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-03-24 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Here, a set of thoughts/suggestions:
| |||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2021-04-16 ] | |||||||||||||||||||||||||||||||||||||||||||
|
serg, thanks for the constructive feedback. To
though, flush_log_at_trx_commit = 0|2 actually must increment binlog_pos (I believe it does so currently), just not as eagerly as the value 1 does, so that it reflects the last safely/persistently committed transaction. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2021-09-20 ] | |||||||||||||||||||||||||||||||||||||||||||
|
While debugging MDEV-26603, I was reminded again that the XA PREPARE step that is internally (mis)used by the binlog (using an internally generated MySQLXID identifier) will require an fsync() or fdatasync() operation inside InnoDB. I think that when a single storage engine is being used, we must replace the internal 3-phase commit mechanism with 2-phase commit. We only have to ensure that everything up to the commit has been durably written to the binlog before a (normal) commit is written to the engine log. Edit: The main point of this task is to ensure that only one log write (the binlog) needs to be durable. To achieve acceptable performance, I think that we’d want something similar to This could require rewriting the current group commit logic, and turning the current binlog/InnoDB notification mechanism ‘upside down’. The current mechanism is known to be incorrect, as reported in MDEV-25611. | |||||||||||||||||||||||||||||||||||||||||||
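The ordering constraint stated above (the binlog is made durable first, and only then is a normal commit written to the engine log) can be illustrated with a toy model. This is only a sketch under stated assumptions: `CommitPipeline` and its methods are hypothetical names, plain Python lists stand in for the binlog and the engine log, and "sync" is modeled as moving entries from a written list to a durable list. It is not MariaDB or InnoDB code.

```python
class CommitPipeline:
    """Toy model: the binlog is the only log made durable at commit time."""

    def __init__(self):
        self.binlog_written = []   # appended to the binlog, not yet synced
        self.binlog_durable = []   # binlog entries that survived a sync
        self.engine_log = []       # engine commit records (never synced at commit here)

    def sync_binlog(self):
        # Stand-in for one fsync()/fdatasync() of the binlog file.
        self.binlog_durable.extend(self.binlog_written)
        self.binlog_written.clear()

    def group_commit(self, xids):
        # 1. Write the whole group to the binlog and make it durable with one sync.
        self.binlog_written.extend(xids)
        self.sync_binlog()
        # 2. Only now write the engine commit records, without an engine sync.
        for xid in xids:
            assert xid in self.binlog_durable
            self.engine_log.append(xid)

    def engine_never_ahead(self):
        # Invariant: every engine commit is already durable in the binlog.
        return set(self.engine_log) <= set(self.binlog_durable)
```

With this ordering, one sync covers a whole commit group, and a crash can leave the engine log behind the binlog but never ahead of it.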
| Comment by Andrei Elkin [ 2022-10-13 ] | |||||||||||||||||||||||||||||||||||||||||||
|
marko, as discussed on Slack, two major issues arise in your comments. This ticket is about roll-forward recovery (subject #1), which implies (subject #2) the question of how to find the last stably committed transaction, so that the one following it is the first to start the roll-forward "replay". Regarding #1 and your 'misuse' qualification: identification by XID at trx prepare is still not a bad idea, as the prepared trx:s might need just the commit decision/operation for their roll-forward (otherwise it would be a full trx replay). The engine prepare does not require `fsync()` under #1. | |||||||||||||||||||||||||||||||||||||||||||
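The roll-forward idea in this comment can be sketched as follows. This is a minimal illustration, not the patch under discussion: `BinlogEntry`, `Engine`, and `roll_forward` are hypothetical names, the binlog is modeled as an in-memory list assumed to be fully durable, and the engine is assumed to remember the binlog position of its last durably committed transaction.

```python
from dataclasses import dataclass, field


@dataclass
class BinlogEntry:
    pos: int        # position of the transaction in the durable binlog
    xid: int        # identifier assigned at trx prepare
    changes: dict   # key -> value image of the transaction


@dataclass
class Engine:
    # Hypothetical model of the engine state that survived the crash.
    data: dict = field(default_factory=dict)
    last_committed_pos: int = 0                    # last durable commit, in binlog order
    prepared: dict = field(default_factory=dict)   # xid -> changes already in the engine log

    def commit_prepared(self, xid, pos):
        # The changes are already in the engine; only the commit decision is missing.
        self.data.update(self.prepared.pop(xid))
        self.last_committed_pos = pos

    def replay(self, entry):
        # Nothing of this transaction survived in the engine: full re-execution.
        self.data.update(entry.changes)
        self.last_committed_pos = entry.pos


def roll_forward(engine, binlog):
    """Bring the engine up to the end of the durable binlog."""
    actions = []
    for entry in sorted(binlog, key=lambda e: e.pos):
        if entry.pos <= engine.last_committed_pos:
            continue  # already durably committed in the engine
        if entry.xid in engine.prepared:
            engine.commit_prepared(entry.xid, entry.pos)
            actions.append(("commit", entry.xid))
        else:
            engine.replay(entry)
            actions.append(("replay", entry.xid))
    return actions
```

The sketch shows why keeping the XID at prepare is useful: a prepared transaction found in the binlog needs only a commit, while a transaction unknown to the engine needs a full replay.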
| Comment by Marko Mäkelä [ 2022-10-26 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Somewhat related to this and https://smalldatum.blogspot.com/2022/10/early-lock-release-and-innodb.html I dug up the commit that imported the InnoDB revision history when it had been maintained separately from MySQL. The log can be viewed with the following command:
The log includes an interesting change: Enable group commit functionality. I think that some cleanup (review and removal) of this kind of code needs to be part of this task. | |||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-10-09 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Last week, I discussed an alternative solution with knielsen: implement an API that allows a storage engine to durably write binlog. For InnoDB, this would involve buffering a page-oriented binlog in the buffer pool and using the normal write-ahead logging mechanism. Over the weekend, I realized that in case the binlog is written strictly append-only, there is no need to introduce any additional page framing, checksums, or fields like FIL_PAGE_LSN. Not having a field like FIL_PAGE_LSN means that recovery will ‘blindly’ execute any recovered binlog writes to the file even though the data might already have been written. A further optimization might be that instead of writing the binlog via the InnoDB buffer pool, we could write it roughly in the current way, but with a few additions:
A major benefit of this approach is that it is possible to get the binlog and the InnoDB transactions completely consistent with each other, even when there are no fdatasync() calls at all during normal operation. Around InnoDB log checkpoints they are unavoidable, to ensure the correct ordering of page and log writes. | |||||||||||||||||||||||||||||||||||||||||||
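The remark that recovery may 'blindly' re-execute recovered binlog writes rests on the append-only property: writing the same bytes at the same offset a second time leaves the file unchanged. The sketch below illustrates this, modeling the binlog file as a `bytearray` and the recovered log records as hypothetical `(offset, payload)` pairs; neither is an actual InnoDB structure.

```python
def apply_binlog_writes(binlog_file, records):
    """Re-apply recovered (offset, payload) writes without checking
    whether they had already reached the file before the crash."""
    for offset, payload in records:
        end = offset + len(payload)
        if len(binlog_file) < end:
            # Extend the file so that the write fits (append case).
            binlog_file.extend(b"\0" * (end - len(binlog_file)))
        binlog_file[offset:end] = payload
    return binlog_file


# The first two events reached the file before the crash; the third did not.
records = [(0, b"event-1;"), (8, b"event-2;"), (16, b"event-3;")]
on_disk = bytearray(b"event-1;event-2;")
recovered = apply_binlog_writes(on_disk, records)
```

Because each record carries its own offset, applying the same records a second time yields the same file, so no per-page LSN is needed to decide whether a write was already applied.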
| Comment by Brandon Nesterenko [ 2023-10-23 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Thanks for the ideas marko! I have to review the existing patch more in-depth (which was originally authored by Sachin and Sujatha), but from my understanding, it is similar to your suggestion of
I'll review with a closer eye to that point. Perhaps we can create individual follow-up JIRAs for the other optimization suggestions. |