MariaDB Server / MDEV-18959

Engine transaction recovery through persistent binlog

Details

    Description

      The de-facto recovery-related requirement that the engine call fsync() twice
      per transaction, once at prepare and once at commit,
      can be relaxed by replacing the first fsync() with a group fsync
      of the binlog. Since transactions are group-committed/prepared when the
      binlog is turned ON, a single fsync() per group resolves
      optimization requests such as MDEV-11376.

      Once a transaction is deposited into an fsynced binlog file, its image,
      consisting of the xid and the payload, suffices for its recovery. Specifically,
      the payload can be used to replay the transaction should its
      write in the engine never have reached disk.

      As long as the engine durably tracks its last transaction committed in binlog
      order, all transactions found in the binlog after that one are, upon a
      crash, regarded as lost and can be restored by re-applying
      their payload, that is, their binlogged replication events.

      The existing binlog checkpoint mechanism will continue to serve to
      limit the set of binlog files that recovery has to scan.

      In the light of MDEV-16589 (sync_binlog = 1 becoming the default), performance becomes more of a concern.
      MDEV-24386 shows up to 3 times higher latency and halved throughput with the new default value
      combined with the unchanged default of innodb_flush_log_at_trx_commit = 1.

      At the same time innodb_flush_log_at_trx_commit = 0 still allows for recovery (though the
      recovery procedure has to be extended), and
      further benchmarking (sysbench4.pdf of MDEV-24386) indicates that the latency and throughput of
      (B = 1, I = 0) may be even better compared to (B = 0, I = 1) of the current (10.5) default.
      Here B stands for sync_binlog, I for innodb_flush_log_at_trx_commit.

      The refined recovery needs to know which engines are involved in an in-doubt transaction,
      and specifically whether all those engines maintain the last committed transaction's binlog offset
      in their persistent metadata.
      InnoDB, for instance, does so. This piece of information is crucial because at recovery
      the engine may have the transaction or its branch
      either a) already committed or b) not even prepared, and which of the two is the case can be resolved only
      with "external" help such as the tracking facility: when the transaction starts in the binlog
      at an offset greater than the one the engine remembers for its last committed transaction,
      this transaction is obviously not yet committed.

      Unlike all other cases, a transaction involving only the single InnoDB engine
      does not need the engine specified explicitly in its
      binlog events.

      The recovery procedure follows most of the conventional one's steps and adds
      the following rule, simplified here to a single engine:

      when a transaction updates an engine that tracks the binlog offset of its commits, and
       the transaction's binlog offset is greater than that of the last committed transaction in the engine,
       then the transaction is to be re-executed (unless it is already prepared, in which case it
       commits by the regular rules).
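
      A minimal sketch of this single-engine rule, assuming a hypothetical recovery
      scanner that yields each in-doubt transaction's xid and binlog offset, and an
      engine view exposing its last committed binlog offset and its set of prepared
      xids (all names here are illustrative, not the server's actual API):

      {code:cpp}
      #include <cstdint>
      #include <set>

      // Hypothetical view of one in-doubt transaction found in the binlog.
      struct BinlogTrx {
        uint64_t binlog_offset;   // offset of the transaction's start in the binlog
        uint64_t xid;             // transaction identifier recorded in the binlog
      };

      enum class RecoveryAction { Commit, ReExecute, Nothing };

      // Decide the fate of one in-doubt transaction against a single engine.
      RecoveryAction recover_one(const BinlogTrx &trx,
                                 uint64_t engine_last_committed_offset,
                                 const std::set<uint64_t> &engine_prepared_xids)
      {
        if (engine_prepared_xids.count(trx.xid))
          return RecoveryAction::Commit;     // already prepared: commit by the regular rules
        if (trx.binlog_offset > engine_last_committed_offset)
          return RecoveryAction::ReExecute;  // lost in the engine: replay the binlogged events
        return RecoveryAction::Nothing;      // the engine already has it committed
      }
      {code}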
      
      

      For the multi-engine and non-InnoDB cases the set of involved engines can be
      specified through an extended Gtid_log_event. Consider a bitmap with the bits mapped to engines
      on that local server.
      The mapping is local to the server, so it merely has to be stable across crashes.
      Gtid_log_event remembers the engines involved (except when the only engine is
      InnoDB), and at recovery those engines are found and asked for the binlog offset of their last commit.

      When there is an engine that does not track the offset, the transaction cannot be re-executed; otherwise
      the branches of the in-doubt multi-engine transaction are considered individually, taking into account
      what each engine branch remembers of its last committed transaction and the transaction's binlog offset.
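
      A rough sketch of the engine bitmap and the per-branch check described above;
      the event field, the bit-to-engine mapping and the engine view are assumptions
      for illustration, not an existing event format:

      {code:cpp}
      #include <cstdint>
      #include <vector>

      // Hypothetical extension of Gtid_log_event: one bit per locally registered
      // engine; the bit positions only need to stay stable across crashes.
      struct GtidEnginesInfo {
        uint64_t engine_bitmap;   // bit i set => engine in local slot i participated
      };

      // Hypothetical per-engine recovery view.
      struct EngineView {
        bool     tracks_binlog_offset;          // does the engine persist the offset?
        uint64_t last_committed_binlog_offset;
      };

      // Returns false when some participating engine does not track the binlog
      // offset (then the transaction cannot be re-executed); otherwise each branch
      // is decided individually by the single-engine rule above.
      bool branches_recoverable(const GtidEnginesInfo &info,
                                const std::vector<EngineView> &engines)
      {
        for (size_t slot = 0; slot < engines.size(); slot++) {
          if (!(info.engine_bitmap & (uint64_t(1) << slot)))
            continue;                           // engine not involved in this trx
          if (!engines[slot].tracks_binlog_offset)
            return false;                       // branch cannot be decided or replayed
        }
        return true;
      }
      {code}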

      For re-execution consider MDEV-21469 as a template. The MIXED binlog format guarantees that re-execution
      repeats/reproduces the original changes.

      Attachments

        Issue Links

          Activity

            Elkin Andrei Elkin created issue -
            Elkin Andrei Elkin made changes -
            Elkin Andrei Elkin made changes -
            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            As far as I understand, if sync_binlog=1, at transaction commit we could skip not only the fsync() call for the InnoDB redo log files, but also the call log_write_up_to(mtr.commit_lsn()). That is, we could group all writes from the log_sys buffer to the InnoDB redo log files in bigger batches.

            Furthermore, my understanding is that the internal use of 2-phase commit (XA distributed transactions) can be removed in this case. That mechanism would only be needed when XA START/END/PREPARE/COMMIT/ROLLBACK statements are being issued from SQL.

            The fsync() in InnoDB would still be needed for preventing harmful reordering of writes (to stick to write-ahead logging). The primary mechanisms for driving that should be redo log checkpoints and dirty page replacement in the buffer pool.

            marko Marko Mäkelä made changes -
            NRE Projects RM_105_CANDIDATE
            Elkin Andrei Elkin added a comment - - edited

            The MDEV would implement a MDEV-16589 requirement.

            Elkin Andrei Elkin made changes -

            marko Marko Mäkelä added a comment -

            I wonder whether we need innobase_flush_logs() or handlerton::flush_logs at all. In InnoDB, this function is invoking log_buffer_flush_to_disk(), which in turn is initiating a write of all buffered redo log to the log files, instead of merely flushing the log up to the state change of the current transaction (trx->commit_lsn, which is what trx_flush_log_if_needed() should have written already).

            All this code should be reviewed and cleaned up as part of this task.

            marko Marko Mäkelä made changes -
            Elkin Andrei Elkin made changes -
            Assignee Sujatha Sivakumar [ sujatha.sivakumar ]
            ralf.gebhardt Ralf Gebhardt made changes -
            sujatha.sivakumar Sujatha Sivakumar (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            sujatha.sivakumar Sujatha Sivakumar (Inactive) added a comment -

            sysbench 1.1.0-1327e79
            Time taken to do '1000000' inserts is measured using sysbench on latest 10.5.

            Sysbench commands:

            ./src/sysbench /sysbench-master/src/lua/oltp_insert.lua --tables=1 --threads=8 --time=0  --table-size=1000000 --mysql-user=root --mysql-socket=/10.5/bld/var/tmp/mysqld.1.sock --mysql-password=root prepare
             
            ./src/sysbench /sysbench-master/src/lua/oltp_insert.lua --tables=1 --threads=8 --time=0 --events=1000000  --table-size=1000000 --mysql-user=root --mysql-socket=/10.5/bld/var/tmp/mysqld.1.sock --mysql-password=root run
            

            The value pairs (1,0), (1,1) and (0,1) give 'innodb_flush_log_at_trx_commit' and 'sync_binlog' respectively.
            "binlog_commit_wait_usec" is mentioned in table as "usec"
            "binlog_commit_wait_count" is mentioned in table as "wait_count"
            "tables_N" refers to number of sysbench tables used.
            "tables_1" corresponds to "sbtest1".
            "threads" refers to number of threads used by sysbench

            SYSBENCH Results

            Benchmark_Parameters                        | Current Default (1,0) | Crash Safe Setting (1,1) | New Proposal (0,1)
            usec_100ms_wait_count0_tables_1_threads_8   | 149.9369s | 475.4177s | 304.5749s
            usec_100ms_wait_count0_tables_1_threads_16  | 52.1699s  | 244.9183s | 147.7131s
            usec_200ms_wait_count0_tables_1_threads_8   | 167.8036s | 443.5522s | 277.0976s
            usec_200ms_wait_count0_tables_1_threads_16  | 46.8295s  | 241.5203s | 146.8888s
            usec_300ms_wait_count0_tables_1_threads_8   | 173.0177s | 477.6018s | 282.1719s
            usec_100ms_wait_count4_tables_1_threads_8   | 259.5711s | 425.3143s | 304.7635s
            usec_100ms_wait_count8_tables_1_threads_8   | 266.0714s | 464.8017s | 222.7254s
            usec_100ms_wait_count8_tables_1_threads_16  | 136.5344s | 281.8066s | 143.0373s
            usec_100ms_wait_count4_tables_2_threads_8   | 262.9354s | 461.2272s | 324.1720s

            The higher the binlog group-commit rate, the better the new proposal fares.

            Elkin Andrei Elkin made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.5 [ 23123 ]
            serg Sergei Golubchik made changes -

            marko Marko Mäkelä added a comment -

            I believe that the correct operation of this change depends on the ability of RESET MASTER to reset the binlog position that is persisted in InnoDB (MDEV-22351).

            Elkin Andrei Elkin added a comment -

            That's correct Marko.


            marko Marko Mäkelä added a comment -

            Alibaba seems to have implemented something similar: 云数据库 RDS > AliSQL 内核 > Binlog in Redo

            当事务提交时,只需要将Redo Log保存到磁盘中,从而减少一次对磁盘的操作,而Binlog文件则采用异步的方式,用单独的线程周期性的保存到磁盘中。

            Google translation:

            When the transaction is committed, only the Redo Log needs to be saved to the disk, thereby reducing the operation on the disk, and the Binlog file uses an asynchronous method and is periodically saved to the disk with a separate thread.

            They introduced 2 parameters to control this:

            • persist_binlog_to_redo: a Boolean flag to enable the functionality (to write a little more to the redo log)
            • sync_binlog_interval (default: 50ms) the time period for flushing the binlog files (which are not removed)

            The main difference from this task is that the binlog is (almost) guaranteed to lag behind the InnoDB redo log at all times. MDEV-18959 aims to guarantee that the redo log is never ahead of the binlog.

            marko Marko Mäkelä made changes -
            maxmether Max Mether made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            marko Marko Mäkelä made changes -
            Elkin Andrei Elkin made changes -
            Elkin Andrei Elkin made changes -
            Assignee Sujatha Sivakumar [ sujatha.sivakumar ] Sergei Golubchik [ serg ]
            Elkin Andrei Elkin made changes -
            Elkin Andrei Elkin made changes -
            Elkin Andrei Elkin made changes -
            serg Sergei Golubchik made changes -
            Status In Progress [ 3 ] In Review [ 10002 ]
            julien.fritsch Julien Fritsch made changes -

            serg Sergei Golubchik added a comment -

            Here, a set of thoughts/suggestions:

            • handlerton gets a new method, binlog_pos(). Returns last binlog position persistently committed.
            • If at least one engine in a transaction has binlog_pos != NULL — it means a transaction should be recovered from a binlog
            • corollary: InnoDB sets binlog_pos == NULL if flush_log_at_trx_commit is 0 or 2
            • for recovery to work every binlog event must modify at most one engine. In ROW it happens automatically, in STMT or MIXED it's not guaranteed
              • let's cover only ROW in this MDEV-18959
              • then let's cover MIXED as: mark a statement as unsafe if it affects more than one engine and at least one of those engines has binlog_pos != NULL.
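
            A minimal sketch of the proposed interface; the actual handlerton layout and
            recovery plumbing in the server differ, and the names below simply follow this
            comment rather than an existing API:

            {code:cpp}
            #include <cstdint>
            #include <cstddef>

            // Hypothetical binlog position as an engine could persist it.
            struct binlog_position {
              uint32_t file_no;   // binlog file number
              uint64_t offset;    // offset within that file
            };

            // Sketch of the proposed handlerton extension: an engine that durably tracks
            // the binlog position of its last committed transaction returns it here; an
            // engine that cannot guarantee durability of its own log at commit (e.g.
            // InnoDB with innodb_flush_log_at_trx_commit = 0 or 2, per the suggestion
            // above) would return nullptr.
            struct handlerton_sketch {
              const binlog_position *(*binlog_pos)();
            };

            // Recovery-time use: if any engine of a transaction reports a position,
            // the transaction has to be recovered from the binlog.
            static bool must_recover_from_binlog(const handlerton_sketch *const *engines,
                                                 size_t n)
            {
              for (size_t i = 0; i < n; i++)
                if (engines[i]->binlog_pos && engines[i]->binlog_pos() != nullptr)
                  return true;
              return false;
            }
            {code}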
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Andrei Elkin [ elkin ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            julien.fritsch Julien Fritsch made changes -
            Elkin Andrei Elkin added a comment -

            serg, thanks for the constructive feedback. Regarding

            InnoDB sets binlog_pos == NULL if flush_log_at_trx_commit is 0 or 2

            however, flush_log_at_trx_commit = 0|2 actually must still advance binlog_pos (I believe it does so currently), just not as eagerly as value 1 does, so that it reflects the last safely/persistently committed transaction.

            julien.fritsch Julien Fritsch made changes -
            Elkin Andrei Elkin made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.6 [ 24028 ]
            serg Sergei Golubchik made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            Elkin Andrei Elkin made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä made changes -
            marko Marko Mäkelä added a comment - - edited

            While debugging MDEV-26603, I was reminded again that the XA PREPARE step that is internally (mis)used by the binlog (using an internally generated MySQLXID identifier) will require an fsync() or fdatasync() operation inside InnoDB.

            I think that when a single storage engine is being used, we must replace the internal 3-phase commit mechanism with 2-phase commit. We only have to ensure that everything up to the commit has been durably written to the binlog before a (normal) commit is written to the engine log. Edit: The main point of this task is to ensure that only one log write (the binlog) needs to be durable.

            To achieve acceptable performance, I think that we’d want something similar to MDEV-24341 to cover binlog writes, so that no transaction commits inside InnoDB would be prematurely written to the InnoDB write-ahead log.

            This could require rewriting the current group commit logic, and turning the current binlog/InnoDB notification mechanism ‘upside down’. The current mechanism is known to be incorrect, as reported in MDEV-25611.

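
            An abstract sketch of the commit ordering this implies for the single-engine
            case, with hypothetical stand-ins for the binlog and engine hooks (the real
            group commit and InnoDB log subsystem are considerably more involved):

            {code:cpp}
            #include <vector>

            struct Transaction;                     // hypothetical transaction handle

            // Hypothetical hooks standing in for the real binlog/engine interfaces.
            void binlog_append(const Transaction *trx);
            void binlog_fsync();
            void engine_commit_no_sync(Transaction *trx);

            // One binlog group-commit round under the proposal: only the binlog write
            // is made durable; engine commit records may stay buffered and reach the
            // engine's write-ahead log in larger batches later.
            void commit_group(const std::vector<Transaction *> &group)
            {
              for (const Transaction *trx : group)
                binlog_append(trx);                 // write the group's events to the binlog

              binlog_fsync();                       // the single durable write per group

              for (Transaction *trx : group)
                engine_commit_no_sync(trx);         // engine commit record, no fsync here

              // The engine's WAL flushing / checkpointing catches up asynchronously;
              // anything lost before that is re-executed from the durable binlog.
            }
            {code}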
            marko Marko Mäkelä made changes -
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.7 [ 24805 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 93351 ] MariaDB v4 [ 131717 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.8 [ 26121 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.9 [ 26905 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 10.10 [ 27530 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.12 [ 28320 ]
            Fix Version/s 10.11 [ 27614 ]
            AirFocus AirFocus made changes -
            this transaction obviously is not yet committed.

            Unlike all other cases in case of the single Innodb engine transaction
            there is no need to specify the engine explicitly in the transaction's
            binlog events.

            The recovery procedure follows most of the conventional one's steps and adds up
            the following rule, simplified here to a single engine:

             \{noformat\}
            when a transaction updates an engine that track binlog offset of their commits and
             its binlog offset is greater than one of the last committed trx in the engine
             then the transaction is to be re\-executed (unless it's already prepared then it is to
             commit by the regular rules).

            \{noformat\}

            For the multiple engine and not\-Innodb cases the property of involved engines can be
            specified through extended {{Gtid_log_event}}. Consider a bitmap with the bits mapped to engines
            on that local server.
            The mapping is local for the server so it must be mere stable through crashes.
            Gtid_log_event remembers the engines involved (except there is only
            one Innodb) and at recovery the engines will be found and asked for the last commit binlog offset.

            When there's an engine that does not track this transaction can't be re\-executed, otherwise
            branches of the in-doubt multi-engine transaction are considered individually taking into account
            what the engine branch remembers of its last committed and the transaction binlog offset.

            For re-execution consider MDEV-21465 as a template. MIXED binlog format guarantees re\-execution
            to repeat/reproduce the original changes.
            julien.fritsch Julien Fritsch made changes -
            Description A de-facto present recovery-related requirement of two calls of {{fsync()}} at
            transaction prepare and commit by Engine per transaction
            can be relaxed in favor of replacing the first {{fsync()}} by a group\-fsync
            of Binlog. Since when Binlog is turned ON transactions
            group\-committed/prepared the only {{fsync()}} per group resolves
            optimization requests such as MDEV\-11376.

            When a trx is deposited into an fsynced binlog file its image
            consisting of xid and payload suffices for its recovery. Specifically the
            payload part can be effectively made use of to replay the transaction should
            it have missed out the Engine write to disk.

            As long as Engine maintains its last committed in binlog order durable
            transaction tracking all the transactions above the last if found in binlog upon a
            crash could are regarded as lost and be restored by re\-applying of
            their payload, that is their binlogged replication events.

            The existing binlog checkpoint mechanism will continue to serve to
            limit binlog files for recovery.

            In the light of _MDEV\-16589 sync_binlog = 1\_ performance becomes a more concern.
            MDEV\-24386 shows up to *3 times* grown latency and *halved* throughput with the new default value
            and remained default of {{innodb_flush_log_at_trx_commit = 1}}.

            At the same time {{innodb_flush_log_at_trx_commit = 0}} still allows for recovery (though to be
            extended) *and*
            further benchmarking *sysbench4.pdf* of MDEV\-24386 ensures the latency and performance of
            {{(B = 1, I = 0)}} may be even better compare to {{(B = 0, I = 1)}} of the current (10.5) default.
            Here {{B}} stands for {{sync_binlog}}, {{I}} for {{innodb_flush_log_at_trx_commit}}.

            To the refined recovery, it needs to know engines involved in a transaction in doubt.
            Specifically whether all the engines maintain the last committed transaction's binlog offset
            in their persistent metadata.
            For instance Innodb does so. This piece of info is crucial as at recovery
            the engine may have the transaction or its branch
            either a) already *committed* or b) *not even prepared* and which of the two is the case can be resolved only
            with an "external" help such as the tracking facility: when the transaction starts in binlog
            at an offset greater than that that the engine remembers of its last committed then
            this transaction obviously is not yet committed.

            Unlike all other cases in case of the single Innodb engine transaction
            there is no need to specify the engine explicitly in the transaction's
            binlog events.

            The recovery procedure follows most of the conventional one's steps and adds up
            the following rule, simplified here to a single engine:

             \{noformat\}
            when a transaction updates an engine that track binlog offset of their commits and
             its binlog offset is greater than one of the last committed trx in the engine
             then the transaction is to be re\-executed (unless it's already prepared then it is to
             commit by the regular rules).

            \{noformat\}

            For the multiple engine and not\-Innodb cases the property of involved engines can be
            specified through extended {{Gtid_log_event}}. Consider a bitmap with the bits mapped to engines
            on that local server.
            The mapping is local for the server so it must be mere stable through crashes.
            Gtid_log_event remembers the engines involved (except there is only
            one Innodb) and at recovery the engines will be found and asked for the last commit binlog offset.

            When there's an engine that does not track this transaction can't be re\-executed, otherwise
            branches of the in-doubt multi-engine transaction are considered individually taking into account
            what the engine branch remembers of its last committed and the transaction binlog offset.

            For re-execution consider MDEV-21465 as a template. MIXED binlog format guarantees re\-execution
            to repeat/reproduce the original changes.
            A de-facto present recovery-related requirement of two calls of {{fsync()}} at
            transaction prepare and commit by Engine per transaction
            can be relaxed in favor of replacing the first {{fsync()}} by a group\-fsync
            of Binlog. Since when Binlog is turned ON transactions
            group\-committed/prepared the only {{fsync()}} per group resolves
            optimization requests such as MDEV-11376.

            When a trx is deposited into an fsynced binlog file its image
            consisting of xid and payload suffices for its recovery. Specifically the
            payload part can be effectively made use of to replay the transaction should
            it have missed out the Engine write to disk.

            As long as Engine maintains its last committed in binlog order durable
            transaction tracking all the transactions above the last if found in binlog upon a
            crash could are regarded as lost and be restored by re\-applying of
            their payload, that is their binlogged replication events.

            The existing binlog checkpoint mechanism will continue to serve to
            limit binlog files for recovery.

            In the light of _MDEV-16589 sync_binlog = 1\_ performance becomes a more concern.
            MDEV-24386 shows up to *3 times* grown latency and *halved* throughput with the new default value
            and remained default of {{innodb_flush_log_at_trx_commit = 1}}.

            At the same time {{innodb_flush_log_at_trx_commit = 0}} still allows for recovery (though to be
            extended) *and*
            further benchmarking *sysbench4.pdf* of MDEV\-24386 ensures the latency and performance of
            {{(B = 1, I = 0)}} may be even better compare to {{(B = 0, I = 1)}} of the current (10.5) default.
            Here {{B}} stands for {{sync_binlog}}, {{I}} for {{innodb_flush_log_at_trx_commit}}.

            To the refined recovery, it needs to know engines involved in a transaction in doubt.
            Specifically whether all the engines maintain the last committed transaction's binlog offset
            in their persistent metadata.
            For instance Innodb does so. This piece of info is crucial as at recovery
            the engine may have the transaction or its branch
            either a) already *committed* or b) *not even prepared* and which of the two is the case can be resolved only
            with an "external" help such as the tracking facility: when the transaction starts in binlog
            at an offset greater than that that the engine remembers of its last committed then
            this transaction obviously is not yet committed.

            Unlike all other cases in case of the single Innodb engine transaction
            there is no need to specify the engine explicitly in the transaction's
            binlog events.

            The recovery procedure follows most of the conventional one's steps and adds up
            the following rule, simplified here to a single engine:

             {noformat}
            when a transaction updates an engine that track binlog offset of their commits and
             its binlog offset is greater than one of the last committed trx in the engine
             then the transaction is to be re\-executed (unless it's already prepared then it is to
             commit by the regular rules).

            {noformat}

            For the multiple engine and not\-Innodb cases the property of involved engines can be
            specified through extended {{Gtid_log_event}}. Consider a bitmap with the bits mapped to engines
            on that local server.
            The mapping is local for the server so it must be mere stable through crashes.
            Gtid_log_event remembers the engines involved (except there is only
            one Innodb) and at recovery the engines will be found and asked for the last commit binlog offset.

            When there's an engine that does not track this transaction can't be re\-executed, otherwise
            branches of the in-doubt multi-engine transaction are considered individually taking into account
            what the engine branch remembers of its last committed and the transaction binlog offset.

            For re-execution consider MDEV-21465 as a template. MIXED binlog format guarantees re\-execution
            to repeat/reproduce the original changes.
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.11 [ 27614 ]
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.11 [ 27614 ]
            Elkin Andrei Elkin added a comment -

            marko, as discussed on slack, there are two major issues raised in your comments. This ticket is about roll-forward recovery (subject #1), which in turn requires (subject #2) a way to find the last stably committed transaction, so that the transaction following it is the first one to be replayed by the roll-forward.
            Through implementing the mechanisms responsible for #2 we'd optimize away the complicated and troublesome binlog background thread and binlog checkpoint.

            To #1 and your 'misuse' qualification though, identification with XID at trx prepare is still not a bad idea, as the prepared trx:s may need only the commit decision/operation for their roll-forward (otherwise it would be a full trx replay). The engine prepare does not require `fsync()` under #1.
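
            A minimal sketch of the roll-forward decision described above, in C++-style pseudocode; every name here (BinlogTrx, engine_state_of(), replay_binlog_events(), ...) is hypothetical and is not part of any existing MariaDB API:

            {noformat}
            #include <cstdint>
            #include <vector>

            // All types and functions below are illustrative placeholders.
            struct BinlogTrx { uint64_t xid; /* plus the binlogged replication events */ };
            enum class EngineTrxState { Committed, Prepared, Missing };

            std::vector<BinlogTrx> binlog_trxs_after(uint64_t binlog_offset); // scan the binlog
            EngineTrxState engine_state_of(uint64_t xid);                     // ask the engine
            void engine_commit_by_xid(uint64_t xid);
            void replay_binlog_events(const BinlogTrx &trx);

            // For every transaction found in the binlog past the engine's last durably
            // committed binlog offset, decide between "commit only" (it is prepared in
            // the engine) and "full replay" (the engine never persisted its prepare).
            void roll_forward(uint64_t engine_last_committed_offset)
            {
              for (const BinlogTrx &trx : binlog_trxs_after(engine_last_committed_offset))
              {
                switch (engine_state_of(trx.xid))
                {
                case EngineTrxState::Committed:
                  break;                          // already durable, nothing to do
                case EngineTrxState::Prepared:
                  engine_commit_by_xid(trx.xid);  // only the commit decision is missing
                  break;
                case EngineTrxState::Missing:
                  replay_binlog_events(trx);      // full replay of the binlogged payload
                  break;
                }
              }
            }
            {noformat}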

            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.13 [ 28501 ]
            Fix Version/s 10.12 [ 28320 ]

            marko Marko Mäkelä added a comment -

            Somewhat related to this and https://smalldatum.blogspot.com/2022/10/early-lock-release-and-innodb.html, I dug up the commit that imported the InnoDB revision history from when it had been maintained separately from MySQL. The log can be viewed with the following command:

            {noformat}
            git log --name-only 5f9ba24f91989d68ff90d453dbfbc189464b89b9^..5f9ba24f91989d68ff90d453dbfbc189464b89b9^2^
            {noformat}

            The log includes an interesting change: "Enable group commit functionality". I think that some cleanup (review and removal) of this kind of code needs to be part of this task.

            bnestere Brandon Nesterenko made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 11.1 [ 28549 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.13 [ 28501 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.2 [ 28603 ]
            Fix Version/s 11.1 [ 28549 ]
            bnestere Brandon Nesterenko made changes -
            Assignee Andrei Elkin [ elkin ] Brandon Nesterenko [ JIRAUSER48702 ]
            bnestere Brandon Nesterenko made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.3 [ 28565 ]
            Fix Version/s 11.2 [ 28603 ]
            bnestere Brandon Nesterenko made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.4 [ 29301 ]
            Fix Version/s 11.3 [ 28565 ]

            marko Marko Mäkelä added a comment -

            Last week, I discussed an alternative solution with knielsen: implement an API that allows a storage engine to durably write the binlog. For InnoDB, this would involve buffering a page-oriented binlog in the buffer pool and using the normal write-ahead logging mechanism.

            Over the weekend, I realized that if the binlog is written strictly append-only, there is no need to introduce any additional page framing, checksums, or fields like FIL_PAGE_LSN. Not having a field like FIL_PAGE_LSN means that recovery will ‘blindly’ reapply any recovered binlog writes to the file, even though the data might already have been written.
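
            A minimal sketch of what such a ‘blind’ reapply could look like, assuming a hypothetical recovered redo record that carries a (file offset, payload) pair for an append-only binlog; all names below are illustrative only and are not existing InnoDB code:

            {noformat}
            #include <unistd.h>   // pwrite()
            #include <cstdint>
            #include <cstdio>

            // Hypothetical recovered redo record describing one append-only binlog write.
            struct BinlogRedoRecord
            {
              uint64_t    offset;  // byte offset in the binlog file
              const void *data;    // payload captured in the redo log
              size_t      len;
            };

            // 'Blindly' apply the record: no FIL_PAGE_LSN-style check is needed, because
            // rewriting identical bytes at the same offset of an append-only file is a
            // no-op for data that was already durable.
            bool apply_binlog_redo_record(int binlog_fd, const BinlogRedoRecord &rec)
            {
              ssize_t written = pwrite(binlog_fd, rec.data, rec.len, (off_t) rec.offset);
              if (written < 0 || (size_t) written != rec.len)
              {
                perror("pwrite");
                return false;
              }
              return true;
            }
            {noformat}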

            A further optimization might be that instead of writing the binlog via the InnoDB buffer pool, we could write it roughly in the current way, but with a few additions:

            • write the binlog also to the InnoDB redo log (in MDEV-12353 we reserved record type codes that can be used for this)
            • implement an InnoDB log_checkpoint() hook that would ensure that fdatasync() is called on the pending binlog writes that would be ‘discarded’ by the checkpoint (a sketch follows at the end of this comment)
            • on recovery, recover the binlog to correspond to exactly the InnoDB redo log (rewrite what was missed, and truncate any extra writes)
            • use asynchronous writes rather than synchronous ones (this was found to help a lot in MDEV-23855 and MDEV-23399)

            A major benefit of this approach is that it is possible to get the binlog and the InnoDB transactions completely consistent with each other, even when there are no fdatasync() calls at all during normal operation. Around InnoDB log checkpoints they are unavoidable, to ensure the correct ordering of page and log writes.
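
            To illustrate the checkpoint hook mentioned in the list above, here is a minimal sketch (under assumed, illustrative names and data structures — not an existing InnoDB interface) of fdatasync()-ing any binlog writes whose covering redo records would be discarded by an advancing checkpoint:

            {noformat}
            #include <unistd.h>   // fdatasync()
            #include <cstdint>
            #include <map>
            #include <mutex>

            // Hypothetical registry of binlog writes whose only durable copy so far is
            // in the redo log, keyed by the end LSN of the covering redo records.
            struct PendingBinlogWrite { int fd; };

            static std::mutex pending_mutex;
            static std::multimap<uint64_t /* end_lsn */, PendingBinlogWrite> pending_writes;

            // Hypothetically called from log_checkpoint() before the checkpoint LSN is
            // advanced to checkpoint_lsn: any pending binlog write covered only by redo
            // records at or below checkpoint_lsn must be made durable now, because those
            // records will no longer be replayed after the checkpoint.
            void binlog_before_checkpoint(uint64_t checkpoint_lsn)
            {
              std::lock_guard<std::mutex> guard(pending_mutex);
              auto end = pending_writes.upper_bound(checkpoint_lsn);
              for (auto it = pending_writes.begin(); it != end; ++it)
                fdatasync(it->second.fd);          // flush the binlog file data
              pending_writes.erase(pending_writes.begin(), end);
            }
            {noformat}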


            bnestere Brandon Nesterenko added a comment -

            Thanks for the ideas, marko! I have to review the existing patch (which was originally authored by Sachin and Sujatha) in more depth, but from my understanding, it is similar to your suggestion of

            on recovery, recover the binlog to correspond to exactly the InnoDB redo log (rewrite what was missed, and truncate any extra writes)

            I'll review with a closer eye to that point.

            Perhaps we can create individual follow-up JIRAs for the other optimization suggestions.

            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.5 [ 29506 ]
            Fix Version/s 11.4 [ 29301 ]
            julien.fritsch Julien Fritsch made changes -
            Issue Type Task [ 3 ] New Feature [ 2 ]
            julien.fritsch Julien Fritsch made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.6 [ 29515 ]
            Fix Version/s 11.5 [ 29506 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.7 [ 29815 ]
            Fix Version/s 11.6 [ 29515 ]
            Gosselin Dave Gosselin made changes -
            marko Marko Mäkelä made changes -

            marko Marko Mäkelä added a comment -

            An alternative to this has been presented in MDEV-34705.

            marko Marko Mäkelä made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 11.8 [ 29921 ]
            Fix Version/s 11.7 [ 29815 ]
            bnestere Brandon Nesterenko made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 11.8 [ 29921 ]

            People

              Assignee: Brandon Nesterenko (bnestere)
              Reporter: Andrei Elkin (Elkin)
              Votes: 4
              Watchers: 20
