[MDEV-162] Enhanced semisync replication - Jira

Rasmus Johansson (Inactive) created issue - 2012-02-23 13:47

Rasmus Johansson (Inactive) made changes - 2012-02-23 15:22

Field	Original Value	New Value
Assignee	Rasmus Johansson [ ratzpo ]	Kristian Nielsen [ knielsen ]

Rasmus Johansson (Inactive) made changes - 2012-02-23 15:23

Description

See discussion in comments on:
http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/comment-page-1/#comment-878447

See also the email discussion on this.

Kristian Nielsen made changes - 2012-03-07 15:17

Status

Open [ 1 ]

In Progress [ 3 ]

Kristian Nielsen made changes - 2012-03-08 10:05

Description

Enhanced semi-synchronous replication does COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).

2. Write the transaction to the binlog, flush the binlog to disk.

3. Wait for at least one slave to acknowledge the reception of the binlog
   events for the transaction.

4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.

This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
prepare_commit_mutex is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged the reception of
it. This means that if a master is completely lost, any transaction seen by
other connections will be replicated somewhere, avoiding a potential phantom
read issue.

For more discussion see

    http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

-----------------------------------------------------------------------
Implementation in MariaDB group commit:

In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).

Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
LOCK_log before the wait and wait with taking LOCK_commit_ordered until after
the wait; this allows stage (3) to run in parallel with stage (2) and (4),
while still preserving correct ordering and avoiding one stage getting ahead
of the other.

The mutexes must be chained, meaning that we must take the next lock before
releasing the previous (otherwise one group might overtake previous group,
causing incorrect ordering of events):

    ... stage (2) end ...
    lock LOCK_enhanced_semisync
    unlock LOCK_log
    ... stage (3) wait for slave ...
    lock LOCK_commit_ordered
    unlock LOCK_enhanced_semisync
    ... stage (4) begin ...

See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for
details.

Kristian Nielsen made changes - 2012-03-08 10:21

Description

Enhanced semi-synchronous replication does COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).

2. Write the transaction to the binlog, flush the binlog to disk.

3. Wait for at least one slave to acknowledge the reception of the binlog
   events for the transaction.

4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.

This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
prepare_commit_mutex is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged the reception of
it. This means that if a master is completely lost, any transaction seen by
other connections will be replicated somewhere, avoiding a potential phantom
read issue.

For more discussion see

    http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

-----------------------------------------------------------------------
Implementation in MariaDB group commit:

In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).

Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
LOCK_log before the wait and wait with taking LOCK_commit_ordered until after
the wait; this allows stage (3) to run in parallel with stage (2) and (4),
while still preserving correct ordering and avoiding one stage getting ahead
of the other.

The mutexes must be chained, meaning that we must take the next lock before
releasing the previous (otherwise one group might overtake previous group,
causing incorrect ordering of events):

    ... stage (2) end ...
    lock LOCK_enhanced_semisync
    unlock LOCK_log
    ... stage (3) wait for slave ...
    lock LOCK_commit_ordered
    unlock LOCK_enhanced_semisync
    ... stage (4) begin ...

See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for
details.

The stage (3) should be added as another kind of hook (semi-sync replication
is plugin-based using such hooks). We will use the
--rpl_semi_sync_master_wait_before_commit=1 option to enable enhanced
semi-synchronous replication, following the Google patch

    http://code.google.com/p/enhanced-semi-sync-replication/

When --rpl_semi_sync_master_wait_before_commit=1, semi-sync plugin can use the
new hook instead of the current after_commit hook.

Kristian Nielsen made changes - 2012-03-12 16:46

Description

Enhanced semi-synchronous replication does COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).

2. Write the transaction to the binlog, flush the binlog to disk.

3. Wait for at least one slave to acknowledge the reception of the binlog
   events for the transaction.

4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.

This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
prepare_commit_mutex is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged the reception of
it. This means that if a master is completely lost, any transaction seen by
other connections will be replicated somewhere, avoiding a potential phantom
read issue.

For more discussion see

    http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

-----------------------------------------------------------------------
Implementation in MariaDB group commit:

In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).

Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
LOCK_log before the wait and wait with taking LOCK_commit_ordered until after
the wait; this allows stage (3) to run in parallel with stage (2) and (4),
while still preserving correct ordering and avoiding one stage getting ahead
of the other.

The mutexes must be chained, meaning that we must take the next lock before
releasing the previous (otherwise one group might overtake previous group,
causing incorrect ordering of events):

    ... stage (2) end ...
    lock LOCK_enhanced_semisync
    unlock LOCK_log
    ... stage (3) wait for slave ...
    lock LOCK_commit_ordered
    unlock LOCK_enhanced_semisync
    ... stage (4) begin ...

See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for
details.

The stage (3) should be added as another kind of hook (semi-sync replication
is plugin-based using such hooks). We will use the
--rpl_semi_sync_master_wait_before_commit=1 option to enable enhanced
semi-synchronous replication, following the Google patch

    http://code.google.com/p/enhanced-semi-sync-replication/

When --rpl_semi_sync_master_wait_before_commit=1, semi-sync plugin can use the
new hook instead of the current after_commit hook.

-----------------------------------------------------------------------

Crash scenarios:

If a master crashes before a transaction T is written into the binlog, that
transaction will be rolled back during crash recovery upon server restart, as
normal.

If T was written (and synced) into binlog, but not yet acknowledged by any
slave, and master crashes, then T will be committed during crash recovery. In
this case, it is possible for a connection to see T committed on the master
before any slave has had time to connect to the master and receive it. Thus,
if we crash again right after crash recovery and completely lose the master,
it is possible for a connection to have seen T on the master while T is now
effectively missing from the system. To fix this, one option is to somehow
have the master wait after crash recovery for at least one slave to connect
and acknowledge all recovered commits, thus extending the semi-sync to crash
recovery phase. An alternative may be for the DBA to prevent connections to
the server after a crash until at least one slave has caught up (SHOW MASTER
STATUS on master and select master_pos_wait() on slave).

If T was acknowledged by at least one slave, then we know that T exists both
in master binlog (which is synced before sending to slaves) and slave
relay-log. Thus, when master crash recovery is done, T will be on both master
and that slave. And if we completely lose the master, T will still eventually
be applied on the slave (unless we lose both master and slave at the same
time).

If a slave crashes during the commit on master, nothing special should
happen, unless _all_ connected slaves crash, leaving the master without any
slaves connected.

In this case the situation is much as with normal semisync. Commits will be
stalled until timeout. They will be stalled a bit earlier (before InnoDB
commit rather than after), so row locks will not have been released yet -
otherwise the result is much the same. I need to check if semisync is able to
detect the TCP close from all slaves and fail faster in this case - however,
this does not help for the case when power failure takes out the slave without
any notice sent on the network.

-----------------------------------------------------------------------

Pending XID issue

One issue that needs to be dealt with is the potential deadlock described in
this bug report (point 5):

    http://bugs.mysql.com/bug.php?id=44058

The problem is that when the server wants to rotate the binlog, it takes the
LOCK_log mutex and holds it while it waits for all pending commits to
finish. But LOCK_log prevents slaves from receiving events, which prevents
slave acks, which prevents pending commits from finishing.

This can be worked around, of course - as e.g. done in the Google enhanced
semisync patch. But I do not like this work-around - it introduces even more
complication into what is already a bad design.

I would prefer to instead solve the root problem - that server needs to stall
commits when rotating the binlog. This solves a number of issues. See a
description for this here:

    https://mariadb.atlassian.net/browse/MDEV-181

Colin Charles made changes - 2012-05-16 09:47

Link

This issue relates to TODO-160 [ TODO-160 ]

Kristian Nielsen added a comment - 2012-06-22 14:11

Adding a bit to the estimate, as the semisync code has shown itself to be complex and rather bug-ridden. So better anticipate a couple extra days to deal with this.

Kristian Nielsen made changes - 2012-06-27 14:03

Status

In Progress [ 3 ]

Open [ 1 ]

Sergei Golubchik made changes - 2013-11-14 22:48

Description

Enhanced semi-synchronous replication does COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).

2. Write the transaction to the binlog, flush the binlog to disk.

3. Wait for at least one slave to acknowledge the reception of the binlog
   events for the transaction.

4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.
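
As a rough illustration of the difference between the two orderings, here is a small Python sketch. It is not server code: the names {{Step}}, {{ENHANCED}}, {{NORMAL}} and {{visible_before_ack}} are made up for this example. It only checks whether the storage engine commit (the point where the transaction becomes visible to other connections) comes before the slave acknowledgement.

```python
from enum import Enum


class Step(Enum):
    """The four COMMIT steps listed above (hypothetical names)."""
    PREPARE = 1
    BINLOG_WRITE = 2
    WAIT_SLAVE_ACK = 3
    ENGINE_COMMIT = 4


# Enhanced semi-sync: wait for the slave ack before committing in the engine.
ENHANCED = [Step.PREPARE, Step.BINLOG_WRITE, Step.WAIT_SLAVE_ACK, Step.ENGINE_COMMIT]

# Normal semi-sync: steps (3) and (4) are reversed.
NORMAL = [Step.PREPARE, Step.BINLOG_WRITE, Step.ENGINE_COMMIT, Step.WAIT_SLAVE_ACK]


def visible_before_ack(order):
    """Return True if the engine commit happens before the slave ack,
    i.e. other connections may see the transaction before any slave has it."""
    return order.index(Step.ENGINE_COMMIT) < order.index(Step.WAIT_SLAVE_ACK)


if __name__ == "__main__":
    print("normal:  ", visible_before_ack(NORMAL))
    print("enhanced:", visible_before_ack(ENHANCED))
```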

This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
{{prepare_commit_mutex}} is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged the reception of
it. This means that if a master is completely lost, any transaction seen by
other connections will be replicated somewhere, avoiding a potential phantom
read issue.

For more discussion see

    http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

h3. Implementation in MariaDB group commit

In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).

Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
{{LOCK_log}} before the wait and defer taking {{LOCK_commit_ordered}} until after
the wait; this allows stage (3) to run in parallel with stages (2) and (4),
while still preserving correct ordering and avoiding one stage getting ahead
of the other.

The mutexes must be chained, meaning that we must take the next lock before
releasing the previous (otherwise one group might overtake the previous group,
causing incorrect ordering of events):
{noformat}
    ... stage (2) end ...
    lock LOCK_enhanced_semisync
    unlock LOCK_log
    ... stage (3) wait for slave ...
    lock LOCK_commit_ordered
    unlock LOCK_enhanced_semisync
    ... stage (4) begin ...
{noformat}
See the code in {{sql/log.cc}}, {{MYSQL_BIN_LOG::trx_group_commit_leader()}} for
details.
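
The effect of the chaining can be tried out with a small thread simulation. This is only a sketch in Python, not the code from {{sql/log.cc}}: the three {{threading.Lock}} objects stand in for the server mutexes, a short sleep stands in for the slave wait, and {{run_groups}} is a hypothetical helper. Each thread plays one group commit leader, and the sketch records the order in which groups pass each stage.

```python
import random
import threading
import time

# Stand-ins for the server mutexes named in the description.
LOCK_log = threading.Lock()
LOCK_enhanced_semisync = threading.Lock()
LOCK_commit_ordered = threading.Lock()


def run_groups(n_groups, seed=0):
    """Run n_groups simulated group leaders and return the order in which
    the groups passed stage (2), stage (3) and stage (4)."""
    rng = random.Random(seed)
    delays = [rng.uniform(0.0, 0.005) for _ in range(n_groups)]
    binlog_order, ack_order, commit_order = [], [], []

    def leader(group_id):
        LOCK_log.acquire()
        binlog_order.append(group_id)        # stage (2): write group to binlog
        LOCK_enhanced_semisync.acquire()     # take the next lock first ...
        LOCK_log.release()                   # ... then release the previous one
        time.sleep(delays[group_id])         # stage (3): simulated slave wait
        ack_order.append(group_id)
        LOCK_commit_ordered.acquire()        # again: next lock before release
        LOCK_enhanced_semisync.release()
        commit_order.append(group_id)        # stage (4): engine commit
        LOCK_commit_ordered.release()

    threads = [threading.Thread(target=leader, args=(g,)) for g in range(n_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return binlog_order, ack_order, commit_order


if __name__ == "__main__":
    binlog, ack, commit = run_groups(8)
    print("binlog order:", binlog)
    print("commit order:", commit)
```

Because a group always holds the lock of its current stage until it has the lock of the next one, a later group cannot pass it, so the three recorded orders come out identical even though the simulated slave waits differ.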

The stage (3) should be added as another kind of hook (semi-sync replication
is plugin-based using such hooks). We will use the
{{--rpl_semi_sync_master_wait_before_commit=1}} option to enable enhanced
semi-synchronous replication, following the Google patch

    http://code.google.com/p/enhanced-semi-sync-replication/

When {{--rpl_semi_sync_master_wait_before_commit=1}} is set, the semi-sync plugin
can use the new hook instead of the current after_commit hook.

h3. Crash scenarios

If a master crashes before a transaction T is written into the binlog, that
transaction will be rolled back during crash recovery upon server restart, as
normal.

If T was written (and synced) into binlog, but not yet acknowledged by any
slave, and master crashes, then T will be committed during crash recovery. In
this case, it is possible for a connection to see T committed on the master
before any slave has had time to connect to the master and receive it. Thus,
if we crash again right after crash recovery and completely lose the master,
it is possible for a connection to have seen T on the master while T is now
effectively missing from the system. To fix this, one option is to somehow
have the master wait after crash recovery for at least one slave to connect
and acknowledge all recovered commits, thus extending semi-sync to the crash
recovery phase. An alternative may be for the DBA to prevent connections to
the server after a crash until at least one slave has caught up
({{SHOW MASTER STATUS}} on master and {{select master_pos_wait()}} on slave).
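
For the DBA alternative, the check comes down to comparing binlog coordinates. The Python sketch below assumes the master coordinates come from the File and Position columns of {{SHOW MASTER STATUS}}, and that the slave coordinates are expressed in terms of the master's binlog files. The helper names {{binlog_coords}} and {{slave_caught_up}} and the file names in the usage lines are hypothetical, and the sketch assumes binlog file names end in a numeric sequence number.

```python
def binlog_coords(file_name, position):
    """Turn a binlog file name and offset into a comparable tuple.

    Assumes the file name ends in '.<sequence number>', e.g. 'mysql-bin.000012'.
    """
    sequence = int(file_name.rsplit(".", 1)[1])
    return (sequence, position)


def slave_caught_up(master_file, master_pos, slave_file, slave_pos):
    """True if the slave has reached (or passed) the master's coordinates,
    where the slave coordinates refer to the master's binlog files."""
    return binlog_coords(slave_file, slave_pos) >= binlog_coords(master_file, master_pos)


if __name__ == "__main__":
    # Example values only: master position recorded right after crash recovery.
    print(slave_caught_up("mysql-bin.000012", 4096, "mysql-bin.000012", 1024))
    print(slave_caught_up("mysql-bin.000012", 4096, "mysql-bin.000012", 4096))
```

In practice {{master_pos_wait()}} on the slave does the waiting itself; the comparison above only shows what "caught up" means in terms of coordinates.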

If T was acknowledged by at least one slave, then we know that T exists both
in master binlog (which is synced before sending to slaves) and slave
relay-log. Thus, when master crash recovery is done, T will be on both master
and that slave. And if we completely lose the master, T will still eventually
be applied on the slave (unless we lose both master and slave at the same
time).
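
The three master crash cases above can be summarised as a small decision function. This is a sketch for illustration only; {{master_crash_outcome}} and {{lost_if_master_destroyed}} are hypothetical names, and the dictionary keys are chosen just for this example.

```python
def master_crash_outcome(in_binlog, acked_by_slave):
    """Fate of transaction T after a master crash, following the cases above."""
    if acked_by_slave and not in_binlog:
        # The binlog is synced before events are sent to slaves,
        # so an ack without a binlog entry cannot happen.
        raise ValueError("T cannot be acknowledged before it is in the binlog")
    if not in_binlog:
        # Case 1: rolled back during crash recovery, as normal.
        return {"master": "rolled back", "in_slave_relay_log": False}
    # Cases 2 and 3: crash recovery commits T on the master; whether a
    # slave also has T depends on whether the ack was received.
    return {"master": "committed", "in_slave_relay_log": acked_by_slave}


def lost_if_master_destroyed(in_binlog, acked_by_slave):
    """True if T could be seen on the recovered master and then vanish
    from the system when the master is lost right afterwards."""
    outcome = master_crash_outcome(in_binlog, acked_by_slave)
    return outcome["master"] == "committed" and not outcome["in_slave_relay_log"]


if __name__ == "__main__":
    for state in [(False, False), (True, False), (True, True)]:
        print(state, master_crash_outcome(*state), lost_if_master_destroyed(*state))
```

Only the middle case (written to the binlog but not yet acknowledged) is flagged, which is the window the two proposed fixes are meant to close.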

If a slave crashes during the commit on master, nothing special should
happen, unless *all* connected slaves crash, leaving the master without any
slaves connected.

In this case the situation is much as with normal semisync. Commits will be
stalled until timeout. They will be stalled a bit earlier (before InnoDB
commit rather than after), so row locks will not have been released yet -
otherwise the result is much the same. I need to check if semisync is able to
detect the TCP close from all slaves and fail faster in this case — however,
this does not help for the case when power failure takes out the slave without
any notice sent on the network.

h3. Pending XID issue

One issue that needs to be dealt with is the potential deadlock described in
this bug report (point 5):

    http://bugs.mysql.com/bug.php?id=44058

The problem is that when the server wants to rotate the binlog, it takes the
{{LOCK_log}} mutex and holds it while it waits for all pending commits to
finish. But {{LOCK_log}} prevents slaves from receiving events, which prevents
slave acks, which prevents pending commits from finishing.
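
The circular wait can be written down as a wait-for graph. The Python sketch below is only an illustration: the node labels paraphrase the paragraph above, and {{find_cycle}} and {{WAITS_FOR}} are hypothetical names, not anything in the server.

```python
def find_cycle(waits_for, start):
    """Follow 'X waits for Y' edges from start; return the cycle as a list
    (first and last element equal) or None if the chain ends."""
    path = [start]
    seen = {start}
    node = start
    while node in waits_for:
        node = waits_for[node]
        if node in seen:
            return path[path.index(node):] + [node]
        path.append(node)
        seen.add(node)
    return None


# Each key waits for its value, as described in the paragraph above.
WAITS_FOR = {
    "binlog rotation (holds LOCK_log)": "pending commits",
    "pending commits": "slave ack",
    "slave ack": "slave receiving events",
    "slave receiving events": "binlog rotation (holds LOCK_log)",
}

if __name__ == "__main__":
    print(" -> ".join(find_cycle(WAITS_FOR, "pending commits")))
```

Removing any one edge breaks the cycle; a rotation that does not need to wait for pending commits would remove the first edge.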

This can be worked around, of course, as is done e.g. in the Google enhanced
semisync patch. But I do not like this work-around: it introduces even more
complication into what is already a bad design.

I would prefer to instead solve the root problem, namely that the server needs
to stall commits when rotating the binlog. This solves a number of issues. See
a description for this here:

    https://mariadb.atlassian.net/browse/MDEV-181

Sergei Golubchik made changes - 2013-11-14 22:48

Description

Enhanced semi-synchronous replication does COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).

2. Write the transaction to the binlog, flush the binlog to disk.

3. Wait for at least one slave to acknowledge the reception of the binlog
   events for the transaction.

4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.

This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
{{prepare_commit_mutex}} is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged the reception of
it. This means that if a master is completely lost, any transaction seen by
other connections will be replicated somewhere, avoiding a potential phantom
read issue.
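
The difference in visibility can be illustrated with a small sketch. This is only a model of the step ordering described above, not server code; the step names and both helper functions are hypothetical.

```python
# Hypothetical step names mirroring steps 1-4 of the description.
PREPARE = "prepare in storage engine"
BINLOG = "write and sync binlog"
WAIT_ACK = "wait for slave ack"
ENGINE_COMMIT = "commit in storage engine"


def commit_steps(enhanced):
    """Return the COMMIT steps in order for the chosen semisync mode."""
    if enhanced:
        return [PREPARE, BINLOG, WAIT_ACK, ENGINE_COMMIT]
    # Normal semisync: steps (3) and (4) are reversed.
    return [PREPARE, BINLOG, ENGINE_COMMIT, WAIT_ACK]


def visible_before_ack(steps):
    """True if the engine commit (which makes the transaction visible
    to other connections) happens before any slave has acknowledged it."""
    return steps.index(ENGINE_COMMIT) < steps.index(WAIT_ACK)
```

With normal semisync `visible_before_ack(commit_steps(False))` is true, which is the window in which another connection can read a transaction that no slave has received yet.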

For more discussion see

    http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

h3. Implementation in MariaDB group commit

In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).

Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
{{LOCK_log}} before the wait and defer taking {{LOCK_commit_ordered}} until after
the wait; this allows stage (3) to run in parallel with stages (2) and (4),
while still preserving correct ordering and avoiding one stage getting ahead
of the other.

The mutexes must be chained, meaning that we must take the next lock before
releasing the previous one (otherwise one group might overtake the previous
group, causing incorrect ordering of events):
{noformat}
    ... stage (2) end ...
    lock LOCK_enhanced_semisync
    unlock LOCK_log
    ... stage (3) wait for slave ...
    lock LOCK_commit_ordered
    unlock LOCK_enhanced_semisync
    ... stage (4) begin ...
{noformat}
See the code in {{sql/log.cc}}, {{MYSQL_BIN_LOG::trx_group_commit_leader()}} for
details.
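
As a rough illustration of why chaining preserves ordering, here is a toy model in Python rather than the server's C++. It assumes one thread per commit group and records the order in which groups pass each stage; `run_groups` and `commit_group` are hypothetical names, and the actual wait for the slave acknowledgement is left out.

```python
import threading


def run_groups(n_groups):
    """Push n_groups commit groups through stages (2)-(4) with chained locks."""
    lock_log = threading.Lock()
    lock_enhanced_semisync = threading.Lock()
    lock_commit_ordered = threading.Lock()
    binlog_order, ack_order, commit_order = [], [], []

    def commit_group(group_id):
        lock_log.acquire()
        binlog_order.append(group_id)      # stage (2): binlog write
        lock_enhanced_semisync.acquire()   # take the next lock first ...
        lock_log.release()                 # ... then release the previous one
        ack_order.append(group_id)         # stage (3): slave ack wait would go here
        lock_commit_ordered.acquire()
        lock_enhanced_semisync.release()
        commit_order.append(group_id)      # stage (4): engine commit
        lock_commit_ordered.release()

    threads = [threading.Thread(target=commit_group, args=(i,))
               for i in range(n_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return binlog_order, ack_order, commit_order
```

Whatever order the groups happen to enter stage (2) in, the three returned lists are identical, because a group can only enter a stage after the group ahead of it has already taken the lock for the following stage.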

Stage (3) should be added as another kind of hook (semi-sync replication
is plugin-based using such hooks). We will use the
{{--rpl_semi_sync_master_wait_before_commit=1}} option to enable enhanced
semi-synchronous replication, following the Google patch

    http://code.google.com/p/enhanced-semi-sync-replication/

When {{--rpl_semi_sync_master_wait_before_commit=1}} is set, the semi-sync plugin
can use the new hook instead of the current {{after_commit}} hook.

h3. Crash scenarios

If a master crashes before a transaction T is written into the binlog, that
transaction will be rolled back during crash recovery upon server restart, as
normal.

If T was written (and synced) into binlog, but not yet acknowledged by any
slave, and master crashes, then T will be committed during crash recovery. In
this case, it is possible for a connection to see T committed on the master
before any slave has had time to connect to the master and receive it. Thus,
if we crash again right after crash recovery and completely lose the master,
it is possible for a connection to have seen T on the master while T is now
effectively missing from the system. To fix this, one option is to somehow
have the master wait after crash recovery for at least one slave to connect
and acknowledge all recovered commits, thus extending semi-sync to the crash
recovery phase. An alternative may be for the DBA to prevent connections to
the server after a crash until at least one slave has caught up
({{SHOW MASTER STATUS}} on master and {{select master_pos_wait()}} on slave).

If T was acknowledged by at least one slave, then we know that T exists both
in master binlog (which is synced before sending to slaves) and slave
relay-log. Thus, when master crash recovery is done, T will be on both master
and that slave. And if we completely lose the master, T will still eventually
be applied on the slave (unless we lose both master and slave at the same
time).
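
The three master-crash cases above can be summarised as a small decision function. This is a sketch of the reasoning only; the function name and the result keys are hypothetical and do not correspond to anything in the server.

```python
def outcome_after_master_crash(in_binlog, acked_by_slave):
    """Describe what happens to transaction T if the master crashes.

    in_binlog:       T was written and synced to the master binlog.
    acked_by_slave:  at least one slave acknowledged receiving T.
    """
    if acked_by_slave and not in_binlog:
        # The binlog is synced before events are sent to slaves,
        # so this combination cannot occur.
        raise ValueError("a slave cannot ack a transaction missing from the binlog")
    if not in_binlog:
        # Rolled back during crash recovery, as normal.
        return {"committed_on_master": False, "guaranteed_on_slave": False}
    if not acked_by_slave:
        # Committed by crash recovery, but possibly on no slave yet:
        # T is lost if the master is lost again before a slave fetches it.
        return {"committed_on_master": True, "guaranteed_on_slave": False}
    # In the master binlog and in a slave relay log.
    return {"committed_on_master": True, "guaranteed_on_slave": True}
```

Only the middle case, committed on the master but not guaranteed on any slave, needs the extra handling discussed above.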

If a slave crashes during the commit on master, nothing special should
happen, unless *all* connected slaves crash, leaving the master without any
slaves connected.

In this case the situation is much as with normal semisync. Commits will be
stalled until timeout. They will be stalled a bit earlier (before InnoDB
commit rather than after), so row locks will not have been released yet —
otherwise the result is much the same. I need to check if semisync is able to
detect the TCP close from all slaves and fail faster in this case — however,
this does not help for the case when power failure takes out the slave without
any notice sent on the network.

h3. Pending XID issue

One issue that needs to be dealt with is the potential deadlock described in
this bug report (point 5):

    http://bugs.mysql.com/bug.php?id=44058

The problem is that when the server wants to rotate the binlog, it takes the
{{LOCK_log}} mutex and holds it while it waits for all pending commits to
finish. But {{LOCK_log}} prevents slaves from receiving events, which prevents
slave acks, which prevents pending commits from finishing.
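
The circular wait can be made explicit with a tiny wait-for graph. The sketch below is illustrative only: the node labels paraphrase the paragraph above, an edge means "cannot proceed until", and `find_cycle` is a hypothetical helper that assumes each node waits on at most one other node.

```python
# Each key cannot proceed until its value does (labels are illustrative).
WAITS_FOR = {
    "binlog rotation": "pending commits",
    "pending commits": "slave ack",
    "slave ack": "slave receives events",
    "slave receives events": "LOCK_log",
    "LOCK_log": "binlog rotation",  # held by the rotating thread
}


def find_cycle(waits_for, start):
    """Follow the wait-for chain from start; return the cycle found, or None."""
    path = [start]
    seen = {start}
    node = start
    while node in waits_for:
        node = waits_for[node]
        if node in seen:
            # Close the loop so the cycle reads start ... start.
            return path[path.index(node):] + [node]
        path.append(node)
        seen.add(node)
    return None
```

Calling `find_cycle(WAITS_FOR, "binlog rotation")` returns the full loop back to the rotation itself. Dropping the last edge, i.e. no longer holding `LOCK_log` while waiting for pending commits, makes it return `None`.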

This can be worked around, of course, as is done e.g. in the Google enhanced
semisync patch. But I do not like this work-around, as it introduces even more
complication into what is already a bad design.

I would prefer to instead solve the root problem, namely that the server needs
to stall commits when rotating the binlog. This solves a number of issues. See
a description for this here:

    https://mariadb.atlassian.net/browse/MDEV-181

Sergei Golubchik made changes - 2013-11-14 22:48

Link

This issue relates to ~~MDEV-181~~ [ ~~MDEV-181~~ ]

Pavel Ivanov added a comment - 2013-11-14 23:15

It looks like this bug lived long enough that one can just back-port the implementation of rpl_semi_sync_master_wait_point from 5.7.2, see http://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_rpl_semi_sync_master_wait_point.

Sergei Golubchik made changes - 2014-01-15 18:00

Fix Version/s

10.1.0 [ 12200 ]

VAROQUI Stephane added a comment - 2014-04-03 00:28

Details about some improvements merged into MySQL:
http://yoshinorimatsunobu.blogspot.fr/2014/04/semi-synchronous-replication-at-facebook.html

Sergei Golubchik made changes - 2014-04-16 23:29

Labels

pf1

pf1 replication

Sergei Golubchik made changes - 2014-06-13 15:06

Workflow

defaullt [ 10906 ]

MariaDB v2 [ 44284 ]

Sergei Golubchik made changes - 2014-06-30 01:16

Fix Version/s		10.1 [ 16100 ]
Fix Version/s	10.1.0 [ 12200 ]

Kristian Nielsen made changes - 2014-09-03 10:39

Assignee

Kristian Nielsen [ knielsen ]

Sergei Golubchik made changes - 2014-09-04 15:40

Fix Version/s		10.2.0 [ 14601 ]
Fix Version/s	10.1 [ 16100 ]

Sergei Golubchik made changes - 2014-09-04 15:40

Priority

Minor [ 4 ]

Major [ 3 ]

Jonas Oreland added a comment - 2014-10-08 15:28

FYI: I've just completed implementing this (and a few other things on semi-sync)...and will submit a patch as soon as it's been reviewed on our side.
How do I upload it in the easiest way?

/Jonas

Sergei Golubchik added a comment - 2014-10-12 23:48

jonaso, you can submit a pull request on github or merge request on launchpad or attach a patch here to the issue. Whatever you prefer.

Jonas Oreland added a comment - 2014-10-20 15:19

patch on top of 10.0.15
feedback most welcome

Jonas Oreland made changes - 2014-10-20 15:19

Attachment

mdev162.patch [ 34900 ]

Jonas Oreland added a comment - 2014-11-12 16:37

Hi again Kristian,

I don't know if you looked at my submission, but I haven't heard anything.
Anyway, I implemented it based directly on your suggestion without giving it much thought,
but now this has come back to bite me (when trying to benchmark it).

The problem is that code dead-locks with "high" concurrency.
The reason for this is that the DUMP thread (which makes sure that semi-sync makes forward progress) is using LOCK_log
when reading log-events. But LOCK_log can be held by a thread trying to acquire LOCK_enhanced_semisync (I called this LOCK_after_binlog_sync),
which in its turn is waiting for semi-sync.
Do you think my analysis sounds reasonable ?

My suggested solution is to make the change described in http://my-replication-life.blogspot.se/2013/09/dump-thread-enhancement.html.
I looked at the code, I see no real reason that LOCK_log must be held when reading binlog (other than what is explained in above link).
Can you think of any ?

Do you think the proposed solution sounds reasonable ?

Furthermore, the 5.7 code base has been refactored quite a lot as compared to MariaDB. Most of it is very good.
What are your thoughts/plans on this topic ?

/Jonas

Jonas Oreland added a comment - 2014-11-24 16:22

ping Kristian!

1) What do you think about my comment above.

2) I started on the http://my-replication-life.blogspot.se/2013/09/dump-thread-enhancement.html,
but soon concluded that it was next to impossible without doing the refactorings.
The code is already convoluted and error prone.

But, if doing the refactorings, this will be a quite big/intrusive patch.
Which is (obviously?) something that I prefer not to have only in our tree.
Hence I find myself in a catch-22.
What do you think about this ?
Are you interested in the enhanced-semi-sync, in the refactorings, both or none ?

/Jonas

Kristian Nielsen added a comment - 2014-12-03 21:54

Jonas, I did not see any of your comments until by accident just now.

Kristian Nielsen made changes - 2014-12-03 21:54

Assignee

Kristian Nielsen [ knielsen ]

Jonas Oreland added a comment - 2014-12-04 16:25

Hi again,

I've now backported the dump thread enhancements,
including the big refactorings...https://mariadb.atlassian.net/browse/MDEV-7257

I'm still interested to hear if you think my comments from 2014-11-12 15:37 is correct.

I haven't (yet) tested if that patch fixes the live-lock that occurred previously.

/Jonas

Kristian Nielsen added a comment - 2014-12-09 15:26 - edited

> I'm still interested to hear if you think my comments from 2014-11-12 15:37 is correct.

I think it sounds right. LOCK_log should not be needed by binlog dump threads, as the binlog is append-only.

(There is one thing to check though. When a binlog file is closed, a flag is updated in the Format_description event at the start of the binlog. But I don't expect it could cause any problem).

BTW, it seems to me that group commit also should not need LOCK_log to ensure only one thread at a time is doing (group) commit. It could be a new mutex like LOCK_after_binlog_sync. But I suppose there is no longer much contention on LOCK_log, so no need to introduce another mutex, unless we discover another deadlock issue.

I will take a look at the patch you attached to this bug.

Kristian Nielsen added a comment - 2014-12-23 15:28

Pushed to 10.1. Thanks, Jonas!

Kristian Nielsen made changes - 2014-12-23 15:28

Component/s		Replication [ 10100 ]
Fix Version/s		10.1.3 [ 18000 ]
Fix Version/s	10.2.0 [ 14601 ]
Resolution		Fixed [ 1 ]
Status	Open [ 1 ]	Closed [ 6 ]

Rasmus Johansson (Inactive) made changes - 2015-05-18 17:51

Workflow

MariaDB v2 [ 44284 ]

MariaDB v3 [ 63602 ]

Geoff Montee (Inactive) made changes - 2019-03-20 22:46

Link

This issue relates to MDEV-18983 [ MDEV-18983 ]

Sergei Golubchik made changes - 2021-12-06 21:22

Workflow

MariaDB v3 [ 63602 ]

MariaDB v4 [ 131907 ]
