Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Enhanced semi-synchronous replication does COMMIT in the following way:
1. Prepare the transaction in the storage engine(s).
2. Write the transaction to the binlog, flush the binlog to disk.
3. Wait for at least one slave to acknowledge the reception of the binlog
events for the transaction.
4. Commit the transaction to the storage engine(s).
This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.
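The difference between the two orderings can be sketched as follows. This is only an illustration of the step list above; the function names `commit_steps` and `visible_before_ack` are made up for this example and do not exist in the server.

```python
from typing import List

PREPARE = "prepare in storage engine(s)"
BINLOG = "write and flush binlog"
WAIT_ACK = "wait for slave acknowledgement"
ENGINE_COMMIT = "commit in storage engine(s)"


def commit_steps(enhanced: bool) -> List[str]:
    """Return the COMMIT steps in order for the given semi-sync mode."""
    if enhanced:
        # Enhanced semi-sync: wait for the ack before the engine commit.
        return [PREPARE, BINLOG, WAIT_ACK, ENGINE_COMMIT]
    # Normal semi-sync: steps (3) and (4) are reversed.
    return [PREPARE, BINLOG, ENGINE_COMMIT, WAIT_ACK]


def visible_before_ack(enhanced: bool) -> bool:
    """True if the engine commit (visibility) happens before the slave ack."""
    steps = commit_steps(enhanced)
    return steps.index(ENGINE_COMMIT) < steps.index(WAIT_ACK)
```

With normal semi-sync, `visible_before_ack(False)` is true: other connections can see the transaction before any slave has it. With the enhanced ordering it is false.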
This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
prepare_commit_mutex is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.
A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged the reception of
it. This means that if a master is completely lost, any transaction seen by
other connections will be replicated somewhere, avoiding a potential phantom
read issue.
For more discussion see
http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/
Implementation in MariaDB group commit
In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).
Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.
To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
LOCK_log before the wait and defer taking LOCK_commit_ordered until after
the wait; this allows stage (3) to run in parallel with stages (2) and (4),
while still preserving correct ordering and avoiding one stage getting ahead
of the other.
The mutexes must be chained, meaning that we must take the next lock before
releasing the previous one (otherwise one group might overtake the previous
group, causing incorrect ordering of events):
... stage (2) end ...
|
lock LOCK_enhanced_semisync
|
unlock LOCK_log
|
... stage (3) wait for slave ...
|
lock LOCK_commit_ordered
|
unlock LOCK_enhanced_semisync
|
... stage (4) begin ...
|
See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for
details.
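The chained locking can be tried out in isolation with a small simulation. This is a sketch only, not the code in sql/log.cc: Python threads stand in for group commit leaders, a short sleep stands in for the slave wait, and `run_group_commit` and the three order lists are names invented for the example.

```python
import threading
import time


def run_group_commit(n_groups: int = 8):
    """Run n_groups simulated group leaders through stages (2), (3), (4)."""
    lock_log = threading.Lock()
    lock_enhanced_semisync = threading.Lock()
    lock_commit_ordered = threading.Lock()
    binlog_order, ack_order, commit_order = [], [], []

    def group_leader(gid: int) -> None:
        lock_log.acquire()
        binlog_order.append(gid)          # stage (2): write group to binlog
        lock_enhanced_semisync.acquire()  # take the next lock first ...
        lock_log.release()                # ... then release the previous one
        time.sleep(0.001)                 # stage (3): stand-in for slave wait
        ack_order.append(gid)
        lock_commit_ordered.acquire()     # chained in the same way
        lock_enhanced_semisync.release()
        commit_order.append(gid)          # stage (4): engine commit
        lock_commit_ordered.release()

    threads = [threading.Thread(target=group_leader, args=(g,))
               for g in range(n_groups)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return binlog_order, ack_order, commit_order
```

Whatever order the threads happen to enter stage (2) in, the three lists come out identical, because a group can only take a lock after the group ahead of it has already taken the following one. Since every thread takes the locks in the same order, the chain cannot deadlock.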
Stage (3) should be added as another kind of hook (semi-sync replication
is plugin-based and uses such hooks). We will use the
--rpl_semi_sync_master_wait_before_commit=1 option to enable enhanced
semi-synchronous replication, following the Google patch
http://code.google.com/p/enhanced-semi-sync-replication/
When --rpl_semi_sync_master_wait_before_commit=1 is set, the semi-sync plugin
can use the new hook instead of the current after_commit hook.
Crash scenarios
If a master crashes before a transaction T is written into the binlog, that
transaction will be rolled back during crash recovery upon server restart, as
normal.
If T was written (and synced) into the binlog, but not yet acknowledged by any
slave, and the master crashes, then T will be committed during crash recovery. In
this case, it is possible for a connection to see T committed on the master
before any slave has had time to connect to the master and receive it. Thus,
if we crash again right after crash recovery and completely lose the master,
it is possible for a connection to have seen T on the master while T is now
effectively missing from the system. To fix this, one option is to somehow
have the master wait after crash recovery for at least one slave to connect
and acknowledge all recovered commits, thus extending semi-sync to the crash
recovery phase. An alternative may be for the DBA to prevent connections to
the server after a crash until at least one slave has caught up
(SHOW MASTER STATUS on master and SELECT MASTER_POS_WAIT() on slave).
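For the DBA alternative, the statement to run on the slave can be derived from one row of SHOW MASTER STATUS output. The helper below is a hypothetical sketch (the name `master_pos_wait_sql` and the default timeout are assumptions). It only builds the SQL text and leaves connecting to the servers to whatever client is in use.

```python
import re


def master_pos_wait_sql(master_status: dict, timeout_s: int = 60) -> str:
    """Build a MASTER_POS_WAIT() statement from a SHOW MASTER STATUS row.

    master_status is expected to carry the 'File' and 'Position' columns.
    """
    log_file = str(master_status["File"])
    log_pos = int(master_status["Position"])
    # Refuse anything that does not look like a plain binlog file name,
    # since the value is embedded in a quoted SQL literal.
    if not re.fullmatch(r"[A-Za-z0-9_.\-]+", log_file):
        raise ValueError("unexpected binlog file name: %r" % log_file)
    return "SELECT MASTER_POS_WAIT('%s', %d, %d)" % (
        log_file, log_pos, int(timeout_s))
```

For example, a row with File `mysql-bin.000012` and Position `4711` gives `SELECT MASTER_POS_WAIT('mysql-bin.000012', 4711, 30)` for a 30 second timeout. MASTER_POS_WAIT() returns -1 on timeout and NULL if the slave SQL thread is not running, so client connections should stay blocked until it returns a non-negative value.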
If T was acknowledged by at least one slave, then we know that T exists both
in the master binlog (which is synced before sending to slaves) and in the slave
relay log. Thus, when master crash recovery is done, T will be on both master
and that slave. And if we completely lose the master, T will still eventually
be applied on the slave (unless we lose both master and slave at the same
time).
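The three master-crash cases above can be summarised in a small decision function. This is a sketch of the reasoning only; `Outcome`, `crash_outcome` and `at_risk_after_recovery` are illustrative names, and "on some slave" here means guaranteed by an acknowledgement (a slave may have received T without having acknowledged it yet).

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    committed_on_master_after_recovery: bool
    guaranteed_on_some_slave: bool


def crash_outcome(in_binlog: bool, acked: bool) -> Outcome:
    """Classify transaction T by how far it got before the master crashed."""
    if acked and not in_binlog:
        # The binlog is synced before events are sent to slaves.
        raise ValueError("T cannot be acknowledged without being in the binlog")
    # Not in binlog: rolled back by crash recovery.
    # In binlog: committed by crash recovery.
    return Outcome(committed_on_master_after_recovery=in_binlog,
                   guaranteed_on_some_slave=acked)


def at_risk_after_recovery(in_binlog: bool, acked: bool) -> bool:
    """True if T could be seen on the master and then lost with the master."""
    outcome = crash_outcome(in_binlog, acked)
    return (outcome.committed_on_master_after_recovery
            and not outcome.guaranteed_on_some_slave)
```

Only the middle case (written to the binlog but not yet acknowledged) is at risk, which is the case the two proposed fixes address.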
If a slave crashes during the commit on master, nothing special should
happen, unless all connected slaves crash, leaving the master without any
slaves connected.
In this case the situation is much as with normal semisync. Commits will be
stalled until timeout. They will be stalled a bit earlier (before InnoDB
commit rather than after), so row locks will not have been released yet —
otherwise the result is much the same. I need to check if semisync is able to
detect the TCP close from all slaves and fail faster in this case — however,
this does not help for the case when power failure takes out the slave without
any notice sent on the network.
Pending XID issue
One issue that needs to be dealt with is the potential deadlock described in
this bug report (point 5):
http://bugs.mysql.com/bug.php?id=44058
The problem is that when the server wants to rotate the binlog, it takes the
LOCK_log mutex and holds it while it waits for all pending commits to
finish. But LOCK_log prevents slaves from receiving events, which prevents
slave acks, which prevents pending commits from finishing.
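The deadlock is a cycle in the wait-for relation just described, which a few lines of code can make explicit. The node labels and the helper `find_cycle` are made up for this sketch; they model the description in this ticket, not actual server data structures.

```python
from typing import Dict, List, Optional

# Each party waits for exactly one other party in this scenario.
WAITS_FOR: Dict[str, str] = {
    "binlog rotation (holds LOCK_log)": "pending commits",
    "pending commits": "slave acknowledgement",
    "slave acknowledgement": "slave receiving events",
    "slave receiving events": "binlog rotation (holds LOCK_log)",
}


def find_cycle(waits_for: Dict[str, str], start: str) -> Optional[List[str]]:
    """Follow wait-for edges from start; return the cycle found, or None."""
    path: List[str] = []
    node = start
    while node in waits_for and node not in path:
        path.append(node)
        node = waits_for[node]
    if node in path:
        # Close the loop by repeating the node where the cycle begins.
        return path[path.index(node):] + [node]
    return None
```

Starting from the rotation, `find_cycle` walks through pending commits, slave acknowledgement and event reception back to the rotation itself. Removing any one edge, for example by not holding LOCK_log while waiting for pending commits, breaks the cycle.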
This can be worked around, of course, as is done e.g. in the Google enhanced
semisync patch. But I do not like this work-around: it introduces even more
complication into what is already a bad design.
I would prefer to instead solve the root problem, which is that the server needs
to stall commits when rotating the binlog. This solves a number of issues. See a
description for this here:
https://mariadb.atlassian.net/browse/MDEV-181
Issue Links
- relates to
  - MDEV-181 XID crash recovery across binlog boundaries (Closed)
  - MDEV-18983 Port rpl_semi_sync_master_wait_for_slave_count from MySQL (Open)
Activity
Field | Original Value | New Value |
---|---|---|
Assignee | Rasmus Johansson [ ratzpo ] | Kristian Nielsen [ knielsen ] |
Description | See discussion in comments on: http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/comment-page-1/#comment-878447 | See discussion in comments on: http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/comment-page-1/#comment-878447 See also the email discussion on this. |
Status | Open [ 1 ] | In Progress [ 3 ] |
Link | This issue relates to TODO-160 [ TODO-160 ] |
Status | In Progress [ 3 ] | Open [ 1 ] |
Description |
Enhanced semi-synchronous replication does COMMIT in the following way: 1. Prepare the transaction in the storage engine(s). 2. Write the transaction to the binlog, flush the binlog to disk. 3. Wait for at least one slave to acknowledge the reception of the binlog events for the transaction. 4. Commit the transaction to the storage engine(s). This is different from normal semi-synchronous replication, where steps (3) and (4) are reversed. This task is about implementing enhanced semi-synchronous replication in a way that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced semi-synchronous replication would be very expensive, as the global prepare_commit_mutex is held over the entire operation, which would seriously limit throughput. With MariaDB, a whole group of transactions can enter each stage in parallel, so high thoughput can be maintained. A benefit of enhanced semi-synchronous replication is that a transaction does not become visible until at least one slave has acknowledged the reception of it. This means that if a master is completely lost, any transaction seen by other connections will be replicated somewhere, avoiding a potential phatom read issue. For more discussion see http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/ ----------------------------------------------------------------------- Implementation in MariaDB group commit: In MariaDB group commit, a group of commits queue up while waiting for the previous group to finish. This happens during/just after the prepare step (1). Once the previous group finishes, we have in step (2) a list of commits that we write to the binary log. To implement enhanced semi-synchronous replication, we simply add a step just after (2) where we wait for slave acknowledgement of the last binlog position of the entire group. 
We introduce a new mutex for this, so that we can release LOCK_log before the wait and wait with taking LOCK_commit_ordered until after the wait; this allows stage (3) to run in parallel with stage (2) and (4), while still preserving correct ordering and avoiding one stage getting ahead of the other. The mutexes must be chained, meaning that we must take the next lock before releasing the previous (otherwise one group might overtake previous group, causing incorrect ordering of events): ... stage (2) end ... lock LOCK_enhanced_semisync unlock LOCK_log ... stage (3) wait for slave ... lock LOCK_commit_ordered unlock LOCK_enhanced_semisync ... stage (4) begin ... See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for details. The stage (3) should be added as another kind of hook (semi-sync replication is plugin-based using such hooks). We will use the --rpl_semi_sync_master_wait_before_commit=1 option to enable enhanced semi-synchronous replication, following the Google patch http://code.google.com/p/enhanced-semi-sync-replication/ When --rpl_semi_sync_master_wait_before_commit=1, semi-sync plugin can use the new hook instead of the current after_commit hook. ----------------------------------------------------------------------- Crash scenarios: If a master crashes before a transaction T is written into the binlog, that transaction will be rolled back during crash recovery upon server restart, as normal. If T was written (and synced) into binlog, but not yet acknowledged by any slave, and master crashes, then T will be committed during crash recovery. In this case, it is possible for a connection to see T committed on the master before any slave has had time to connect to the master and receive it. Thus, if we crash again right after crash recovery and completely loose the master, it is possible for a connection to have seen T on the master while T is now effectively missing from the system. 
To fix this, one option is to somehow have the master wait after crash recovery for at least one slave to connect and acknowledge all recovered commits, thus extending the semi-sync to crash recovery phase. An alternative may be for the DBA to prevent connections to the server after a crash until at least one slave has caught up (SHOW MASTER STATUS on master and select master_pos_wait() on slave). If T was acknowledged by at least one slave, then we know that T exists both in master binlog (which is synced before sending to slaves) and slave relay-log. Thus, when master crash recovery is done, T will be on both master and that slave. And if we completely loose the master, T will still eventually be applied on the slave (unless we loose both master and slave at the same time). If a slave crashes during the commit on master, nothing special should happen, unless _all_ connected slaves crash, leaving the master without any slaves connected. In this case the situation is much as with normal semisync. Commits will be stalled until timeout. They will be stalled a bit earlier (before InnoDB commit rather than after), so row locks will not have been released yet - otherwise the result is much the same. I need to check if semisync is able to detect the TCP close from all slaves and fail faster in this case - however, this does not help for the case when power failure takes out the slave without any notice sent on the network. ----------------------------------------------------------------------- Pending XID issue One issue that needs to be dealt with is the potential deadlock described in this bug report (point 5): http://bugs.mysql.com/bug.php?id=44058 The problem is that when the server wants to rotate the binlog, it takes the LOCK_log mutex and holds it while it waits for all pending commits to finish. But LOCK_log prevents slaves from receiving events, which prevents slave acks, which prevents pending commits to finish. This can be worked around, of course - as eg. 
done in the Google enhanced semisync patch. But I do not like this work-around - in introduces even more complication into what is already a bad design. I would prefer to instead solve the root problem - that server needs to stall commits when rotating the binlog. This solves a number of issues. See a description for this here: https://mariadb.atlassian.net/browse/MDEV-181 |
Description |
Enhanced semi-synchronous replication does COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).
2. Write the transaction to the binlog, flush the binlog to disk.
3. Wait for at least one slave to acknowledge the reception of the binlog events for the transaction.
4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3) and (4) are reversed.

This task is about implementing enhanced semi-synchronous replication in a way that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced semi-synchronous replication would be very expensive, as the global {{prepare_commit_mutex}} is held over the entire operation, which would seriously limit throughput. With MariaDB, a whole group of transactions can enter each stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does not become visible until at least one slave has acknowledged the reception of it. This means that if a master is completely lost, any transaction seen by other connections will be replicated somewhere, avoiding a potential phantom read issue.

For more discussion see http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

h3. Implementation in MariaDB group commit

In MariaDB group commit, a group of commits queue up while waiting for the previous group to finish. This happens during/just after the prepare step (1).

Once the previous group finishes, we have in step (2) a list of commits that we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just after (2) where we wait for slave acknowledgement of the last binlog position of the entire group.
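The wait added after step (2) could look roughly like the following. This is only a minimal sketch under stated assumptions: {{AckTracker}}, {{report_ack()}} and {{wait_for_ack()}} are hypothetical names that do not exist in the server, and a binlog position is simplified to a single integer offset rather than a file name plus offset. The group leader would wait once for the last position of the whole group.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical tracker of the highest binlog position acknowledged by any
// slave. Not server code; positions are simplified to one integer.
class AckTracker {
 public:
  // Called when a slave acknowledges reception of events up to `pos`.
  void report_ack(std::uint64_t pos) {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      if (pos > acked_pos_) acked_pos_ = pos;
    }
    cond_.notify_all();
  }

  // Blocks until `pos` has been acknowledged or `timeout` expires.
  // Returns true if acknowledged, false if the wait timed out.
  bool wait_for_ack(std::uint64_t pos, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lock(mutex_);
    return cond_.wait_for(lock, timeout,
                          [&] { return acked_pos_ >= pos; });
  }

 private:
  std::mutex mutex_;
  std::condition_variable cond_;
  std::uint64_t acked_pos_ = 0;
};
```

Waiting only for the last position of the group means a single acknowledgement releases every transaction in the group, which is what keeps the extra step cheap under group commit.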
We introduce a new mutex for this, so that we can release {{LOCK_log}} before the wait and wait with taking {{LOCK_commit_ordered}} until after the wait; this allows stage (3) to run in parallel with stage (2) and (4), while still preserving correct ordering and avoiding one stage getting ahead of the other. The mutexes must be chained, meaning that we must take the next lock before releasing the previous (otherwise one group might overtake the previous group, causing incorrect ordering of events):

{noformat}
... stage (2) end ...
lock LOCK_enhanced_semisync
unlock LOCK_log
... stage (3) wait for slave ...
lock LOCK_commit_ordered
unlock LOCK_enhanced_semisync
... stage (4) begin ...
{noformat}

See the code in {{sql/log.cc}}, {{MYSQL_BIN_LOG::trx_group_commit_leader()}} for details.

Stage (3) should be added as another kind of hook (semi-sync replication is plugin-based using such hooks).

We will use the {{--rpl_semi_sync_master_wait_before_commit=1}} option to enable enhanced semi-synchronous replication, following the Google patch http://code.google.com/p/enhanced-semi-sync-replication/

When {{--rpl_semi_sync_master_wait_before_commit=1}} is set, the semi-sync plugin can use the new hook instead of the current {{after_commit}} hook.

h3. Crash scenarios

If a master crashes before a transaction T is written into the binlog, that transaction will be rolled back during crash recovery upon server restart, as normal.

If T was written (and synced) into the binlog, but not yet acknowledged by any slave, and the master crashes, then T will be committed during crash recovery. In this case, it is possible for a connection to see T committed on the master before any slave has had time to connect to the master and receive it. Thus, if we crash again right after crash recovery and completely lose the master, it is possible for a connection to have seen T on the master while T is now effectively missing from the system.
To fix this, one option is to somehow have the master wait after crash recovery for at least one slave to connect and acknowledge all recovered commits, thus extending the semi-sync to the crash recovery phase. An alternative may be for the DBA to prevent connections to the server after a crash until at least one slave has caught up ({{SHOW MASTER STATUS}} on master and {{select master_pos_wait()}} on slave).

If T was acknowledged by at least one slave, then we know that T exists both in the master binlog (which is synced before sending to slaves) and in the slave relay-log. Thus, when master crash recovery is done, T will be on both master and that slave. And if we completely lose the master, T will still eventually be applied on the slave (unless we lose both master and slave at the same time).

If a slave crashes during the commit on master, nothing special should happen, unless *all* connected slaves crash, leaving the master without any slaves connected. In this case the situation is much as with normal semisync. Commits will be stalled until timeout. They will be stalled a bit earlier (before InnoDB commit rather than after), so row locks will not have been released yet; otherwise the result is much the same. I need to check if semisync is able to detect the TCP close from all slaves and fail faster in this case. However, this does not help for the case when a power failure takes out the slave without any notice sent on the network.

h3. Pending XID issue

One issue that needs to be dealt with is the potential deadlock described in this bug report (point 5): http://bugs.mysql.com/bug.php?id=44058

The problem is that when the server wants to rotate the binlog, it takes the {{LOCK_log}} mutex and holds it while it waits for all pending commits to finish. But {{LOCK_log}} prevents slaves from receiving events, which prevents slave acks, which prevents pending commits from finishing. This can be worked around, of course, as e.g. done in the Google enhanced semisync patch.
But I do not like this work-around; it introduces even more complication into what is already a bad design. I would prefer to instead solve the root problem: that the server needs to stall commits when rotating the binlog. This solves a number of issues. See a description for this here: https://mariadb.atlassian.net/browse/MDEV-181 |
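The lock chaining described under "Implementation in MariaDB group commit" can be illustrated with a small stand-alone sketch. This is not the code in {{sql/log.cc}}: {{CommitPipeline}}, {{run_group()}} and {{run_groups()}} are hypothetical names, the three mutexes only mirror the names used in the description, and each stage is reduced to recording the group id. The point it tries to show is that taking the next lock before releasing the previous one keeps groups in the same order through stages (2), (3) and (4), even when they run concurrently.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the commit pipeline. Only the mutex names come
// from the description; everything else is invented for illustration.
struct CommitPipeline {
  std::mutex LOCK_log;
  std::mutex LOCK_enhanced_semisync;
  std::mutex LOCK_commit_ordered;
  std::vector<int> binlogged;  // guarded by LOCK_log
  std::vector<int> acked;      // guarded by LOCK_enhanced_semisync
  std::vector<int> committed;  // guarded by LOCK_commit_ordered

  // Runs one commit group through stages (2), (3) and (4).
  void run_group(int group_id) {
    LOCK_log.lock();
    binlogged.push_back(group_id);   // stage (2): write group to binlog
    LOCK_enhanced_semisync.lock();   // take the next lock first ...
    LOCK_log.unlock();               // ... then release the previous one
    acked.push_back(group_id);       // stage (3): placeholder for slave wait
    LOCK_commit_ordered.lock();
    LOCK_enhanced_semisync.unlock();
    committed.push_back(group_id);   // stage (4): commit in storage engines
    LOCK_commit_ordered.unlock();
  }
};

// Starts n groups concurrently and waits for all of them to finish.
void run_groups(CommitPipeline &p, int n) {
  std::vector<std::thread> threads;
  for (int i = 0; i < n; ++i)
    threads.emplace_back([&p, i] { p.run_group(i); });
  for (auto &t : threads) t.join();
}
```

Which group enters stage (2) first is up to the scheduler, but after {{run_groups()}} returns the three vectors should hold the group ids in the same order. If {{LOCK_log}} were released before {{LOCK_enhanced_semisync}} was taken, a later group could slip past an earlier one between the two stages.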
Fix Version/s | 10.1.0 [ 12200 ] |
Labels | pf1 | pf1 replication |
Workflow | defaullt [ 10906 ] | MariaDB v2 [ 44284 ] |
Fix Version/s | 10.1 [ 16100 ] | |
Fix Version/s | 10.1.0 [ 12200 ] |
Assignee | Kristian Nielsen [ knielsen ] |
Fix Version/s | 10.2.0 [ 14601 ] | |
Fix Version/s | 10.1 [ 16100 ] |
Priority | Minor [ 4 ] | Major [ 3 ] |
Attachment | mdev162.patch [ 34900 ] |
Assignee | Kristian Nielsen [ knielsen ] |
Component/s | Replication [ 10100 ] | |
Fix Version/s | 10.1.3 [ 18000 ] | |
Fix Version/s | 10.2.0 [ 14601 ] | |
Resolution | Fixed [ 1 ] | |
Status | Open [ 1 ] | Closed [ 6 ] |
Workflow | MariaDB v2 [ 44284 ] | MariaDB v3 [ 63602 ] |
Link | This issue relates to MDEV-18983 [ MDEV-18983 ] |
Workflow | MariaDB v3 [ 63602 ] | MariaDB v4 [ 131907 ] |
Adding a bit to the estimate, as the semisync code has shown itself to be complex and rather bug-ridden. So better anticipate a couple extra days to deal with this.