[MDEV-162] Enhanced semisync replication - Jira

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Fix Version/s: 10.1.3
Component/s: Replication
Labels:
- pf1
- replication

Description

Enhanced semi-synchronous replication does COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).

2. Write the transaction to the binlog, flush the binlog to disk.

3. Wait for at least one slave to acknowledge the reception of the binlog
events for the transaction.

4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.

This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
prepare_commit_mutex is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged the reception of
it. This means that if a master is completely lost, any transaction seen by
other connections will be replicated somewhere, avoiding a potential phantom
read issue.
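
The difference between the two orderings, and why it affects visibility, can
be sketched as follows. This is a minimal Python illustration only: the
function names and step labels are made up for the example and do not come
from the server code. It assumes a transaction becomes visible to other
connections at the storage engine commit.

```python
from typing import List

# Illustrative labels that paraphrase steps (1)-(4) above.
PREPARE = "prepare in storage engine(s)"
BINLOG = "write and flush binlog"
WAIT_ACK = "wait for slave acknowledgement"
ENGINE_COMMIT = "commit in storage engine(s)"


def commit_steps(enhanced: bool) -> List[str]:
    """Return the COMMIT steps in the order they run."""
    if enhanced:
        # Enhanced semi-sync: wait for the ack before the engine commit.
        return [PREPARE, BINLOG, WAIT_ACK, ENGINE_COMMIT]
    # Normal semi-sync: steps (3) and (4) are reversed.
    return [PREPARE, BINLOG, ENGINE_COMMIT, WAIT_ACK]


def visible_before_ack(enhanced: bool) -> bool:
    """True if the engine commit (visibility) happens before the slave ack."""
    steps = commit_steps(enhanced)
    return steps.index(ENGINE_COMMIT) < steps.index(WAIT_ACK)
```

With normal semi-sync, `visible_before_ack(False)` is true, which is the
window in which another connection can see a transaction that no slave has
received yet.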

For more discussion see

http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

Implementation in MariaDB group commit

In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).

Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
LOCK_log before the wait and delay taking LOCK_commit_ordered until after the
wait; this allows stage (3) to run in parallel with stages (2) and (4), while
still preserving correct ordering and avoiding one stage getting ahead of the
other.

The mutexes must be chained, meaning that we must take the next lock before
releasing the previous one (otherwise one group might overtake the previous
group, causing incorrect ordering of events):

    ... stage (2) end ...

    lock LOCK_enhanced_semisync

    unlock LOCK_log

    ... stage (3) wait for slave ...

    lock LOCK_commit_ordered

    unlock LOCK_enhanced_semisync

    ... stage (4) begin ...

See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for
details.
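
The chained locking can be tried out in isolation. The following is a small
Python simulation, not the server code: the three `threading.Lock` objects
are stand-ins for the mutexes named above, each thread plays the leader of one
commit group, and the stage bodies are reduced to recording the group id.
Because each leader takes the next lock before releasing the previous one,
groups pass through all three stages in the same order.

```python
import threading

# Stand-ins for the three server mutexes (names mirror the text above).
lock_log = threading.Lock()
lock_enhanced_semisync = threading.Lock()
lock_commit_ordered = threading.Lock()

# Order in which groups pass through each stage.
stage2_order, stage3_order, stage4_order = [], [], []


def group_commit_leader(group_id):
    """Run stages (2)-(4) for one commit group using chained locking."""
    lock_log.acquire()
    stage2_order.append(group_id)      # stage (2): write group to binlog

    lock_enhanced_semisync.acquire()   # take the next lock first ...
    lock_log.release()                 # ... then release the previous one
    stage3_order.append(group_id)      # stage (3): wait for slave ack

    lock_commit_ordered.acquire()
    lock_enhanced_semisync.release()
    stage4_order.append(group_id)      # stage (4): commit in the engine
    lock_commit_ordered.release()


def run_groups(n):
    """Start n group leaders concurrently and wait for all of them."""
    threads = [threading.Thread(target=group_commit_leader, args=(i,))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

After `run_groups(n)`, the three order lists are identical. If a leader
released LOCK_log before taking the next lock, a later group could slip past
it between the stages.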

Stage (3) should be added as another kind of hook (semi-sync replication
is plugin-based using such hooks). We will use the
--rpl_semi_sync_master_wait_before_commit=1 option to enable enhanced
semi-synchronous replication, following the Google patch

http://code.google.com/p/enhanced-semi-sync-replication/

When --rpl_semi_sync_master_wait_before_commit=1 is set, the semi-sync plugin
can use the new hook instead of the current after_commit hook.

Crash scenarios

If a master crashes before a transaction T is written into the binlog, that
transaction will be rolled back during crash recovery upon server restart, as
normal.

If T was written (and synced) into the binlog, but not yet acknowledged by any
slave, and the master crashes, then T will be committed during crash recovery.
In this case, it is possible for a connection to see T committed on the master
before any slave has had time to connect to the master and receive it. Thus,
if we crash again right after crash recovery and completely lose the master,
it is possible for a connection to have seen T on the master while T is now
effectively missing from the system. To fix this, one option is to somehow
have the master wait after crash recovery for at least one slave to connect
and acknowledge all recovered commits, thus extending semi-sync to the crash
recovery phase. An alternative may be for the DBA to prevent connections to
the server after a crash until at least one slave has caught up
(SHOW MASTER STATUS on the master and SELECT MASTER_POS_WAIT() on the slave).
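
The DBA-side check could be scripted roughly as below. This is a sketch under
several assumptions: `master_cur` and `slave_cur` are Python DB-API cursors
returning tuple rows, the driver uses `%s` placeholders (as PyMySQL does), and
the function name is made up for the example. Note that MASTER_POS_WAIT()
waits until the slave has applied events up to the given master position, so
the check is somewhat stricter than "received".

```python
def wait_for_slave_catch_up(master_cur, slave_cur, timeout_s=60):
    """Return True once the slave has reached the master's binlog position.

    master_cur / slave_cur: DB-API cursors connected to the master and to
    one slave (assumed; any driver with tuple rows and %s placeholders).
    """
    # The first two columns of SHOW MASTER STATUS are File and Position.
    master_cur.execute("SHOW MASTER STATUS")
    row = master_cur.fetchone()
    if row is None:
        raise RuntimeError("binary logging is not enabled on the master")
    log_file, log_pos = row[0], row[1]

    # MASTER_POS_WAIT returns NULL on error (for example if the slave SQL
    # thread is not running), -1 on timeout, and otherwise the number of
    # events it had to wait for (0 or more).
    slave_cur.execute("SELECT MASTER_POS_WAIT(%s, %s, %s)",
                      (log_file, log_pos, timeout_s))
    (result,) = slave_cur.fetchone()
    return result is not None and result >= 0
```

Client connections to the recovered master would stay blocked until this
returns True for at least one slave.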

If T was acknowledged by at least one slave, then we know that T exists both
in the master binlog (which is synced before sending to slaves) and in the
slave relay log. Thus, when master crash recovery is done, T will be on both
the master and that slave. And if we completely lose the master, T will still
eventually be applied on the slave (unless we lose both master and slave at
the same time).
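
The three master-crash cases above can be condensed into a small decision
function. This Python sketch only restates the text; the function name and
the returned strings are illustrative.

```python
def outcome_after_master_crash(in_binlog, acked_by_slave):
    """Describe what happens to transaction T when the master crashes.

    in_binlog:      T was written and synced to the master binlog.
    acked_by_slave: at least one slave acknowledged receiving T.
    """
    if acked_by_slave and not in_binlog:
        # The binlog is synced before events are sent to slaves.
        raise ValueError("a slave cannot ack T before it is in the binlog")
    if not in_binlog:
        return "rolled back during crash recovery"
    if not acked_by_slave:
        return "committed during crash recovery; may exist only on the master"
    return "committed during crash recovery; also in a slave relay log"
```

Only the middle case is problematic: T is committed on the master after
recovery, but no slave is known to have it.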

If a slave crashes during the commit on master, nothing special should
happen, unless all connected slaves crash, leaving the master without any
slaves connected.

In this case the situation is much as with normal semisync. Commits will be
stalled until timeout. They will be stalled a bit earlier (before InnoDB
commit rather than after), so row locks will not have been released yet —
otherwise the result is much the same. I need to check if semisync is able to
detect the TCP close from all slaves and fail faster in this case — however,
this does not help for the case when power failure takes out the slave without
any notice sent on the network.
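
The stall can be pictured with a timed wait. In this Python sketch the
function name and the log strings are hypothetical, a `threading.Event`
stands in for the slave acknowledgement, and it is assumed that the commit
proceeds once the timeout expires, as with normal semisync.

```python
import threading


def commit_with_enhanced_semisync(ack_event, timeout_s, log):
    """Wait for a slave ack before the engine commit; give up after timeout_s.

    Returns True if the ack arrived, False if the wait timed out.
    """
    log.append("binlog written, row locks still held")
    # Event.wait returns True if the event was set, False on timeout.
    acked = ack_event.wait(timeout=timeout_s)
    log.append("ack received" if acked else "timed out waiting for ack")
    # Only now does the engine commit run and release the row locks.
    log.append("engine commit, row locks released")
    return acked
```

With no slave connected the event is never set, so every commit holds its row
locks for the full timeout before continuing.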

Pending XID issue

One issue that needs to be dealt with is the potential deadlock described in
this bug report (point 5):

http://bugs.mysql.com/bug.php?id=44058

The problem is that when the server wants to rotate the binlog, it takes the
LOCK_log mutex and holds it while it waits for all pending commits to
finish. But LOCK_log prevents slaves from receiving events, which prevents
slave acks, which prevents pending commits from finishing.
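
The circular wait can be made explicit as a wait-for graph. The Python sketch
below is illustrative only: the node labels paraphrase the paragraph above,
and `find_cycle` is a hypothetical helper that follows the edges until it
revisits a node.

```python
# Each key waits for its value; labels paraphrase the description above.
WAITS_FOR = {
    "binlog rotation (holds LOCK_log)": "pending commits",
    "pending commits": "slave ack",
    "slave ack": "slave receiving events",
    "slave receiving events": "binlog rotation (holds LOCK_log)",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle if one is reached."""
    path = []
    node = start
    while node is not None and node not in path:
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended, so there is no deadlock from here
    # Return the cycle, closing it by repeating its first node at the end.
    return path[path.index(node):] + [node]
```

Starting from any of the four nodes, `find_cycle(WAITS_FOR, ...)` returns a
cycle through all of them. Removing any one edge, for example by not holding
LOCK_log while waiting for pending commits, breaks the cycle.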

This can be worked around, of course, as is done e.g. in the Google enhanced
semisync patch. But I do not like this work-around: it introduces even more
complication into what is already a bad design.

I would prefer to instead solve the root problem, namely that the server needs
to stall commits when rotating the binlog. This solves a number of issues. See
a description of this here:

https://mariadb.atlassian.net/browse/MDEV-181

Attachments

mdev162.patch (75 kB, 2014-10-20 15:19)

Issue Links

relates to:
- MDEV-181 XID crash recovery across binlog boundaries (Closed)
- MDEV-18983 Port rpl_semi_sync_master_wait_for_slave_count from MySQL (Open)

Activity

(7 older comments not shown)

Jonas Oreland added a comment - 2014-11-24 16:22

ping Kristian!

1) What do you think about my comment above?

2) I started on http://my-replication-life.blogspot.se/2013/09/dump-thread-enhancement.html,
but soon concluded that it was next to impossible without doing the refactorings.
The code is already convoluted and error-prone.

But if I do the refactorings, this will be quite a big, intrusive patch,
which is (obviously?) something that I prefer not to have only in our tree.
Hence I find myself in a catch-22.
What do you think about this?
Are you interested in the enhanced semi-sync, in the refactorings, both, or neither?

/Jonas

Kristian Nielsen added a comment - 2014-12-03 21:54

Jonas, I did not see any of your comments until by accident just now.

Jonas Oreland added a comment - 2014-12-04 16:25

Hi again,

I've now backported the dump thread enhancements,
including the big refactorings: https://mariadb.atlassian.net/browse/MDEV-7257

I'm still interested to hear if you think my comments from 2014-11-12 15:37 are correct.

I haven't (yet) tested if that patch fixes the live-lock that occurred previously.

/Jonas

Kristian Nielsen added a comment - 2014-12-09 15:26 - edited

> I'm still interested to hear if you think my comments from 2014-11-12 15:37 are correct.

I think it sounds right. LOCK_log should not be needed by binlog dump threads, as the binlog is append-only.

(There is one thing to check though. When a binlog file is closed, a flag is updated in the Format_description event at the start of the binlog. But I don't expect it could cause any problem).

BTW, it seems to me that group commit also should not need LOCK_log to ensure only one thread at a time is doing (group) commit. It could be a new mutex like LOCK_after_binlog_sync. But I suppose there is no longer much contention on LOCK_log, so no need to introduce another mutex, unless we discover another deadlock issue.

I will take a look at the patch you attached to this bug.

Kristian Nielsen added a comment - 2014-12-23 15:28

Pushed to 10.1. Thanks, Jonas!
