Synchronous Replication configuration membership maintenance by the configuration participants
===============================================================================================

Motivation
----------

All master and slave servers must maintain consistent knowledge of the latest
configuration that they are, or were last, a part of. The configuration
changes when slaves are added, removed, or crash, and the master can crash
too. This task is about tracking such changes and keeping that knowledge
consistent across the involved servers.

The motivation for this task comes from several fairly well-known use cases,
including failover and the desire to automate reconfiguration without
creating custom solutions. For example, consider a semisync configuration
Conf = { M0, S1 } (M0 - master, S1 - slave; numbers index the servers, M and
S designate the roles). I drop the index `0' when it refers to the "original"
master and its change is not implied in the picture. When M0 crashes, or a
network partition hides it from both the DBA and S1, S1 may be promoted to
replace the master as M1. While semisync allows for that, there are a few
situations where M0 can "disappear" having committed a few transactions that
S1 has never heard of, so will not have as M1. A hot example is MDEV-20996;
it is tackled, but the fix requires M0 to not "forget" to restart as a
semisync slave after the crash. The fully synchronous solution does not
require that: M0 simply may not have extra transactions in its binlog.

So the failover issue is sensitive already in a system of two servers, and it
gets more serious when the configuration scales out. E.g. in the semisync
Conf = { M, S1, S2, S4 }, M commits a transaction T when any of the Si
acknowledges that T was received; say it was S1. Further on, let { M, S1 }
separate from { S2, S4 } due to crashes or a network issue. With semisync,
the { S2, S4 } sub-group cannot tell the DBA that choosing one of its servers
to replace M may lead to the loss of T. In contrast, full sync does not allow
automatic promotion to master of any of S2, S4 until either M or S1 gets
back, or the DBA intervenes "divinely" and thereby accepts possible
transaction loss (she may not even know about T).

Examples like the above are virtually endless, and what they all have in
common is, firstly, the lack of configuration awareness on the master and
slave servers. Secondly, a related issue is that when a slave acknowledges
the receipt of T to M, a single acknowledgment may not be enough.

High level description of the Configuration membership tracking algorithm
-------------------------------------------------------------------------

More specifically, the knowledge that master and slaves are to share consists
of a configuration identifier (think of an ever-growing integer) and a
description of its members, which includes their roles (master or slave),
their addresses, and how to access them (host, port, user, password). The
last executed/received GTID (from each domain) is also a part of the
knowledge; let us abstract from GTIDs for a while, though.

The Primary Configuration can be initiated by setting up a binlogging server
M. It initializes Conf.memb = { M }, Conf.id = 0. The DBA can add a second
member to the configuration through the routine

    CHANGE MASTER TO ...
    START SLAVE

M adds S1's data to its configuration view to yield Conf.memb = { M, S1 },
Conf.id++, and informs S1 about Conf. S1 just accepts that. Both have now
understood Conf.memb = { M, S1 }. It is clear how adding more slaves, or
removing some, would work; a sketch of this step follows below.
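Here is a minimal sketch of the shared configuration record and of the
master-side step of adding a slave, under the assumption that the record is
persisted and then broadcast to all members. The names (Conf, Member,
add_member, persist, broadcast) are illustrative only, not the actual server
code.

    # Illustrative model of the configuration knowledge; not server code.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Member:
        role: str        # 'master' or 'slave'
        host: str
        port: int
        user: str
        password: str

    @dataclass
    class Conf:
        id: int = 0                        # ever-growing configuration id
        memb: list = field(default_factory=list)

        def add_member(self, new_member: Member) -> None:
            # CHANGE MASTER TO ... START SLAVE on the new slave ends up
            # here on M: extend the view, bump the id, remember the new
            # configuration, and inform the members (the new one accepts).
            self.memb.append(new_member)
            self.id += 1
            self.persist()                 # placeholder: one durable write
            self.broadcast()               # placeholder: send Conf to members

        def persist(self) -> None:
            pass

        def broadcast(self) -> None:
            pass

    # The flow from the text: Conf.memb = { M }, Conf.id = 0, then M adds S1.
    conf = Conf(memb=[Member('master', 'm.example', 3306, 'repl', 'pwd')])
    conf.add_member(Member('slave', 's1.example', 3306, 'repl', 'pwd'))
    assert conf.id == 1 and len(conf.memb) == 2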
As of the current semisync + regular replication code base, M and S are
always capable of monitoring each other's live status: M listens for the
slaves' acks, and S does so for heartbeats. Both mechanisms need to be
adjusted to serve as failure detectors that trigger reconfiguration (an
"implicit" STOP SLAVE) when a server is suspected to be down. When M is
suspected to be down by a slave S, the latter initiates reconfiguration by
sending the other slaves an offer to form a new configuration. That entails a
further message exchange in which the slaves provide each other with their
Conf views. This type of exchange protocol is described in fine detail in
various sources, e.g. see the internal state transition diagram in [1], pages
48, 54 (the complexity of the state transitions is significantly lower in our
case, as we have reliable TCP instead of Totem, which produces various
uncertainties). The bottom line is that, unlike in semisync, the fullsync
slaves are aware of each other and at times exchange internal messages for
reconfiguration, a part of which is to elect a master. In that election their
GTID states, including @@global.gtid_slave_pos, are taken into account. In
normal circumstances the reconfiguration takes two message delays and one
disk write to remember the newly installed configuration.

User-configurable number of slave acknowledgments instead of semisync's fixed single ack
----------------------------------------------------------------------------------------

In a scaled-out configuration with more than 2 slaves, the current semisync's
single acknowledgment is not enough to guarantee no transaction loss on a
network split. The user is offered to state her preference based on the
quality of the network and the level of high availability desired. E.g. to be
safe against two simultaneous crashes, the ack number should be at least 2.
To be safe against network partitioning, acks must arrive from a majority of
the slaves + master system. E.g. in a 4-slave configuration (so 5 servers in
total), at least 2 acks must be received before M is allowed to commit a
transaction. A sketch of this rule follows the references below.

Practical benefits
------------------

Include
- automated reconfiguration and master election;
- an SQL interface for the DBA to learn the status of the Primary
  configuration, or of non-primary configurations, at any time;
- resilience up to a full crash: when a surviving majority of the last
  Primary configuration restarts, it can form the Primary configuration
  again;
- a necessary (but not yet sufficient) provision for a type of consistent
  read, SELECT from a slave server.

References:

[1] Yair Amir, "Replication Using Group Communication Over a Partitioned
    Network", PhD thesis, The Hebrew University of Jerusalem, 1995.
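Below is a minimal sketch of the ack-counting rule described above. The
function names and the way the user preference is combined with the majority
requirement (via max) are illustrative assumptions, not the actual server
logic.

    # Illustrative ack-counting rule; not server code.

    def required_acks(n_slaves: int, user_pref: int) -> int:
        # A majority of the (slaves + master) system consists of
        # (n_slaves + 1) // 2 + 1 servers; since M counts itself,
        # (n_slaves + 1) // 2 slave acks complete that majority.
        majority = (n_slaves + 1) // 2
        return max(user_pref, majority)

    def can_commit(acks_received: int, n_slaves: int, user_pref: int) -> bool:
        # M may commit T only once enough slaves acknowledged its receipt.
        return acks_received >= required_acks(n_slaves, user_pref)

    # The text's example: 4 slaves, 5 servers in total -> at least 2 acks.
    assert required_acks(4, user_pref=1) == 2
    # Securing against two simultaneous crashes -> ask for at least 2 acks.
    assert required_acks(4, user_pref=2) == 2
    assert not can_commit(1, 4, user_pref=1)
    assert can_commit(2, 4, user_pref=1)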