Synchronous Replication configuration membership maintenance by the configuration participants
===============================================================================================

Motivation
----------

All master and slave servers must maintain consistent knowledge of the latest
configuration that they are, or were last, a part of. The configuration
changes when slaves are added, removed, or crash, and the master can crash
too. This task is about tracking such changes and keeping that knowledge
consistent across the involved servers.

The motivation for this task comes from several fairly well-known use cases,
including failover and the desire to automate reconfiguration without
creating custom solutions. For example, consider a semisync configuration
Conf = { M0, S1 } (M0 - master, S1 - slave; numbers index the servers, M and
S designate the roles). I drop the index `0' when it refers to the "original"
master and its change is not implied in the picture. When M0 crashes, or a
network partition hides it from both the DBA and S1, S1 may be promoted to
replace the master as M1. While semisync allows for that, there are a few
situations where M0 can "disappear" having committed a few transactions that
S1 has never heard of, so will not have as M1. A hot example is MDEV-20996;
it is tackled, but the fix requires M0 to not "forget" to restart as a
semisync slave after the crash. The fully synchronous solution does not
require that: M0 simply may not have extra transactions in its binlog.

So the failover issue is sensitive already in a system of two servers, and it
gets more serious when the configuration scales out. E.g. in the semisync
Conf = { M, S1, S2, S4 }, M commits a transaction T when any of the Si
acknowledges that T was received; say it was S1. Further on, let { M, S1 }
separate from { S2, S4 } due to crashes or a network issue. With semisync,
the { S2, S4 } sub-group cannot tell the DBA that choosing one of its servers
to replace M may lead to the loss of T. In contrast, full sync does not allow
automatic promotion to master of any of S2, S4 until either M or S1 gets
back, or the DBA intervenes "divinely" and thereby accepts possible
transaction loss (she may not even know about T).

Examples like the above are virtually endless, and what they all have in
common is, firstly, the lack of configuration awareness on the master and
slave servers. Secondly, a related issue is that when a slave acknowledges
the receipt of T to M, a single acknowledgment may not be enough.

High level description of the Configuration membership tracking algorithm
-------------------------------------------------------------------------

More specifically, the knowledge that master and slaves are to share consists
of a configuration identifier (think of an ever-growing integer) and a
description of its members, which includes their roles (master or slave),
their addresses, and how to access them (host, port, user, password). The
last executed/received GTID (from each domain) is also a part of the
knowledge; let us abstract from GTIDs for a while, though.

The Primary Configuration can be initiated by setting up a binlogging server
M. It initializes Conf.memb = { M }, Conf.id = 0. The DBA can add a second
member to the configuration through the routine

    CHANGE MASTER TO ...
    START SLAVE

M adds S1's data to its configuration view to yield Conf.memb = { M, S1 },
Conf.id++, and informs S1 about Conf. S1 just accepts that. Both have now
understood Conf.memb = { M, S1 }. It is clear how adding more slaves, or
removing some, would work; a sketch of this step follows below.
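Here is a minimal sketch of the shared configuration record and of the
master-side step of adding a slave, under the assumption that the record is
persisted and then broadcast to all members. The names (Conf, Member,
add_member, persist, broadcast) are illustrative only, not the actual server
code.

    # Illustrative model of the configuration knowledge; not server code.
    from dataclasses import dataclass, field

    @dataclass(frozen=True)
    class Member:
        role: str        # 'master' or 'slave'
        host: str
        port: int
        user: str
        password: str

    @dataclass
    class Conf:
        id: int = 0                        # ever-growing configuration id
        memb: list = field(default_factory=list)

        def add_member(self, new_member: Member) -> None:
            # CHANGE MASTER TO ... START SLAVE on the new slave ends up
            # here on M: extend the view, bump the id, remember the new
            # configuration, and inform the members (the new one accepts).
            self.memb.append(new_member)
            self.id += 1
            self.persist()                 # placeholder: one durable write
            self.broadcast()               # placeholder: send Conf to members

        def persist(self) -> None:
            pass

        def broadcast(self) -> None:
            pass

    # The flow from the text: Conf.memb = { M }, Conf.id = 0, then M adds S1.
    conf = Conf(memb=[Member('master', 'm.example', 3306, 'repl', 'pwd')])
    conf.add_member(Member('slave', 's1.example', 3306, 'repl', 'pwd'))
    assert conf.id == 1 and len(conf.memb) == 2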
As of the current semisync + regular replication code base, M and S are
always capable of monitoring each other's live status: M listens for the
slaves' acks, and S does so for heartbeats. Both mechanisms need to be
adjusted to serve as failure detectors that trigger reconfiguration (an
"implicit" STOP SLAVE) when a server is suspected to be down. When M is
suspected to be down by a slave S, the latter initiates reconfiguration by
sending the other slaves an offer to form a new configuration. That entails a
further message exchange in which the slaves provide each other with their
Conf views. This type of exchange protocol is described in fine detail in
various sources, e.g. see the internal state transition diagram in [1], pages
48, 54 (the complexity of the state transitions is significantly lower in our
case, as we have reliable TCP instead of Totem, which produces various
uncertainties). The bottom line is that, unlike in semisync, the fullsync
slaves are aware of each other and at times exchange internal messages for
reconfiguration, a part of which is to elect a master. In that election their
GTID states, including @@global.gtid_slave_pos, are taken into account. In
normal circumstances the reconfiguration takes two message delays and one
disk write to remember the newly installed configuration.

User-configurable number of slave acknowledgments instead of semisync's fixed single ack
----------------------------------------------------------------------------------------

In a scaled-out configuration with more than 2 slaves, the current semisync's
single acknowledgment is not enough to guarantee no transaction loss on a
network split. The user is offered to state her preference based on the
quality of the network and the level of high availability desired. E.g. to be
safe against two simultaneous crashes, the ack number should be at least 2.
To be safe against network partitioning, acks must arrive from a majority of
the slaves + master system. E.g. in a 4-slave configuration (so 5 servers in
total), at least 2 acks must be received before M is allowed to commit a
transaction. A sketch of this rule follows the references below.

Practical benefits
------------------

Include
- automated reconfiguration and master election;
- an SQL interface for the DBA to learn the status of the Primary
  configuration, or of non-primary configurations, at any time;
- resilience up to a full crash: when a surviving majority of the last
  Primary configuration restarts, it can form the Primary configuration
  again;
- a necessary (but not yet sufficient) provision for a type of consistent
  read, SELECT from a slave server.

References:

[1] Yair Amir, "Replication Using Group Communication Over a Partitioned
    Network", PhD thesis, The Hebrew University of Jerusalem, 1995.
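Below is a minimal sketch of the ack-counting rule described above. The
function names and the way the user preference is combined with the majority
requirement (via max) are illustrative assumptions, not the actual server
logic.

    # Illustrative ack-counting rule; not server code.

    def required_acks(n_slaves: int, user_pref: int) -> int:
        # A majority of the (slaves + master) system consists of
        # (n_slaves + 1) // 2 + 1 servers; since M counts itself,
        # (n_slaves + 1) // 2 slave acks complete that majority.
        majority = (n_slaves + 1) // 2
        return max(user_pref, majority)

    def can_commit(acks_received: int, n_slaves: int, user_pref: int) -> bool:
        # M may commit T only once enough slaves acknowledged its receipt.
        return acks_received >= required_acks(n_slaves, user_pref)

    # The text's example: 4 slaves, 5 servers in total -> at least 2 acks.
    assert required_acks(4, user_pref=1) == 2
    # Securing against two simultaneous crashes -> ask for at least 2 acks.
    assert required_acks(4, user_pref=2) == 2
    assert not can_commit(1, 4, user_pref=1)
    assert can_commit(2, 4, user_pref=1)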