Details
- Type: New Feature
- Status: Closed
- Priority: Major
- Resolution: Won't Do
Description
There are two scenarios to consider:
1 - The master was OK (normal shutdown or unreachable)
2 - The master has crashed
1 - How do we detect a normal shutdown or an unreachable master?
This looks impossible, so we fall back to scenario 2: assume the old master has crashed and trigger re-provisioning from the new master.
Such a brute-force scenario is possible but needs a provisioning solution.
2 - We can avoid provisioning for automatic failover when the master has crashed if we can fail over to an already up-to-date candidate master, preserving the consistency of the cluster.
Again we have two scenarios:
2.1 - The proxy maintains a backlog
- 2.1.1 - No pending event in the queue: failover can happen
- 2.1.2 - Repair the new master via multiplexing
- 2.1.3 - Repair the new master via binlog server
2.2 - The proxy tracks the synchronous replication state and fails over when the state is OK
- 2.2.1 - Replication is semi-sync: this widens the window in which failover is possible
- 2.2.2 - Replication is asynchronous: failover is still possible by implementing causal reads, tracking the replication state per GTID (see the sketch after this list)
- 2.2.3 - Replication has not caught up: fall back to provisioning
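A minimal sketch of the per-GTID tracking needed for 2.2.2, assuming MariaDB GTIDs and the pymysql driver; the helper names and connection parameters are illustrative, not an existing API.

    # Illustrative sketch for 2.2.2: check whether a candidate has replicated up to the
    # last GTID observed on the master. Assumes MariaDB GTIDs and the pymysql driver.
    import pymysql

    def current_slave_gtid(host, user, password):
        """Return the node's @@gtid_slave_pos (the GTIDs it has already applied)."""
        conn = pymysql.connect(host=host, user=user, password=password)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT @@gtid_slave_pos")
                return cur.fetchone()[0]
        finally:
            conn.close()

    def candidate_caught_up(candidate_host, last_master_gtid, user, password, timeout=5):
        """True if the candidate reaches last_master_gtid within `timeout` seconds.
        MASTER_GTID_WAIT() returns 0 when the position is reached and -1 on timeout."""
        conn = pymysql.connect(host=candidate_host, user=user, password=password)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT MASTER_GTID_WAIT(%s, %s)", (last_master_gtid, timeout))
                return cur.fetchone()[0] == 0
        finally:
            conn.close()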
2.1.2 - We add a backlog of WRITE queries and their respective GTIDs
Each backlog entry is updated by the monitoring plugin, which keeps track, per node, of the GTIDs already replicated.
Failover does the following (a sketch follows this list):
- Wait until the SQL thread has nothing more to do on the candidate master
- Check that the backlog has not been overloaded by replication delay
- Apply on the candidate master all backlog entries that have not yet been replicated, reusing the same GTIDs
- Set gtid_ignore_duplicates=1 on the old master
- Have the monitor plugin trigger an external script that starts the slave on the "old master up" event
- Switch traffic
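A hedged sketch of the backlog-replay failover above. It assumes the proxy keeps the backlog as a list of (gtid, statement) pairs and a per-node set of already-replicated GTIDs; the data structures, the MAX_BACKLOG limit and the pymysql driver are all assumptions for illustration, not an existing implementation.

    # Illustrative sketch of the 2.1.2 backlog failover. `backlog` is a list of
    # (gtid, write_statement) pairs and `replicated[node]` a set of GTIDs already
    # seen on that node, both maintained by the proxy/monitor (assumed structures).
    import time
    import pymysql

    MAX_BACKLOG = 10000  # assumed guard against runaway replication delay

    def failover_with_backlog(candidate_host, backlog, replicated, user, password):
        conn = pymysql.connect(host=candidate_host, user=user, password=password,
                               autocommit=True)
        with conn.cursor() as cur:
            # 1. Wait until the SQL thread has nothing more to do on the candidate.
            while True:
                cur.execute("SHOW SLAVE STATUS")
                row = dict(zip([col[0] for col in cur.description], cur.fetchone()))
                if row["Read_Master_Log_Pos"] == row["Exec_Master_Log_Pos"]:
                    break
                time.sleep(0.1)

            # 2. Refuse to fail over if replication delay has overloaded the backlog.
            if len(backlog) > MAX_BACKLOG:
                raise RuntimeError("backlog overloaded, fall back to provisioning")

            # 3. Replay entries the candidate has not yet replicated, reusing the
            #    original GTIDs (MariaDB format: domain-server_id-seq_no). A full
            #    implementation would also set @@server_id so the forced GTID matches.
            for gtid, statement in backlog:
                if gtid in replicated[candidate_host]:
                    continue
                domain, _server_id, seq_no = gtid.split("-")
                cur.execute("SET @@gtid_domain_id = %d, @@gtid_seq_no = %d"
                            % (int(domain), int(seq_no)))
                cur.execute(statement)

            # 4. gtid_ignore_duplicates=1 is set on the old master by the monitor's
            #    "old master up" hook; the proxy switches traffic after this returns.
        conn.close()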
The backlog architecture is not 100% safe: keeping track of all write queries and replaying them does not guarantee that the candidate master ends up in the same state, except in serialized mode.
The proxy needs to keep track of the commit order between sessions. How to do that needs serious discussion with the server team and may require statement-based replication on the master. If GTIDs reflect commit order, we may be fine.
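If GTID sequence numbers do reflect commit order, the backlog replay above can simply be ordered by (domain, sequence number) before it is applied; a tiny illustrative helper, assuming the MariaDB domain-server_id-seq_no format:

    # Order backlog entries by (domain_id, seq_no); within one replication domain the
    # MariaDB sequence number follows binlog (commit) order.
    def backlog_in_commit_order(backlog):
        """backlog: list of (gtid, statement) pairs, gtid as 'domain-server_id-seq_no'."""
        def key(entry):
            domain, _server_id, seq_no = entry[0].split("-")
            return (int(domain), int(seq_no))
        return sorted(backlog, key=key)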
2.1.3 - Binlog server and semi-sync replication: the backlog is all the binlogs not yet applied to the candidate master
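A rough sketch of this variant, assuming the proxy knows which binlog files the candidate has not yet applied; the file paths and helper name are illustrative assumptions.

    # Illustrative sketch for 2.1.3: replay the not-yet-applied binlogs held by the
    # binlog server against the candidate master, via mysqlbinlog | mysql.
    import subprocess

    def replay_missing_binlogs(binlog_files, candidate_host, user, password):
        for path in binlog_files:          # e.g. ["/var/lib/binlogs/mariadb-bin.000042"]
            dump = subprocess.Popen(["mysqlbinlog", path], stdout=subprocess.PIPE)
            subprocess.check_call(["mysql", "-h", candidate_host,
                                   "-u", user, "--password=" + password],
                                  stdin=dump.stdout)
            dump.stdout.close()
            dump.wait()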
2.2.1 - Do not use a backlog, but track the semi-sync replication state
Failover does the following (a sketch follows this list):
- Check that semi-sync is present
- Check that the monitoring interval is at least 3 times smaller than rpl_semi_sync_master_timeout
- The monitoring plugin keeps track of Rpl_semi_sync_master_status on the current master
- If Rpl_semi_sync_master_status was ON in the last sample, failover can continue
- Switch traffic
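A minimal sketch of the semi-sync check, assuming the monitor polls the master with pymysql and caches the last sample (the master may already be down when failover is actually triggered); the monitor_interval_ms parameter is an assumption.

    # Illustrative sketch for 2.2.1: sample the semi-sync state on the current master.
    # The monitor calls this periodically and keeps the last result; failover consults
    # the cached value because the master may be unreachable by then.
    import pymysql

    def sample_semi_sync_state(master_host, monitor_interval_ms, user, password):
        conn = pymysql.connect(host=master_host, user=user, password=password)
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master_timeout'")
                row = cur.fetchone()
                if row is None:
                    return False               # semi-sync plugin not present
                timeout_ms = int(row[1])
                if monitor_interval_ms * 3 > timeout_ms:
                    return False               # monitor too slow to trust its last sample
                cur.execute("SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status'")
                status = cur.fetchone()
                return status is not None and status[1] == "ON"
        finally:
            conn.close()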
To delegate failover to an external script we need to pass the backlog and/or the last Rpl_semi_sync_master_status.
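One possible hand-over, sketched with an assumed script path, argument names and backlog file location (none of these are an existing interface):

    # Illustrative: delegate failover to an external script, passing the backlog as a
    # JSON file plus the last sampled Rpl_semi_sync_master_status.
    import json
    import subprocess

    def delegate_failover(script_path, backlog, last_semi_sync_status):
        backlog_file = "/tmp/failover_backlog.json"       # assumed location
        with open(backlog_file, "w") as f:
            json.dump(backlog, f)
        subprocess.check_call([script_path,
                               "--backlog", backlog_file,
                               "--semi-sync-status", last_semi_sync_status])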