Recovering a Galera cluster is not always easy to automate, but here are two ways we could do it, at least to a certain degree.
1. Suggestion for the recovery algorithm
- get wsrep_last_committed from the MySQL interface;
- if that is not possible, ssh to the remaining nodes and run galera_recovery;
- among the running nodes, select the one with the highest seqno and execute on it: set global wsrep_provider_options="pc.bootstrap=1";
- if no nodes are running, ssh to one of the nodes and execute galera_new_cluster.
With the algorithm described above, we should always check whether one node is the most advanced in replication, so that set global wsrep_provider_options="pc.bootstrap=1" is executed on the right node and we avoid bootstrapping the wrong one. We see several ways of doing this check and would like to discuss the best way to identify the most advanced node, or even to bootstrap the latest master when the seqno is -1 or there is no grastate.dat on disk.
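One possible shape for this check is sketched below in Python, purely for discussion. It assumes the content of each node's grastate.dat has already been fetched by some other means (for example over ssh), and it treats a missing file or a seqno of -1 as "position unknown". The function names parse_grastate_seqno and pick_most_advanced are hypothetical and are not part of MaxScale or Galera.

```python
from typing import Dict, Optional

UNKNOWN_SEQNO = -1  # value found in grastate.dat when the position was not saved


def parse_grastate_seqno(text: Optional[str]) -> Optional[int]:
    """Return the seqno found in grastate.dat content, or None if unusable."""
    if not text:
        return None  # file missing or empty
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "seqno":
            try:
                seqno = int(value.strip())
            except ValueError:
                return None  # malformed seqno line
            return None if seqno == UNKNOWN_SEQNO else seqno
    return None  # no seqno line at all


def pick_most_advanced(seqnos: Dict[str, Optional[int]]) -> Optional[str]:
    """Pick the node with the highest known seqno; None if no node has one."""
    known = {node: s for node, s in seqnos.items() if s is not None}
    if not known:
        return None
    # Iterating over sorted names makes ties resolve deterministically.
    return max(sorted(known), key=lambda node: known[node])
```

When pick_most_advanced returns None, no node has a usable position on disk, which is the case where galera_recovery would have to be run first, or where the fallback policy (such as bootstrapping the latest master) still needs to be agreed on.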
If some of the nodes are not reachable, just store that information so it can be displayed by the galeramon monitor.
Introduce an auto_failover parameter for galeramon that accepts the following values:
- true, false, force
- false: disables failover for galeramon (default)
- true: enables the recovery algorithm described above
- force: same as true, but ignores unreachable nodes and bootstraps the cluster using only the reachable nodes
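The three values could be interpreted along the lines of the sketch below. It is illustrative Python, not MaxScale code, and plan_failover is a hypothetical name. It rests on one assumption that the proposal does not state explicitly: with auto_failover=true the monitor does not bootstrap while some nodes are unreachable and only reports them, since ignoring unreachable nodes is what distinguishes force.

```python
from typing import Dict, List, Optional, Tuple


def plan_failover(mode: str,
                  seqnos: Dict[str, int],
                  unreachable: List[str]) -> Tuple[Optional[str], str]:
    """Return (node_to_bootstrap, reason); the node is None when nothing is done.

    seqnos maps each reachable node to its known seqno.
    """
    if mode not in ("false", "true", "force"):
        raise ValueError(f"unknown auto_failover value: {mode!r}")
    if mode == "false":
        return None, "failover disabled"
    if unreachable and mode == "true":
        # Assumption: 'true' only reports unreachable nodes, it does not bootstrap.
        return None, "unreachable nodes: " + ", ".join(sorted(unreachable))
    if not seqnos:
        return None, "no reachable node with a known seqno"
    # 'force', or 'true' with every node reachable: take the most advanced node.
    candidate = max(sorted(seqnos), key=lambda node: seqnos[node])
    return candidate, "bootstrap most advanced reachable node"
```

For example, plan_failover("force", {"node1": 10, "node2": 12}, ["node3"]) selects node2, while the same call with "true" selects nothing and returns the list of unreachable nodes for display.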
2. Suggestion for the recovery command run by MaxScale:
Introduce a new command for galeramon:
This command triggers a failover for galeramon, launching the bootstrap operation (cf. the previous algorithm) and ignoring unreachable nodes.