[CONJ-595] Create option to configure DONOR/DESYNCED Galera nodes to be unavailable for load-balancing Created: 2018-04-11  Updated: 2020-08-25  Resolved: 2018-05-25

Status: Closed
Project: MariaDB Connector/J
Component/s: Failover
Affects Version/s: None
Fix Version/s: 2.2.5, 1.7.4

Type: Task
Priority: Major
Reporter: Geoff Montee (Inactive)
Assignee: Diego Dupin
Resolution: Fixed
Votes: 0
Labels: None


 Description   

When a node is in the DONOR/DESYNCED state, it doesn't participate in flow control, so its data can get stale. Galera states are explained here:

http://galeracluster.com/documentation-webpages/nodestates.html#changes-in-the-node-state

A node's state can be checked with wsrep_local_state:

http://galeracluster.com/documentation-webpages/galerastatusvariables.html#wsrep-local-state

MaxScale's Galera Monitor treats nodes in the DONOR/DESYNCED state as unavailable unless available_when_donor is configured. This ensures that MaxScale does not route queries to a node that has stale data.

https://mariadb.com/kb/en/mariadb-enterprise/mariadb-maxscale-22-galera-monitor/#available_when_donor

As far as I can tell, MariaDB Connector/J's load balancing implementation does not have a way to keep queries from being sent to desynced Galera nodes. Maybe we should add an option that would enable that kind of behavior?



 Comments   
Comment by Diego Dupin [ 2018-04-13 ]

There is no monitor inside the connector, but there is a specific implementation for Galera.

Connection pools validate a connection's state before lending it out. This is done using Connection.isValid(timeout).

The standard implementation of Connection.isValid() sends a COM_PING. For Galera, the connection instead issues the query "show status like 'wsrep_cluster_status'" and checks that the status is "PRIMARY". This ensures not only that the socket is open, but also that the server is in the PRIMARY state. So this uses 'wsrep_cluster_status' rather than 'wsrep_local_state', but the result will be the same.
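For illustration only, the check described above can be approximated from application code with plain JDBC. This is a minimal sketch, not the connector's internal implementation: the class name GaleraClusterStatusCheck and its methods are hypothetical, and the comparison is case-insensitive on the assumption that the server may report the value as "Primary".

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical application-side helper; not part of MariaDB Connector/J.
final class GaleraClusterStatusCheck {

    private GaleraClusterStatusCheck() {}

    // Pure comparison, kept separate so it can be exercised without a server.
    static boolean isPrimary(String wsrepClusterStatus) {
        return wsrepClusterStatus != null
                && "PRIMARY".equalsIgnoreCase(wsrepClusterStatus.trim());
    }

    // Runs the status query and checks the returned value.
    static boolean isClusterPrimary(Connection connection, int timeoutSeconds) {
        try (Statement statement = connection.createStatement()) {
            statement.setQueryTimeout(timeoutSeconds);
            try (ResultSet rs = statement.executeQuery(
                    "SHOW STATUS LIKE 'wsrep_cluster_status'")) {
                // Column 1 is Variable_name, column 2 is Value.
                return rs.next() && isPrimary(rs.getString(2));
            }
        } catch (SQLException e) {
            // Any failure counts as "not valid", in the spirit of isValid() returning false.
            return false;
        }
    }
}
```

A pool could call isClusterPrimary(connection, timeout) wherever it would otherwise call Connection.isValid(timeout).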

See CONJ-400.

Comment by Geoff Montee (Inactive) [ 2018-04-13 ]

It's not really correct that checking wsrep_cluster_status for PRIMARY will have the same result as checking wsrep_local_state. The problem is that a node can be in the DONOR/DESYNCED state while still being in the cluster's primary component. When a node is desynced, it means that it doesn't participate in flow control. This means that it can fall behind the other nodes in the cluster's primary component. It's kind of similar to slave lag with a traditional replication slave. If the node has fallen behind, then it might not make sense to use it for load balancing in some applications.

Maybe Connector/J should have an option that would make Connection.isValid() check wsrep_cluster_status and wsrep_local_state?
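To make the distinction concrete, here is a small sketch of a combined check. The numeric values follow the Galera documentation for wsrep_local_state (1 Joining, 2 Donor/Desynced, 3 Joined, 4 Synced); the class GaleraNodeUsability and the method isUsable are hypothetical names, and treating only Synced as usable is an assumption that some applications may want to relax.

```java
// Hypothetical helper illustrating why both status variables matter.
final class GaleraNodeUsability {

    // wsrep_local_state values as listed in the Galera documentation.
    static final int JOINING = 1;
    static final int DONOR_DESYNCED = 2;
    static final int JOINED = 3;
    static final int SYNCED = 4;

    private GaleraNodeUsability() {}

    // A node is considered usable only if it is in the primary component
    // AND its local state is Synced.
    static boolean isUsable(String wsrepClusterStatus, int wsrepLocalState) {
        boolean primary = wsrepClusterStatus != null
                && "Primary".equalsIgnoreCase(wsrepClusterStatus.trim());
        return primary && wsrepLocalState == SYNCED;
    }
}
```

With this check, a node reporting "Primary" together with state 2 (Donor/Desynced) is rejected, which a check on wsrep_cluster_status alone would not do.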

Comment by Diego Dupin [ 2018-05-14 ]

Right!
As stated here: http://galeracluster.com/documentation-webpages/nodestates.html#node-state-changes
Primary is not enough; the server must also be Synced.

The implementation will change to rely on wsrep_local_state to check the "synced" status.

Comment by Diego Dupin [ 2018-05-25 ]

The new option "galeraAllowedState" corrects the implementation.

If the option "galeraAllowedState" is not set (the default), Connection.isValid() just sends a COM_PING: an empty packet to the server, to which the server responds with a small packet, confirming connectivity.

When the option "galeraAllowedState" is set, the connector will ensure that the server's "wsrep_local_state" corresponds to one of the allowed values (separated by commas).
For example, the option "galeraAllowedState=4" will ensure that the server is available and in the "synced" state, not just primary.
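As a rough sketch of how the option might be used and how a comma-separated list of allowed states can be matched: the host names, port, and database in the URL are placeholders, and the class GaleraAllowedStateSketch is hypothetical. It models the behavior described above and is not the connector's actual code; trimming whitespace around values is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the "galeraAllowedState" matching logic.
final class GaleraAllowedStateSketch {

    // Placeholder hosts and database; only the option name comes from this issue.
    static final String EXAMPLE_URL =
            "jdbc:mariadb:loadbalance://node1:3306,node2:3306/test?galeraAllowedState=4";

    private GaleraAllowedStateSketch() {}

    // Splits the option value on commas, ignoring blanks.
    static Set<String> parseAllowedStates(String optionValue) {
        Set<String> states = new HashSet<>();
        if (optionValue == null) {
            return states;
        }
        for (String part : optionValue.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                states.add(trimmed);
            }
        }
        return states;
    }

    // Returns true when the reported wsrep_local_state is one of the allowed values.
    // An unset option places no restriction on state (plain connectivity check only).
    static boolean isAllowed(String optionValue, String wsrepLocalState) {
        Set<String> allowed = parseAllowedStates(optionValue);
        if (allowed.isEmpty()) {
            return true;
        }
        return wsrepLocalState != null && allowed.contains(wsrepLocalState.trim());
    }
}
```

Under this model, "galeraAllowedState=4" accepts only synced nodes, while a value such as "2,4" would also accept nodes in the DONOR/DESYNCED state.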
