[MXS-3234] Resolve cooperative monitoring deadlocks Created: 2020-10-12  Updated: 2022-04-06  Resolved: 2022-04-05

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: None
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Assen Totin (Inactive) Assignee: Todd Stoffel (Inactive)
Resolution: Incomplete Votes: 0
Labels: None

Issue Links:
Relates
relates to MXS-4079 Document cooperative monitoring confl... Closed

 Description   

MaxScale 2.5 introduced cooperative monitoring as a way for two MaxScale instances to decide on their own which is the active and which is the passive instance. In the KB this is now the sole documented option (the articles on Keepalived, Lsyncd etc. have been removed). Unfortunately, cooperative monitoring does not work well when the system has (or, depending on the monitoring mode, ends up having) an even number of (active) nodes; this can easily lead to a situation where both MaxScale instances become passive and there is none left to perform the HA function of a failover, should one be needed.

Example: one primary, three slaves; both MaxScales running with "majority of all"; each MaxScale obtains two locks and remains passive due to lack of quorum. Similar scenarios can be constructed for the "majority of running" mode as well.

We see at least two ways of resolving such deadlocks:

  • With MaxScale-to-MaxScale communication: a strategy like "lowest IP address wins" could let one MaxScale become active when neither holds a majority of the necessary locks.
  • Without MaxScale-to-MaxScale communication: assign a weight to each lockable node and use these weights to form the quorum. Following the above example, if one of the nodes is given a weight of 2, the sum of all weights becomes 5 instead of 4, so one of the MaxScale instances will hold 3/5 of the lock weights and thus stay active even though it only holds locks on 1/2 of the nodes (the other MaxScale will also hold locks on 1/2 of the nodes, but only 2/5 of the total weight).
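The weighted-quorum idea from the second bullet can be sketched roughly as follows. This is an illustration of the proposal, not existing MaxScale code; the server names, weights, and function name are all hypothetical:

```python
# Sketch of the proposed weight-based quorum: each server carries a
# weight, and a MaxScale instance stays active when the summed weight
# of the locks it holds is a strict majority of the total weight.

def has_weighted_quorum(held_locks, weights):
    """held_locks: set of server names this MaxScale holds locks on."""
    total = sum(weights.values())
    held = sum(weights[s] for s in held_locks)
    return held * 2 > total  # strict majority of the weight sum

# One primary, three slaves; server1 is given weight 2, so the total
# weight is 5 rather than 4 and a 2/2 lock split can no longer tie.
weights = {"server1": 2, "server2": 1, "server3": 1, "server4": 1}

maxscale_a = {"server1", "server2"}  # holds 3/5 of the weight -> active
maxscale_b = {"server3", "server4"}  # holds 2/5 of the weight -> passive

print(has_weighted_quorum(maxscale_a, weights))  # True
print(has_weighted_quorum(maxscale_b, weights))  # False
```

With the old unweighted scheme both instances would hold 2/4 of the locks and neither would reach a majority; the odd weight sum guarantees one side wins.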


 Comments   
Comment by markus makela [ 2021-10-11 ]

This should be a transient problem, as both MaxScales attempting to acquire the locks will sleep for a random time when a conflict occurs. This is similar to how Raft clusters resolve election conflicts and has proven to be a practical approach. Anything more complex isn't warranted, as conflicts are a rare occurrence with cooperative monitoring.
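The randomized-backoff approach described above can be sketched like this. The function names, retry counts, and timings are illustrative assumptions, not MaxScale's actual implementation:

```python
import random
import time

def acquire_locks_with_backoff(try_acquire_all, release_all,
                               max_wait=5.0, attempts=10):
    """Retry lock acquisition, sleeping for a random time after each
    conflict so two competing instances desynchronize, similar in
    spirit to randomized election timeouts in Raft."""
    for _ in range(attempts):
        if try_acquire_all():
            return True          # this instance holds all locks
        release_all()            # back off fully so the peer can win
        time.sleep(random.uniform(0, max_wait))
    return False
```

Because each instance sleeps for an independent random interval, the chance that both retry at the same moment shrinks with every round, so a conflict resolves quickly in practice.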

Comment by markus makela [ 2022-04-05 ]

Closing as Incomplete, as there is practically no way the current conflict resolution algorithm ends up in a deadlock. Even with the somewhat optimistic approach of the current implementation, there is only a 25% probability of the two MaxScales attempting to acquire the locks at the same time. For the given example of two MaxScales and four servers, this gives roughly a 99% chance of the conflict being resolved within a minute.
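The rough arithmetic behind that estimate can be checked as follows, assuming (as the comment implies) an independent 25% chance of the two MaxScales colliding again on each retry round; the per-round model is an assumption, not something stated in the ticket:

```python
# Back-of-the-envelope check: if each retry round has an independent
# 25% chance of both MaxScales colliding again, the chance the
# conflict is still unresolved after n rounds is 0.25 ** n.

p_conflict = 0.25

def prob_resolved(rounds):
    """Probability the conflict resolves within `rounds` retries."""
    return 1 - p_conflict ** rounds

# A handful of retry rounds fit into a minute, which already pushes
# the resolution probability past 99%.
print(prob_resolved(3))  # 0.984375
print(prob_resolved(4))  # 0.99609375
```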

Generated at Thu Feb 08 04:19:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.