[MXS-3490] Xpand monitor should detect and handle group change explicitly Created: 2021-04-12  Updated: 2022-05-23  Resolved: 2022-04-08

Status: Closed
Project: MariaDB MaxScale
Component/s: xpandmon
Affects Version/s: None
Fix Version/s: 22.08.0

Type: New Feature Priority: Major
Reporter: Johan Wikman Assignee: Johan Wikman
Resolution: Fixed Votes: 1
Labels: None

Issue Links:
Issue split
split from MXS-3472 Transaction Replay: transactions not ... Closed
Sprint: MXS-SPRINT-153, MXS-SPRINT-154

 Description   

Currently the Xpand monitor treats group change errors like any other error. That is, a group change error causes the monitor to abandon the current "hub" (the Xpand node it uses for fetching cluster topology information) and connect to another node, which will also fail with a group change error. After that the monitor will at regular intervals connect to each node, which will fail, until the group change is over.

At the same time, the monitor will ping the health check port of each node, but for a node that is removed, the port will continue to return OK. That is, as far as any routers are concerned, those nodes/servers appear to be ready to use. However, that's just an appearance, as any attempt to use them will end with a group change error.

This means that there will be an awful amount of activity and error handling that simply cannot be resolved before the group change is over. Thus, the Xpand monitor:

  • should detect whenever a monitor operation fails due to a group change, and in that case
  • stop the normal health check ping,
  • mark all servers (internally) as being down,
  • regularly connect in order to find out whether the group change has finished, and in that case
  • check the cluster configuration and remove/add servers, and
  • turn on the regular health check ping, which will cause the servers to be marked as being up.

That way a great deal of activity will basically stop for the duration of the group change. Until the group change is over, there is no point in doing anything else than checking whether the group change is over.
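
The proposed behaviour can be sketched as a small two-state machine. This is only an illustrative Python model, not the actual xpandmon code (which is part of MaxScale); the names `MonitorSketch`, `MonitorState`, `GroupChangeError`, `fetch_topology` and `ping` are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class MonitorState(Enum):
    NORMAL = auto()
    GROUP_CHANGE = auto()


class GroupChangeError(Exception):
    """Hypothetical: raised when a node reports an ongoing group change."""


@dataclass
class MonitorSketch:
    # Server name -> "up" or "down" (the monitor's internal view).
    servers: dict = field(default_factory=dict)
    state: MonitorState = MonitorState.NORMAL

    def tick(self, fetch_topology, ping):
        """Run one monitor round.

        fetch_topology() returns a list of node names or raises
        GroupChangeError; ping(name) returns True if the node is healthy.
        Both callables are assumptions made for this sketch.
        """
        try:
            nodes = fetch_topology()
        except GroupChangeError:
            # Group change in progress: mark everything down and skip
            # the health check ping entirely.
            self.state = MonitorState.GROUP_CHANGE
            for name in self.servers:
                self.servers[name] = "down"
            return

        # The connection succeeded, so any group change is over:
        # drop removed nodes, add new ones, then resume pinging.
        self.state = MonitorState.NORMAL
        self.servers = {name: self.servers.get(name, "down") for name in nodes}
        for name in self.servers:
            self.servers[name] = "up" if ping(name) else "down"
```

During a group change, each call to `tick` makes a single connection attempt and nothing else, which is the reduction in activity argued for above.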



 Comments   
Comment by Johan Wikman [ 2021-04-14 ]

maxmether No changes whatsoever would be needed in Xpand. The Xpand monitor simply has to check whether an error is a group change error and act differently if it is. Currently all errors are treated in the same way.
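
As a rough illustration of that check, the Python sketch below classifies an error message and picks a different action for group change errors. The marker text, the function names and the action names are all assumptions made for this example; the issue does not state how Xpand actually reports a group change.

```python
# Hypothetical marker; the real error code/text is not given in this issue.
GROUP_CHANGE_MARKER = "group change"


def is_group_change_error(message: str) -> bool:
    """Return True if the error message looks like a group change error."""
    return GROUP_CHANGE_MARKER in message.lower()


def handle_monitor_error(message: str) -> str:
    """Choose the monitor's reaction to a failed operation."""
    if is_group_change_error(message):
        # New behaviour: stop normal activity and wait it out.
        return "wait_for_group_change"
    # Existing behaviour for every other error: try another hub.
    return "switch_hub"
```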

Comment by Johan Wikman [ 2022-04-08 ]

When a group change is detected, the state of the Xpand nodes is set to down and kept there until it is detected that the group change is over.
