Currently the Xpand monitor treats group change errors like any other error. That is, a group change error causes the monitor to abandon the current "hub" (the Xpand node it uses for fetching cluster topology information) and connect to another node, which will also fail with a group change error. After that the monitor will, at regular intervals, connect to each node, and each attempt will fail until the group change is over.
At the same time, the monitor will ping the health check port of each node, but even for a node that is removed, the health check will continue to return OK. That is, as far as any routers are concerned, those nodes/servers appear to be ready to use. However, that is only an appearance, as any attempt to use them will end with a group change error.
This means that there will be a large amount of activity and error handling, none of which can be resolved before the group change is over. Thus, the Xpand monitor:
- should detect whenever a monitor operation fails due to a group change, and in that case
  - stop the normal health check ping,
  - mark all servers (internally) as being down,
  - regularly connect in order to find out whether the group change has finished, and in that case
    - check the cluster configuration and remove/add servers, and
    - turn on the regular health check ping, which will cause the servers to be marked as being up.
That way a great deal of activity will basically stop for the duration of the group change. Until the group change is over, there is no point in doing anything other than checking whether it is over.
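
The proposed behaviour can be sketched as a small two-state machine. This is only an illustration in Python, not the monitor's actual code: `GroupChangeError`, `State`, `Monitor`, and the `cluster` object with its `fetch_topology()` and `ping()` methods are all hypothetical names invented for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class GroupChangeError(Exception):
    """Hypothetical stand-in for the error a node returns during a group change."""


class State(Enum):
    NORMAL = auto()
    GROUP_CHANGE = auto()


@dataclass
class Monitor:
    # Hypothetical cluster handle: fetch_topology() returns node names or
    # raises GroupChangeError; ping(name) returns True if the health check is OK.
    cluster: object
    state: State = State.NORMAL
    servers: dict = field(default_factory=dict)  # node name -> is up?

    def tick(self):
        """Run one monitor interval."""
        try:
            # While a group change is ongoing, this is the only probe made.
            topology = self.cluster.fetch_topology()
        except GroupChangeError:
            # Group change detected: skip the health check ping and
            # mark every known server as down.
            self.state = State.GROUP_CHANGE
            for name in self.servers:
                self.servers[name] = False
            return

        # The probe succeeded, so any group change has finished.
        self.state = State.NORMAL

        # Check the cluster configuration: drop removed nodes, add new ones.
        self.servers = {name: self.servers.get(name, False) for name in topology}

        # Regular health check ping, which marks the servers as up again.
        for name in self.servers:
            self.servers[name] = self.cluster.ping(name)
```

Under these assumptions, the monitor makes exactly one connection attempt per interval while the group change is ongoing. All other activity resumes on the first interval in which that attempt succeeds.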