[MXS-4779] Maxscale monitor suddenly loses entire cluster status (galeramon) Created: 2023-09-27  Updated: 2023-10-25  Resolved: 2023-10-10

Status: Closed
Project: MariaDB MaxScale
Component/s: galeramon
Affects Version/s: 6.4.10
Fix Version/s: 6.4.11

Type: Bug Priority: Major
Reporter: Rick Pizzi Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: triage
Environment:

3 MaxScale nodes behind an ALB, on AWS VMs



 Description   

A 5-node Galera cluster loses two nodes (NODE03 and NODE04) within a couple of minutes due to OOM events.
The cluster reconfigures and remains healthy with the remaining 3 nodes.
However, MaxScale loses the status of ALL nodes, causing an outage.

2023-09-25 14:07:29.827   error  : (mon_report_query_error): Failed to execute query on server 'NODE04' ([10.225.27.118]:3306): Lost connection to server during query
2023-09-25 14:08:10.003   notice : (log_state_change): Server changed state: NODE04[10.225.27.118:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2023-09-25 14:09:06.851   error  : (985612) (NODE03); (socket_write): Write to Backend DCB 10.225.27.183 in state DCB::State::POLLING failed: 104, Connection reset by peer
2023-09-25 14:09:30.579   error  : [galeramon] (post_tick): There are no cluster members
2023-09-25 14:09:30.579   notice : (log_state_change): Server changed state: NODE01[10.225.27.121:3306]: lost_master. [Master, Synced, Running] -> [Running]
2023-09-25 14:09:30.579   notice : (log_state_change): Server changed state: NODE02[10.225.27.156:3306]: lost_slave. [Slave, Synced, Running] -> [Running]
2023-09-25 14:09:30.579   notice : (log_state_change): Server changed state: NODE03[10.225.27.183:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2023-09-25 14:09:30.579   notice : (log_state_change): Server changed state: NODE05[10.225.27.142:3306]: lost_slave. [Slave, Synced, Running] -> [Running]
2023-09-25 14:09:30.579   notice : (log_state_change): Server changed state: NODER02[10.225.27.158:3306]: lost_slave. [Slave, Running] -> [Running]
2023-09-25 14:09:30.579   notice : (log_state_change): Server changed state: NODER03[10.225.27.172:3306]: lost_slave. [Slave, Running] -> [Running]
2023-09-25 14:09:30.579   notice : (log_state_change): Server changed state: NODER04[10.225.27.116:3306]: lost_slave. [Slave, Running] -> [Running]
2023-09-25 14:09:30.594   error  : (987213) [readwritesplit] (rwsplit-service); (open_connections): Couldn't find suitable Master from 5 candidates.



 Comments   
Comment by markus makela [ 2023-09-27 ]

I think one improvement would be for the monitor to store the last reason why a node lost the Synced status and report it in the state change messages.
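The suggestion above can be sketched roughly as follows. This is an illustrative Python sketch, not MaxScale code: the class and method names (`GaleraNode`, `update`, `state_change_message`) are hypothetical, and the set of variables captured is an assumption based on the common Galera status variables (`wsrep_local_state_comment`, `wsrep_cluster_status`, `wsrep_cluster_size`).

```python
# Hypothetical sketch: remember why a node last lost the Synced status,
# so later state-change messages can report the reason instead of only
# the bare [Synced, Running] -> [Running] transition seen in the logs.
class GaleraNode:
    def __init__(self, name):
        self.name = name
        self.synced = True
        self.last_sync_loss_reason = None

    def update(self, wsrep_status):
        """wsrep_status: dict of the node's current wsrep_* variables."""
        is_synced = wsrep_status.get("wsrep_local_state_comment") == "Synced"
        if self.synced and not is_synced:
            # Capture the variables that explain the loss of Synced status
            # at the moment it happens.
            self.last_sync_loss_reason = (
                f"state={wsrep_status.get('wsrep_local_state_comment')}, "
                f"cluster_status={wsrep_status.get('wsrep_cluster_status')}, "
                f"cluster_size={wsrep_status.get('wsrep_cluster_size')}"
            )
        self.synced = is_synced

    def state_change_message(self):
        if self.synced:
            return f"{self.name}: Synced"
        return f"{self.name}: not Synced ({self.last_sync_loss_reason})"


node = GaleraNode("NODE04")
node.update({"wsrep_local_state_comment": "Synced",
             "wsrep_cluster_status": "Primary",
             "wsrep_cluster_size": 5})
node.update({"wsrep_local_state_comment": "Initialized",
             "wsrep_cluster_status": "non-Primary",
             "wsrep_cluster_size": 1})
print(node.state_change_message())
```

With this, the `lost_slave` / `lost_master` notices could carry the stored reason, which would have made the outage above much easier to diagnose from the log alone.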

Comment by Rick Pizzi [ 2023-09-27 ]

Looking again at the logs, this happened twice in the same afternoon.
About two hours later, another two nodes were OOM-killed and again MaxScale completely lost track of the cluster status:

2023-09-25 16:59:32.606   notice : (log_state_change): Server changed state: NODE02[10.225.27.156:3306]: slave_down. [Slave, Synced, Running] -> [Down]
2023-09-25 17:03:36.030   notice : (log_state_change): Server changed state: NODE01[10.225.27.121:3306]: lost_master. [Master, Synced, Running] -> [Running]
2023-09-25 17:03:36.030   notice : (log_state_change): Server changed state: NODE03[10.225.27.183:3306]: lost_slave. [Slave, Synced, Running] -> [Running]
2023-09-25 17:03:36.030   notice : (log_state_change): Server changed state: NODE04[10.225.27.118:3306]: lost_slave. [Slave, Synced, Running] -> [Running]
2023-09-25 17:03:36.030   notice : (log_state_change): Server changed state: NODE05[10.225.27.142:3306]: new_master. [Slave, Synced, Running] -> [Master, Synced, Running]
2023-09-25 17:03:36.030   notice : (log_state_change): Server changed state: NODER02[10.225.27.158:3306]: lost_slave. [Slave, Running] -> [Running]
2023-09-25 17:03:36.030   notice : (log_state_change): Server changed state: NODER03[10.225.27.172:3306]: lost_slave. [Slave, Running] -> [Running]
2023-09-25 17:03:36.030   notice : (log_state_change): Server changed state: NODER04[10.225.27.116:3306]: lost_slave. [Slave, Running] -> [Running]
2023-09-25 17:03:38.962   error  : [galeramon] (post_tick): There are no cluster members
2023-09-25 17:03:38.962   notice : (log_state_change): Server changed state: NODE05[10.225.27.142:3306]: master_down. [Master, Synced, Running] -> [Down]

Comment by markus makela [ 2023-09-27 ]

This could be somehow related to how the cluster UUID is calculated (i.e. set_galera_cluster() and calculate_cluster()) and used to see whether the nodes are in the same cluster.
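To make the suspicion above concrete, here is a minimal Python sketch of a majority-based cluster calculation, loosely modelled on what `calculate_cluster()` is described as doing: group nodes by the cluster UUID they report and treat the UUID with the most members as the cluster. The function name mirrors the ticket; the data shapes and the exact selection rule are assumptions, not MaxScale's actual implementation.

```python
# Illustrative sketch: pick the majority cluster UUID among reachable
# nodes. Nodes that are unreachable (e.g. just OOM-killed) report no
# UUID and are not counted as cluster members. If the real comparison
# or counting logic misjudged which nodes share a UUID, the member set
# could come out empty even though healthy nodes remain, matching the
# "There are no cluster members" error in the logs.
from collections import Counter

def calculate_cluster(nodes):
    """nodes: dict of name -> wsrep_cluster_state_uuid (None if unreachable).
    Returns (majority_uuid, member_names), or (None, []) if no UUIDs seen."""
    uuids = Counter(u for u in nodes.values() if u is not None)
    if not uuids:
        return None, []
    majority_uuid, _ = uuids.most_common(1)[0]
    members = [name for name, u in nodes.items() if u == majority_uuid]
    return majority_uuid, members

# Scenario from the ticket: two of five nodes crash, but the remaining
# three still share one UUID and should still form a cluster.
nodes = {"NODE01": "abc-123", "NODE02": "abc-123", "NODE03": None,
         "NODE04": None, "NODE05": "abc-123"}
uuid, members = calculate_cluster(nodes)
print(uuid, sorted(members))
```

Under this model the three survivors should remain cluster members, so the observed total loss of status suggests the real UUID handling diverged from this expectation at some step.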

Comment by markus makela [ 2023-10-10 ]

The relevant Galera variables are now included in the log message, which explains why the Synced status was lost.

Generated at Thu Feb 08 04:31:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.