Details
-
New Feature
-
Status: Confirmed
-
Minor
-
Resolution: Unresolved
-
23.02.10
-
None
Description
In a multi-node system, when a subprocess goes down or becomes unhealthy, CMAPI does not restart subprocesses, resolve locks, or report a meaningful state/status in mcs cluster status (other than fewer entries in the services section). Without more advanced monitoring or corrective actions, ColumnStore high availability is weak.
CMAPI only triggers fail-over when the peer CMAPI goes away (network cut, power off the peer etc.).
It will NOT, however, trigger fail-over when a ColumnStore process is missing or has crashed (think systemctl stop mariadb-columnstore).
Ideally, CMAPI should be able to conduct a fail-over, take corrective action, and flag cluster health in mcs cluster status whenever a peer node is unable to process queries for any reason (even if its local CMAPI is still up).
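As a rough illustration of the kind of lightweight check requested above, the sketch below polls the systemd units of the ColumnStore subprocesses and reports any that are not active. This is not how CMAPI behaves today. The helper names `unit_is_active` and `find_down_units` and the `MCS_UNITS` variable are hypothetical, and the default unit list contains only the two units named in this ticket, so a real check would need the full set.

```shell
#!/usr/bin/env bash
# Sketch only: a "lite" liveness check for ColumnStore subprocess units.
# Assumption: each subprocess runs as a systemd unit, as in the reproduction.

# Units named in this ticket; extend with the remaining units as needed.
MCS_UNITS="${MCS_UNITS:-mcs-primproc mcs-writeengineserver}"

# Returns 0 if the given unit is active. Hypothetical helper, kept separate
# so it can be swapped for a heavier check (for example a test query) later.
unit_is_active() {
    systemctl is-active --quiet "$1"
}

# Prints the name of every unit that is not active, one per line.
# Returns 0 if all units are active, 1 otherwise.
find_down_units() {
    local unit rc=0
    for unit in $MCS_UNITS; do
        if ! unit_is_active "$unit"; then
            echo "$unit"
            rc=1
        fi
    done
    return "$rc"
}
```

A caller could run `down=$(find_down_units) || echo "unhealthy units: $down"` on a timer and use a non-empty result as the trigger for a restart, an alert, or a fail-over decision.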
Reproduction
# on all 3 nodes
time bash cs_package_manager.sh install enterprise 10.6 --token xxxxx --nodes 172.31.55.48,172.31.50.232,172.31.51.201

# on primary
bash lots-of-inserts.sh

# on node 2
systemctl stop mcs-primproc
systemctl stop mcs-writeengineserver
tail -f /var/log/mariadb/columnstore/cmapi_server.log # notice that nothing is logged, even 10 minutes after the 2 subprocesses went down

# on primary
tail -f /var/log/mariadb/columnstore/debug.log # notice errors connecting to nodes
#Mar 7 17:32:05 ip-172-31-37-166 joblist[148514]: 05.879589 |0|0|0| D 05 CAL0000: Failed to get all PrimProc connections. Retry count 20 %%10%%
#Mar 7 17:42:05 ip-172-31-37-166 joblist[184034]: 05.453887 |0|0|0| W 05 CAL0000: /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 311 Could not connect to PMS1: Connection refused from PMS1 %%10%%

# Here, the cluster is in a bad state. As the cluster manager, cmapi should act to auto-resolve issues as best it can.
# cmapi could check cluster health, restart processes, clear locks, or coordinate a solution to get the system back on track. This likely involves using responses from the subprocesses or their logs to determine what is happening.

# on node 2
mcs cluster status # notice it still reports a 3 node cluster
# now trigger failover
systemctl stop mariadb-columnstore-cmapi

# idea
-- every subprocess has an OK check, or a lite vs heavy health check
-- cmapi heartbeats each subprocess with a lite health check

# Separate issue / who could help auto solve
# # After starting the subprocesses again, the cluster soon goes read-only and stays that way until a manual shutdown (mcs cluster stop)
# systemctl start mcs-primproc
# systemctl start mcs-writeengineserver
# #
# mcs cluster stop
# ps aux | grep mysql # notice defunct cpimports: mysql 214039 0.0 0.0 0 0 ? Z 17:47 0:00 [cpimport] <defunct>
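
The defunct cpimport entry in the last step could also be detected automatically. The following is a minimal sketch, assuming a procps-style `ps`; the helper names `filter_defunct` and `list_defunct_cpimport` are hypothetical and are not part of CMAPI or the mcs CLI.

```shell
# Sketch only: list defunct (zombie) processes by name from ps output.
# filter_defunct reads lines of "PID STAT COMMAND" on stdin, so it can be
# exercised without a live cluster.
filter_defunct() {
    # $1: process name to match, e.g. cpimport
    # A STAT value beginning with Z marks a zombie process.
    awk -v name="$1" '$2 ~ /^Z/ && $3 == name { print $1 }'
}

# Live usage: print the PIDs of defunct cpimport processes.
# The trailing "=" on each column suppresses the ps header line.
list_defunct_cpimport() {
    ps -eo pid=,stat=,comm= | filter_defunct cpimport
}
```

A zombie cannot be killed directly. It disappears only once its parent reaps it or the parent itself exits, so a cluster manager would mainly use this as a signal that the parent process needs attention.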