[MCOL-4540] Failover on multi-node systems triggered only on CMAPI absence - Jira

Details

Type: New Feature
Status: Confirmed
Priority: Minor
Resolution: Unresolved
Affects Version/s: 23.02.10
Fix Version/s: Icebox
Component/s: cmapi
Labels: None
Epic Link: High Availability
Sprint: 2026-3

Description

In a multi-node system, when a ColumnStore subprocess goes down or becomes unhealthy, CMAPI does not restart it, resolve locks, or report a meaningful state in mcs cluster status (the only visible change is fewer entries in the services section). Without more advanced monitoring and corrective actions, ColumnStore high availability is weak.

CMAPI only triggers failover when the peer CMAPI goes away (network cut, peer powered off, etc.).
It will NOT, however, trigger failover when a ColumnStore process is missing or has crashed (think systemctl stop mariadb-columnstore).
Ideally, CMAPI should be able to conduct a failover, take corrective action, and report cluster health in mcs cluster status whenever the peer node is unable to process queries for whatever reason (even if its local CMAPI is still up).
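
The gap between the current and the requested trigger can be sketched as two small predicates. This is only an illustration of the rule described above: the function names and the "up"/"down" arguments are hypothetical and do not come from CMAPI's code.

```shell
#!/bin/sh
# Hypothetical decision rules; these are NOT CMAPI's actual code.
# $1: state of the peer's CMAPI ("up" or "down")
# $2: state of the peer's ColumnStore processes ("up" or "down")

# Behaviour described in this ticket: only CMAPI absence counts.
current_rule() {
    [ "$1" = "down" ]
}

# Behaviour requested here: either signal is enough to fail over.
proposed_rule() {
    [ "$1" = "down" ] || [ "$2" = "down" ]
}

# Print whether each rule would fail over for a given pair of states.
describe() {
    if current_rule "$1" "$2"; then cur=yes; else cur=no; fi
    if proposed_rule "$1" "$2"; then new=yes; else new=no; fi
    echo "cmapi=$1 columnstore=$2 current=$cur proposed=$new"
}
```

Calling `describe up down` covers the case from the reproduction (CMAPI alive, ColumnStore processes stopped): the current rule does not fail over, the proposed one does.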

Reproduction

# on all 3 nodes

time bash cs_package_manager.sh install enterprise 10.6 --token xxxxx --nodes 172.31.55.48,172.31.50.232,172.31.51.201

# on primary

bash lots-of-inserts.sh

# node 2

systemctl stop mcs-primproc

systemctl stop mcs-writeengineserver

tail -f /var/log/mariadb/columnstore/cmapi_server.log  # notice nothing is logged, even 10 minutes after 2 subprocesses went down

# primary

tail -f /var/log/mariadb/columnstore/debug.log # notice errors connecting to nodes

#Mar  7 17:32:05 ip-172-31-37-166 joblist[148514]: 05.879589 |0|0|0| D 05 CAL0000: Failed to get all PrimProc connections. Retry count 20        %%10%%

#Mar  7 17:42:05 ip-172-31-37-166 joblist[184034]: 05.453887 |0|0|0| W 05 CAL0000: /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 311 Could not connect to PMS1: Connection refused from PMS1      %%10%%

# Here, the cluster is in a bad state. As the cluster manager, cmapi should act to auto-resolve issues as best it can.

# cmapi could check cluster health, restart processes, or clear locks, and coordinate a solution to get the system back on track. This likely involves using responses from the subprocesses or their logs to determine what is happening.

# node 2

mcs cluster status # notice it still reports a 3 node cluster

# now trigger failover

systemctl stop mariadb-columnstore-cmapi

# idea

-- every subprocess should expose an OK check, or a lite vs. heavy health check

-- cmapi should heartbeat each subprocess using the lite health check
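
A minimal sketch of what a lite health check could look like, assuming the subprocesses are probed through their systemd units. The unit names are the ones stopped in the reproduction; the function names and the `SERVICES` and `CHECK_CMD` variables are hypothetical, and nothing here reflects how CMAPI is actually implemented.

```shell
#!/bin/sh
# Hypothetical lite health check; not part of CMAPI.
# SERVICES:  units to probe (names taken from the reproduction steps).
# CHECK_CMD: command that exits 0 when a unit is healthy (overridable for testing).
SERVICES="${SERVICES:-mcs-primproc mcs-writeengineserver}"
CHECK_CMD="${CHECK_CMD:-systemctl is-active --quiet}"

# Print the name of every unit that fails the check, one per line.
unhealthy_services() {
    for svc in $SERVICES; do
        if ! $CHECK_CMD "$svc"; then
            echo "$svc"
        fi
    done
}

# Return 0 when all units pass; otherwise report the failures on stderr and return 1.
lite_health_check() {
    failed="$(unhealthy_services)"
    if [ -n "$failed" ]; then
        echo "unhealthy: $(echo "$failed" | tr '\n' ' ')" >&2
        return 1
    fi
    return 0
}
```

A heartbeat would call `lite_health_check` on a short interval and escalate (restart, failover, status change) on a non-zero return. A heavy check, such as running a test query, would be a separate and less frequent probe.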

# Separate issue / who could help auto solve

# # After the processes are started again, the cluster soon goes read-only and stays that way until a manual shutdown (mcs cluster stop)

# systemctl start mcs-primproc

# systemctl start mcs-writeengineserver

# #

# mcs cluster stop

# ps aux | grep mysql # notice defunct cpimports: mysql     214039  0.0  0.0      0     0 ?        Z    17:47   0:00 [cpimport] <defunct>
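
The defunct cpimport check above can be made less manual with a small filter over `ps` output. This is a sketch only: the function name `find_defunct` is hypothetical, and it assumes the three-column `PID STAT COMMAND` layout produced by the `ps` invocation shown in the comment.

```shell
#!/bin/sh
# Hypothetical helper; not part of CMAPI or ColumnStore.
# Reads "PID STAT COMMAND" lines on stdin and prints the PID of every
# zombie (STAT starting with Z) whose command name equals $1.
find_defunct() {
    awk -v name="$1" '$2 ~ /^Z/ && $3 == name { print $1 }'
}

# Example invocation against the live process table:
#   ps -e -o pid= -o stat= -o comm= | find_defunct cpimport
```

Zombie entries cannot be killed directly; they disappear once the parent process reaps them or exits, so the PIDs are mainly useful for reporting.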

Attachments

lots-of-inserts.sh
2 kB
2025-03-07 18:49

Issue Links

is duplicated by

MCOL-6092 If CMAPI is manually stopped optionaly take the node with stopped CMAPI from failover

Closed

relates to

MCOL-6194 Cmapi user configurable failover no response wait time

Closed

People

Assignee: Unassigned

Reporter: Assen Totin (Inactive)

Votes: 1

Watchers: 4

Dates

Created: 2021-02-15 13:41

Updated: 2026-01-13 14:16
