  1. MariaDB ColumnStore
  2. MCOL-4540

Failover on multi-node systems triggered only on CMAPI absence

Details

    • Type: New Feature
    • Status: Confirmed
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 23.02.10
    • Fix Version/s: Icebox
    • Component/s: cmapi
    • Labels: None

    Description

      In a multi-node system, when a subprocess goes down or becomes unhealthy, CMAPI does not restart it, resolve locks, or report a meaningful state in mcs cluster status (the only visible change is fewer entries in the services section). Without more advanced monitoring and corrective actions, ColumnStore high availability is weak.

      CMAPI only triggers failover when the peer CMAPI goes away (network cut, peer powered off, etc.).
      It will NOT, however, trigger it when a ColumnStore process is missing or has crashed (think systemctl stop mariadb-columnstore).
      Ideally, CMAPI should be able to conduct a failover, take corrective action, and report cluster health in mcs cluster status whenever the peer node is unable to process queries for whatever reason (even if its local CMAPI is still up).
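
The lite health check this ticket asks for could be sketched roughly as below. This is an illustration only, not CMAPI code: it simply asks systemd whether the two units stopped in the reproduction are active. The names UNITS, CHECK_CMD, unhealthy_units and lite_health_check are hypothetical and exist only in this sketch.

```shell
# Units stopped in the reproduction; extend with the node's other mcs-* units as needed.
UNITS="mcs-primproc mcs-writeengineserver"

# CHECK_CMD is a hypothetical hook so the probe can be swapped out (e.g. for a dry run).
# By default it asks systemd whether the unit is active.
: "${CHECK_CMD:=systemctl is-active --quiet}"

# Print every unit whose check fails, one per line.
unhealthy_units() {
    for unit in "$@"; do
        # CHECK_CMD is intentionally unquoted so the default splits into words.
        if ! $CHECK_CMD "$unit" >/dev/null 2>&1; then
            echo "$unit"
        fi
    done
}

# Print "OK" and return 0 when all units pass; otherwise list the failures and return 1.
lite_health_check() {
    down=$(unhealthy_units "$@" | tr '\n' ' ')
    if [ -n "$down" ]; then
        echo "UNHEALTHY: ${down% }"
        return 1
    fi
    echo "OK"
}
```

On a node this would be run as `lite_health_check $UNITS`. A cluster manager could call it on a timer and treat a non-zero exit as a reason to restart the unit or start a failover. An active unit is not proof that the process can serve queries, so a heavier check would still be needed for that.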

      Reproduction

      # on all 3 nodes
      time bash cs_package_manager.sh install enterprise 10.6 --token xxxxx --nodes 172.31.55.48,172.31.50.232,172.31.51.201
       
      # on primary
      bash lots-of-inserts.sh
       
      # node 2
      systemctl stop mcs-primproc
      systemctl stop mcs-writeengineserver
      tail -f /var/log/mariadb/columnstore/cmapi_server.log  # nothing is logged even though 2 subprocesses are down, even after 10 minutes
       
      # primary
      tail -f /var/log/mariadb/columnstore/debug.log # notice errors connecting to nodes
      #Mar  7 17:32:05 ip-172-31-37-166 joblist[148514]: 05.879589 |0|0|0| D 05 CAL0000: Failed to get all PrimProc connections. Retry count 20        %%10%%
      #Mar  7 17:42:05 ip-172-31-37-166 joblist[184034]: 05.453887 |0|0|0| W 05 CAL0000: /home/jenkins/workspace/Build-Package/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/storage/columnstore/columnstore/dbcon/joblist/distributedenginecomm.cpp @ 311 Could not connect to PMS1: Connection refused from PMS1      %%10%%
       
      # Here, the cluster is in a bad state - CMAPI, as the cluster manager, should try to resolve issues automatically as best it can
      # CMAPI could check cluster health, restart processes, possibly clear locks, and coordinate a recovery; this likely involves reading responses from the subprocesses or their logs to determine what is happening.
       
      # node 2
      mcs cluster status # still reports a 3-node cluster
      # now trigger failover
      systemctl stop mariadb-columnstore-cmapi
       
      # idea
      # -- every subprocess exposes an OK check, or a lite and a heavy health check
      # -- CMAPI heartbeats each subprocess using the lite health check
       
      # Separate issue that automatic recovery could also help with
      # # After the subprocesses are started again, the cluster soon goes read-only and stays that way until a manual shutdown (mcs cluster stop)
      # systemctl start mcs-primproc
      # systemctl start mcs-writeengineserver
       
      # # 
      # mcs cluster stop
      # ps aux | grep mysql # notice defunct cpimports: mysql     214039  0.0  0.0      0     0 ?        Z    17:47   0:00 [cpimport] <defunct>
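
The defunct cpimport processes noted at the end of the reproduction could be detected with a small filter like the one below. It is a sketch rather than part of ColumnStore, and the function name defunct_pids is hypothetical. It expects rows of PID, process state and command name on stdin and prints the PIDs whose state starts with Z (zombie).

```shell
# Print the PIDs of defunct (zombie) processes with the given command name.
# Input rows on stdin: PID, state, command name (extra columns are ignored).
defunct_pids() {
    awk -v name="$1" '$2 ~ /^Z/ && $3 == name { print $1 }'
}
```

With procps this can be fed from `ps -e -o pid= -o stat= -o comm= | defunct_pids cpimport`. A zombie cannot be killed directly; it disappears only once its parent process reaps it or exits, so a monitor would report these PIDs rather than signal them.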
      

          People

            Assignee: Unassigned
            Reporter: Assen Totin (assen.totin, inactive)
            Votes: 1
            Watchers: 4
