  1. MariaDB ColumnStore
  2. MCOL-5701

Failover infinite loop after network failure.

Details

    • Type: Bug
    • Status: Stalled
    • Priority: Major
    • Resolution: Unresolved
    • Component/s: cmapi

    Description

      The network card appears to have gone down for about 12 minutes. After that, failover fell into an infinite loop, failing with an error while trying to activate nodes after the failure. A manual cluster restart fixed the issue.

      Notes from allen.herrera

      # last successful cpimport
      Mar 21 04:44:16 atx-mdb101pl writeengine[1442415]: 16.223397 |0|0|0| I 19 CAL0008: Bulkload |Job: /mnt/local/aon/datadumper/loaderColumnXML/Job_5002.xml |For table LTE_ALL_SECTOR_HOURLY: 1 rows processed and 1 rows inserted.
      Mar 21 04:44:16 atx-mdb101pl cpimport.bin[1442415]: 16.336107 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-5002; status-SUCCESS
      # network card down? mount down?
      Mar 21 06:22:38 atx-mdb101pl kernel: [qede_link_update:2608(ens3f4)]Link is down
      Mar 21 06:22:50 atx-mdb101pl multipathd[2795]: checker failed path 8:112 in map third-party
      # cmapi begins failing 
      Mar 21 06:23:12 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:23:12 [DEBUG] (urllib3.connectionpool) Starting new HTTPS connection (1): 10.224.140.30:8640
      # network back?
      Mar 21 06:34:09 atx-mdb101pl kernel: [qede_link_update:2601(ens3f0)]Link is up
      # cmapi can connect again and begins failover
      Mar 21 06:34:10 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:10 [DEBUG] (cmapi_server) 10.224.140.30 put_begin starts
      Mar 21 06:34:10 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:10 [INFO] (/usr/share/columnstore/cmapi/cmapi_server/failover_agent.py) FA.deactivateNodes():  deactivating nodes: ['10.224.140.31', '10.224.140.32']
      Mar 21 06:34:11 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:11 [CRITICAL] (node_monitor) Only 1 out of 3 nodes are active.  At least 2 are required.  Entering standby mode to protect the system.
      Mar 21 06:34:13 atx-mdb101pl systemd[1]: mcs-primproc.service: Killing process 3435526 (PrimProc) with signal SIGKILL.
      # network down again for 12 minutes
      Mar 21 06:40:51 atx-mdb101pl kernel: [qede_link_update:2608(ens3f5)]Link is down
      Mar 21 06:52:24 atx-mdb101pl kernel: [qede_link_update:2601(ens3f1)]Link is up
      # cmapi reconnects again and tries cycling columnstore
      Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [DEBUG] (cmapi_server) 10.224.140.30 put_begin starts
      Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [DEBUG] (root) stop running sudo systemctl stop mcs-ddlproc
      Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [INFO] (root) _add_node_to_PMS(): node 10.224.140.32 already exists
      # for ~2.5 hours cmapi keeps trying to stop processes, until manual intervention (likely at Mar 21 09:20)
      # Shutdown 
      Mar 21 09:20:57 atx-mdb101pl python3[3950659]: 21/Mar/2024 09:20:57 [DEBUG] (cmapi_server) 10.224.140.31 put_shutdown start
      # amdocs node 1 - still issues
      Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.127366 |0|0|0| I 34 CAL0086: Initiating BulkLoad: -f /mnt/local/aon/datadumper -L /var/log/mariadb/columnstore/cpimport/ -j 5021 -p /mnt/local/aon/datadumper/loaderColumnXML -P pm3-1758058 -T SYSTEM -u9fa3542a-182e-4f74-bc61-a1504475bfeb
      Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.233160 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-5021; db-mariadb_actixone_owner
      Mar 21 09:32:36 atx-mdb101pl controllernode[1758058]: 36.236110 |0|0|0| C 29 CAL0000: ExtentMap::getDbRootHWMInfo(): There are no DBRoots for OID 5203 and PM 3#012         %%10%%
      Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.236265 |0|0|0| E 34 CAL0087: BulkLoad Error: Error in pre-processing the job file for table LTE_ALL_MESH_HOURLY
      # Next shutdown again
      Mar 21 10:49:28 atx-mdb101pl python3[1755592]: 21/Mar/2024 10:49:28 [DEBUG] (cmapi_server) 10.224.140.31 put_shutdown starts
      # cpimport continues working
      Mar 21 11:08:22 atx-mdb101pl cpimport.bin[1851124]: 22.044068 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-5401; status-SUCCESS
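
The outage lengths can be checked directly against the kernel link-state lines in the excerpt above. The following is a minimal sketch, not part of cmapi: the helper name `link_outages` is hypothetical, the year is assumed to be 2024 (syslog timestamps omit it), and a "down" event is paired with the next "up" event regardless of interface name, which mirrors how the notes read the log.

```python
import re
from datetime import datetime

# Kernel link-state lines copied verbatim from the excerpt above.
LOG_LINES = [
    "Mar 21 06:22:38 atx-mdb101pl kernel: [qede_link_update:2608(ens3f4)]Link is down",
    "Mar 21 06:34:09 atx-mdb101pl kernel: [qede_link_update:2601(ens3f0)]Link is up",
    "Mar 21 06:40:51 atx-mdb101pl kernel: [qede_link_update:2608(ens3f5)]Link is down",
    "Mar 21 06:52:24 atx-mdb101pl kernel: [qede_link_update:2601(ens3f1)]Link is up",
]

# Syslog timestamp, host name, then a kernel message ending in the link state.
LINK_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: .*Link is (up|down)$")


def link_outages(lines, year=2024):
    """Return (down_time, up_time, duration) for each down/up pair.

    Assumption: interface names are ignored, so a "down" is closed by
    the next "up" seen on any interface.
    """
    outages = []
    down_since = None
    for line in lines:
        match = LINK_RE.match(line)
        if not match:
            continue  # not a link-state line
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        state = match.group(2)
        if state == "down" and down_since is None:
            down_since = stamp
        elif state == "up" and down_since is not None:
            outages.append((down_since, stamp, stamp - down_since))
            down_since = None
    return outages


if __name__ == "__main__":
    for down, up, duration in link_outages(LOG_LINES):
        print(f"{down:%H:%M:%S} -> {up:%H:%M:%S}  {duration}")
```

For the four lines above this yields two outages of roughly eleven and a half minutes each, consistent with the "about 12 minutes" in the description.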
      

      Additional info here.

      Investigation result from alan.mologorsky
      Because the network was down, failover never deactivated the nodes, so the corresponding configuration fields in Columnstore.xml were never changed.
      After that, failover fails with an error while trying to activate nodes that are already active.
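
The failure mode described above can be illustrated with a small sketch. This is not cmapi code: `plan_activation` and its arguments are hypothetical names, and the idea of skipping nodes that are already configured is only one possible guard, shown here as an assumption rather than the project's actual fix. The node addresses are taken from the log excerpt.

```python
def plan_activation(configured_nodes, nodes_to_activate):
    """Split an activation request into nodes to add and nodes to skip.

    configured_nodes: nodes currently listed in the cluster configuration
                      (hypothetical stand-in for what Columnstore.xml holds).
    nodes_to_activate: nodes that failover wants to bring back.
    """
    configured = set(configured_nodes)
    to_add, already_active = [], []
    for node in nodes_to_activate:
        if node in configured:
            # Deactivation never happened, so the node is still configured.
            already_active.append(node)
        else:
            to_add.append(node)
            configured.add(node)  # guard against duplicates in the request
    return to_add, already_active


if __name__ == "__main__":
    # The configuration was never changed, so all three nodes are still listed.
    configured = ["10.224.140.30", "10.224.140.31", "10.224.140.32"]
    requested = ["10.224.140.31", "10.224.140.32"]
    to_add, skipped = plan_activation(configured, requested)
    print("to add:", to_add)
    print("already active:", skipped)
```

With the state from this incident, nothing is left to add and both requested nodes land in the skip list, which is the situation in which the activation attempt errored and was retried.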

          People

            alan.mologorsky Alan Mologorsky