Details
Type: Bug
Status: Stalled
Priority: Major
Resolution: Unresolved
Description
The network card appears to have gone down for about 12 minutes. After that, failover fell into an infinite loop, failing with an error each time it tried to activate the nodes after the failure. A manual cluster restart fixed the issue.
Notes from allen.herrera
# last successful cpimport
|
Mar 21 04:44:16 atx-mdb101pl writeengine[1442415]: 16.223397 |0|0|0| I 19 CAL0008: Bulkload |Job: /mnt/local/aon/datadumper/loaderColumnXML/Job_5002.xml |For table LTE_ALL_SECTOR_HOURLY: 1 rows processed and 1 rows inserted.
|
Mar 21 04:44:16 atx-mdb101pl cpimport.bin[1442415]: 16.336107 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-5002; status-SUCCESS
|
# network card down? mount down?
|
Mar 21 06:22:38 atx-mdb101pl kernel: [qede_link_update:2608(ens3f4)]Link is down
|
Mar 21 06:22:50 atx-mdb101pl multipathd[2795]: checker failed path 8:112 in map third-party
|
# cmapi begins failing
|
Mar 21 06:23:12 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:23:12 [DEBUG] (urllib3.connectionpool) Starting new HTTPS connection (1): 10.224.140.30:8640
|
# network back?
|
Mar 21 06:34:09 atx-mdb101pl kernel: [qede_link_update:2601(ens3f0)]Link is up
|
# cmapi can connect again and begins failover
|
Mar 21 06:34:10 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:10 [DEBUG] (cmapi_server) 10.224.140.30 put_begin starts
|
Mar 21 06:34:10 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:10 [INFO] (/usr/share/columnstore/cmapi/cmapi_server/failover_agent.py) FA.deactivateNodes(): deactivating nodes: ['10.224.140.31', '10.224.140.32']
|
Mar 21 06:34:11 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:11 [CRITICAL] (node_monitor) Only 1 out of 3 nodes are active. At least 2 are required. Entering standby mode to protect the system.
|
Mar 21 06:34:13 atx-mdb101pl systemd[1]: mcs-primproc.service: Killing process 3435526 (PrimProc) with signal SIGKILL.
|
# network down again for 12 minutes
|
Mar 21 06:40:51 atx-mdb101pl kernel: [qede_link_update:2608(ens3f5)]Link is down
|
Mar 21 06:52:24 atx-mdb101pl kernel: [qede_link_update:2601(ens3f1)]Link is up
|
# cmapi reconnects again and tries cycling columnstore
|
Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [DEBUG] (cmapi_server) 10.224.140.30 put_begin starts
|
Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [DEBUG] (root) stop running sudo systemctl stop mcs-ddlproc
|
Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [INFO] (root) _add_node_to_PMS(): node 10.224.140.32 already exists
|
# for about 2.5 hours cmapi keeps trying to stop processes, until manual intervention, likely at Mar 21 09:20
|
# Shutdown
|
Mar 21 09:20:57 atx-mdb101pl python3[3950659]: 21/Mar/2024 09:20:57 [DEBUG] (cmapi_server) 10.224.140.31 put_shutdown start
|
# amdocs node 1 - still issues
|
Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.127366 |0|0|0| I 34 CAL0086: Initiating BulkLoad: -f /mnt/local/aon/datadumper -L /var/log/mariadb/columnstore/cpimport/ -j 5021 -p /mnt/local/aon/datadumper/loaderColumnXML -P pm3-1758058 -T SYSTEM -u9fa3542a-182e-4f74-bc61-a1504475bfeb
|
Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.233160 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-5021; db-mariadb_actixone_owner
|
Mar 21 09:32:36 atx-mdb101pl controllernode[1758058]: 36.236110 |0|0|0| C 29 CAL0000: ExtentMap::getDbRootHWMInfo(): There are no DBRoots for OID 5203 and PM 3#012 %%10%%
|
Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.236265 |0|0|0| E 34 CAL0087: BulkLoad Error: Error in pre-processing the job file for table LTE_ALL_MESH_HOURLY
|
# Next shutdown again
|
Mar 21 10:49:28 atx-mdb101pl python3[1755592]: 21/Mar/2024 10:49:28 [DEBUG] (cmapi_server) 10.224.140.31 put_shutdown starts
|
# cpimport continues working
|
Mar 21 11:08:22 atx-mdb101pl cpimport.bin[1851124]: 22.044068 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-5401; status-SUCCESS
|
Additional info here.
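The outage lengths quoted in the notes can be checked directly from the kernel link messages. The following is a minimal sketch, not part of cmapi: the function name `link_outages` and the regular expression are illustrative and were written against only the four `qede_link_update` lines quoted above. Two assumptions are baked in: the year (syslog timestamps omit it; 2024 is taken from the cmapi timestamps), and that a "down" message is paired with the next "up" message regardless of interface, because the quoted down/up pairs name different interfaces (ens3f4/ens3f0 and ens3f5/ens3f1).

```python
import re
from datetime import datetime

# Matches the kernel link messages quoted in the notes (pattern is an assumption
# derived from those four lines only).
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"\[qede_link_update:\d+\((?P<iface>\w+)\)\]Link is (?P<state>up|down)$"
)

SAMPLE = [
    "Mar 21 06:22:38 atx-mdb101pl kernel: [qede_link_update:2608(ens3f4)]Link is down",
    "Mar 21 06:34:09 atx-mdb101pl kernel: [qede_link_update:2601(ens3f0)]Link is up",
    "Mar 21 06:40:51 atx-mdb101pl kernel: [qede_link_update:2608(ens3f5)]Link is down",
    "Mar 21 06:52:24 atx-mdb101pl kernel: [qede_link_update:2601(ens3f1)]Link is up",
]


def link_outages(lines, year=2024):
    """Return (down_time, up_time, duration) tuples from kernel link messages.

    A "down" event opens an outage and the next "up" event closes it.
    Interface names are deliberately ignored (see the assumption above).
    """
    outages = []
    down_since = None
    for line in lines:
        m = LINK_RE.match(line.strip())
        if not m:
            continue  # not a link message
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        if m["state"] == "down":
            if down_since is None:  # keep the first "down" of an outage
                down_since = ts
        elif down_since is not None:
            outages.append((down_since, ts, ts - down_since))
            down_since = None
    return outages


for down, up, duration in link_outages(SAMPLE):
    print(f"{down:%H:%M:%S} -> {up:%H:%M:%S} ({duration})")
# 06:22:38 -> 06:34:09 (0:11:31)
# 06:40:51 -> 06:52:24 (0:11:33)
```

On the quoted lines this gives two outages of roughly eleven and a half minutes each, consistent with the "12 minutes" in the description.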
Investigation result from alan.mologorsky
Because the network was down, failover never deactivated the nodes, so the corresponding configuration fields in Columnstore.xml were never changed.
After that, failover keeps failing with an error while trying to activate nodes that are already active.
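One way to picture the failure mode is an activation step that is not idempotent: it fails when asked to activate a node the configuration already lists as active. The sketch below is purely illustrative and is not the code in failover_agent.py, nor necessarily the fix chosen for this issue. The function `plan_activation` and its arguments are hypothetical; the configuration is modelled as a plain list of node addresses rather than the real Columnstore.xml structure.

```python
def plan_activation(recovered_nodes, active_nodes):
    """Split recovered nodes into those that need activation and those to skip.

    recovered_nodes: nodes that became reachable again (hypothetical input).
    active_nodes: nodes the configuration already lists as active (hypothetical input).
    """
    active = set(active_nodes)
    to_activate = [n for n in recovered_nodes if n not in active]
    already_active = [n for n in recovered_nodes if n in active]
    return to_activate, already_active


# Situation described in the investigation: the deactivation never reached the
# configuration, so both recovered nodes are still listed as active.
recovered = ["10.224.140.31", "10.224.140.32"]
configured_active = ["10.224.140.30", "10.224.140.31", "10.224.140.32"]

to_activate, skipped = plan_activation(recovered, configured_active)
print(to_activate)  # []
print(skipped)      # ['10.224.140.31', '10.224.140.32']
```

Under these assumptions, skipping the nodes that are already active turns the repeated activation attempt into a no-op instead of an error that is retried indefinitely.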