[MCOL-5701] Failover infinite loop after network failure. - Jira

Details

Type: Bug
Status: Stalled (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: cmapi
Labels: None

Description

The network card appears to have gone down for about 12 minutes. After the link came back, failover fell into an infinite loop, hitting an error each time it tried to activate nodes. A manual cluster restart fixed the issue.

Notes from allen.herrera

# last successful cpimport

Mar 21 04:44:16 atx-mdb101pl writeengine[1442415]: 16.223397 |0|0|0| I 19 CAL0008: Bulkload |Job: /mnt/local/aon/datadumper/loaderColumnXML/Job_5002.xml |For table LTE_ALL_SECTOR_HOURLY: 1 rows processed and 1 rows inserted.

Mar 21 04:44:16 atx-mdb101pl cpimport.bin[1442415]: 16.336107 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-5002; status-SUCCESS

# network card down? mount down?

Mar 21 06:22:38 atx-mdb101pl kernel: [qede_link_update:2608(ens3f4)]Link is down

Mar 21 06:22:50 atx-mdb101pl multipathd[2795]: checker failed path 8:112 in map third-party

# cmapi begins failing

Mar 21 06:23:12 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:23:12 [DEBUG] (urllib3.connectionpool) Starting new HTTPS connection (1): 10.224.140.30:8640

# network back?

Mar 21 06:34:09 atx-mdb101pl kernel: [qede_link_update:2601(ens3f0)]Link is up

# cmapi can connect again and begins failover

Mar 21 06:34:10 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:10 [DEBUG] (cmapi_server) 10.224.140.30 put_begin starts

Mar 21 06:34:10 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:10 [INFO] (/usr/share/columnstore/cmapi/cmapi_server/failover_agent.py) FA.deactivateNodes():  deactivating nodes: ['10.224.140.31', '10.224.140.32']

Mar 21 06:34:11 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:34:11 [CRITICAL] (node_monitor) Only 1 out of 3 nodes are active.  At least 2 are required.  Entering standby mode to protect the system.

Mar 21 06:34:13 atx-mdb101pl systemd[1]: mcs-primproc.service: Killing process 3435526 (PrimProc) with signal SIGKILL.

# network down again for 12 minutes

Mar 21 06:40:51 atx-mdb101pl kernel: [qede_link_update:2608(ens3f5)]Link is down

Mar 21 06:52:24 atx-mdb101pl kernel: [qede_link_update:2601(ens3f1)]Link is up

# cmapi reconnects again and tries cycling columnstore

Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [DEBUG] (cmapi_server) 10.224.140.30 put_begin starts

Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [DEBUG] (root) stop running sudo systemctl stop mcs-ddlproc

Mar 21 06:52:24 atx-mdb101pl python3[3950659]: 21/Mar/2024 06:52:24 [INFO] (root) _add_node_to_PMS(): node 10.224.140.32 already exists

# for about 2.5 hours cmapi keeps trying to stop processes, until manual intervention (likely at Mar 21 09:20)

# Shutdown

Mar 21 09:20:57 atx-mdb101pl python3[3950659]: 21/Mar/2024 09:20:57 [DEBUG] (cmapi_server) 10.224.140.31 put_shutdown start

# amdocs node 1 - still has issues

Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.127366 |0|0|0| I 34 CAL0086: Initiating BulkLoad: -f /mnt/local/aon/datadumper -L /var/log/mariadb/columnstore/cpimport/ -j 5021 -p /mnt/local/aon/datadumper/loaderColumnXML -P pm3-1758058 -T SYSTEM -u9fa3542a-182e-4f74-bc61-a1504475bfeb

Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.233160 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-5021; db-mariadb_actixone_owner

Mar 21 09:32:36 atx-mdb101pl controllernode[1758058]: 36.236110 |0|0|0| C 29 CAL0000: ExtentMap::getDbRootHWMInfo(): There are no DBRoots for OID 5203 and PM 3#012         %%10%%

Mar 21 09:32:36 atx-mdb101pl cpimport.bin[1758058]: 36.236265 |0|0|0| E 34 CAL0087: BulkLoad Error: Error in pre-processing the job file for table LTE_ALL_MESH_HOURLY

# Next shutdown again

Mar 21 10:49:28 atx-mdb101pl python3[1755592]: 21/Mar/2024 10:49:28 [DEBUG] (cmapi_server) 10.224.140.31 put_shutdown starts

# cpimport continues working

Mar 21 11:08:22 atx-mdb101pl cpimport.bin[1851124]: 22.044068 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-5401; status-SUCCESS
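The length of each link outage can be checked directly from the kernel lines quoted above. The following is a minimal sketch, not part of cmapi: the function name `link_outages` and the regular expression are illustrative assumptions, and the year is passed in explicitly because syslog timestamps omit it. Events are paired purely by time, since the interface names in the down and up messages differ in these logs.

```python
import re
from datetime import datetime

# Matches syslog kernel lines such as:
#   Mar 21 06:22:38 host kernel: [qede_link_update:2608(ens3f4)]Link is down
LINK_RE = re.compile(
    r"^(\w{3} +\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: .*Link is (up|down)"
)


def link_outages(lines, year=2024):
    """Return (down_time, up_time, duration) for each down/up pair.

    Hypothetical helper: pairs each 'Link is down' with the next
    'Link is up' by time only, ignoring the interface name.
    """
    outages = []
    down_at = None
    for line in lines:
        match = LINK_RE.match(line)
        if not match:
            continue
        stamp = datetime.strptime(
            f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S"
        )
        if match.group(2) == "down":
            if down_at is None:  # keep the first 'down' of an outage
                down_at = stamp
        elif down_at is not None:
            outages.append((down_at, stamp, stamp - down_at))
            down_at = None
    return outages


if __name__ == "__main__":
    sample = [
        "Mar 21 06:22:38 atx-mdb101pl kernel: [qede_link_update:2608(ens3f4)]Link is down",
        "Mar 21 06:34:09 atx-mdb101pl kernel: [qede_link_update:2601(ens3f0)]Link is up",
        "Mar 21 06:40:51 atx-mdb101pl kernel: [qede_link_update:2608(ens3f5)]Link is down",
        "Mar 21 06:52:24 atx-mdb101pl kernel: [qede_link_update:2601(ens3f1)]Link is up",
    ]
    for down, up, duration in link_outages(sample):
        print(down.time(), "->", up.time(), duration)
```

On the four kernel lines quoted in the notes, this gives two outages of 11 min 31 s and 11 min 33 s, consistent with the "12 minutes" in the description.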

Additional info here.

Investigation result from alan.mologorsky
Because the network was down, failover never deactivated the nodes, so the corresponding configuration fields in Columnstore.xml never changed.
After that, failover keeps hitting an error while trying to activate nodes that are already active.
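One way to picture the failure mode is that the activation step is not idempotent: it treats "node is already active in the configuration" as an error and retries. The sketch below illustrates an idempotent alternative. It is not the cmapi implementation and not a proposed patch; `ClusterConfig`, `plan_activation`, and `activate_nodes` are hypothetical names, and the in-memory set merely stands in for the node entries in Columnstore.xml.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterConfig:
    """Hypothetical stand-in for the active-node list in Columnstore.xml."""
    active_nodes: set = field(default_factory=set)


def plan_activation(config, reachable_nodes):
    """Return only the reachable nodes that are not yet marked active."""
    return sorted(set(reachable_nodes) - config.active_nodes)


def activate_nodes(config, reachable_nodes):
    """Activate missing nodes; already-active nodes are a no-op, not an error."""
    to_activate = plan_activation(config, reachable_nodes)
    for node in to_activate:
        config.active_nodes.add(node)
    return to_activate


if __name__ == "__main__":
    # The deactivation never reached the config, so all three nodes are
    # still listed as active when failover later tries to activate two of them.
    config = ClusterConfig({"10.224.140.30", "10.224.140.31", "10.224.140.32"})
    print(activate_nodes(config, ["10.224.140.31", "10.224.140.32"]))
```

With all three nodes still listed as active, the call returns an empty list instead of failing, so a retry loop built on it would have nothing left to do.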

People

Assignee: Alan Mologorsky

Reporter: Alan Mologorsky

Votes: 0

Watchers: 1

Dates

Created: 2024-03-28 16:40

Updated: 2024-05-12 10:25