Details
- Type: Task
- Status: Closed
- Priority: Major
- Resolution: Won't Do
- Affects Version/s: 1.1.4
- Fix Version/s: None
- Environment: Linux
Description
We have a multi-server ColumnStore system (1 UM, 3 PMs). When we configured the system for the first time, everything seemed fine, but after some time PM1 (our parentOAM) failed: its instance stopped. As per failover handling, dbroot 1 was transferred to PM2 and PM2 was set as the parentOAM, so PM2 then had two dbroots attached (1 and 2). Up to this point the system was working fine and queries were running on UM1.
When we noticed PM1 was stopped, we started the instance from the AWS console, and after that everything got messy. It automatically attached dbroot 2 to PM1. We tried multiple times to shut the system down and start it again, but it kept failing on UM1, as shown in the attached "UM1_system_start_report.txt":
Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.515921 |0|0|0| E 18 CAL0000: Error: getDBRMdata failed
Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.516521 |0|0|0| I 18 CAL0000: STARTALL: ACK back to ProcMgr, return status = 8
Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.516577 |0|0|0| D 18 CAL0000: Send SET Alarm ID 27 on device DBRM
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.516996 |0|0|0| I 18 CAL0000: MSG RECEIVED: Start All process request...
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543707 |0|0|0| D 18 CAL0000: checkSpecialProcessState status return : 2
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543761 |0|0|0| D 18 CAL0000: STARTING Process: DBRMWorkerNode
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543784 |0|0|0| D 18 CAL0000: Process location: /usr/local/mariadb/columnstore/bin/workernode
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.545888 |0|0|0| D 18 CAL0000: getLocalDBRMID Worker Node ID = 2
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.547583 |0|0|0| D 18 CAL0000: getDBRMdata called
Jun 12 06:26:07 ip-172-31-7-171 messagequeue[7787]: 07.556991 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed
It just keeps looping on this error again and again. When I checked port 8604 I found the entries below, but they were all removed when I shut the system down.
netstat -a | grep 8604
tcp 0 0 ip-172-31-7-5.ec2:47186 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46932 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47208 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46768 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47022 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46758 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47232 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47178 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47118 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47244 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47032 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47266 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46842 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46852 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47108 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47196 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47220 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46922 ip-172-31-7-211.ec:8604 TIME_WAIT
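For reference, a quick way to summarize output like the above by connection state, instead of scanning it line by line (a minimal sketch, assuming `netstat`, `awk`, `sort`, and `uniq` are available; the state is the last field of each `netstat` line):

```shell
# Count connections to the DBRM controller port (8604) grouped by TCP state.
# A large pile of TIME_WAIT entries like the ones above indicates the client
# keeps opening and closing connections in a retry loop.
netstat -an | grep ':8604' | awk '{print $NF}' | sort | uniq -c | sort -rn
```

On the system above this would show a single `TIME_WAIT` bucket with the count of retried connections.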
I tried various solutions, such as disabling PM1 again and re-assigning dbrootId 2 to PM2, but nothing worked. So finally I tried to reconfigure the system using the command below, but that did not work either. I have attached the report files generated by /usr/local/mariadb/columnstore/bin/columnstoreSupport.
Please provide a solution for this error. I also need proper steps to recover the system when any PM fails and its dbroot is transferred to another PM (please cover both cases: 1) the failed PM is the parentOAM; 2) the failed PM is not the parentOAM).
Attachments
Issue Links
- includes: MCOL-1572 "About our ParentOAM failure handling issue with AmazonAMI" (Closed)