Details
Bug
Status: Closed
Major
Resolution: Fixed
1.1.4
None
Amazon EC2 with EBS storage
2018-11, 2018-12
Description
Issue reported by a customer. The analysis below shows that the system was left in a bad state because a detach failed, but the dbroot still got reassigned.
Ok, here is what I see. There was some network issue where pm5 wasn't responding to pings from pm1, so it went into failover state.
But pm5 was not down based on the logs and looked to be idle. So the best guess is a network issue between pm1 and pm5.
Because PM5 was still up, and some of the CS processes were still running and had access to the DBROOT, the detach failed.
The failover code assumes the module is down and that the EBS volume can be detached and reattached to pm1. So that is a BUG: the detach failed, but DBROOT 5
still got assigned to PM1. We will open a JIRA on that issue.
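The ordering problem described above can be sketched as follows. This is a minimal illustration, not the ColumnStore failover code: `move_dbroot`, `DetachError`, the `assignments` dict and the `detach`/`attach` callables are all hypothetical names. The point is only that the ownership record changes after both volume operations have succeeded.

```python
class DetachError(Exception):
    """Raised when the volume cannot be detached from its current owner."""


def move_dbroot(dbroot, src, dst, assignments, detach, attach):
    """Move `dbroot` from module `src` to module `dst`.

    `assignments` maps dbroot id -> owning module. `detach` and `attach`
    are hypothetical callables standing in for the volume operations;
    each returns True on success. They are not real ColumnStore or AWS APIs.
    """
    if assignments.get(dbroot) != src:
        raise ValueError(f"dbroot {dbroot} is not owned by {src}")

    # Detach must succeed first. On failure, stop here and leave the
    # assignment untouched so the recorded owner stays accurate.
    if not detach(dbroot, src):
        raise DetachError(f"detach of dbroot {dbroot} from {src} failed")

    # Attach to the new owner. On failure the assignment is still not
    # changed; the caller has to decide how to recover.
    if not attach(dbroot, dst):
        raise RuntimeError(f"attach of dbroot {dbroot} to {dst} failed")

    # Only now record the new owner.
    assignments[dbroot] = dst
```

With a `detach` that reports failure, as in the incident, `assignments` would still map dbroot 5 to pm5 after the call raises.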
I don't know if you have anything on your side to look at for network issues between pm1/pm5 on Apr 25 13:41:26.
PM1
LOGS SHOWING THAT PM5 WASN'T RESPONDING TO PINGS, WHICH INITIATES A MODULE DOWN AND FAILURE
info.log:Apr 25 13:41:26 mcs1-pm1 ProcessManager[62467]: 26.465309 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
info.log:Apr 25 13:41:26 mcs1-pm1 ProcessManager[62467]: 26.475801 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
info.log:Apr 25 13:41:43 mcs1-pm1 ProcessManager[62467]: 43.365051 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
info.log:Apr 25 13:41:43 mcs1-pm1 ProcessManager[62467]: 43.367980 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.310225 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313222 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313373 |0|0|0| C 17 CAL0000: module is down: pm5
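For triage, the excerpt above can be summarized by counting the module-level ping warnings that precede the "module is down" line. The sketch below is a hypothetical helper written for this ticket, not part of any ColumnStore tool. It matches only on the message text visible in these logs, and it does not imply that ProcessManager uses a fixed failure threshold.

```python
import re

# Patterns taken from the message text in the PM1 excerpt.
PING_FAIL = re.compile(r"module failed to respond to pings: (\w+)")
MODULE_DOWN = re.compile(r"module is down: (\w+)")


def ping_failures_before_down(lines, module):
    """Count ping-failure warnings for `module` seen before it is marked down.

    Returns the count, or None if no 'module is down' line is found.
    """
    count = 0
    for line in lines:
        fail = PING_FAIL.search(line)
        if fail and fail.group(1) == module:
            count += 1
            continue
        down = MODULE_DOWN.search(line)
        if down and down.group(1) == module:
            return count
    return None
```

Run over the seven PM1 lines above with `module="pm5"`, this counts the warnings at 13:41:26, 13:41:43 and 13:42:33.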
FAILURE TO DETACH DBROOT 5 FROM PM5
info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952210 |0|0|0| E 08 CAL0000: ERROR: amazonReattach, detachEC2Volume failed on vol-09f84a3ec4b5f7dbb
info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952312 |0|0|0| E 08 CAL0000: ERROR: amazonReattach api failure
info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952374 |0|0|0| E 08 CAL0000: ERROR: manualMovePmDbroot failure: pm1:5:pm5
PM5
THIS SAYS THAT PM5 WAS BASICALLY IDLE AS FAR AS THE CS LOGS SHOW, NO ACTIVE CPIMPORTS
info.log:Apr 25 13:24:55 mcs1-pm5 cpimport.bin[13602]: 55.066750 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-689411; status-SUCCESS
info.log:Apr 25 13:24:55 mcs1-pm5 writeengineserver[57886]: 55.075839 |0|0|0| I 32 CAL0000: 6607 : cpimport exit on success
THIS SHOWS THAT 3 SECONDS AFTER THE MODULE WAS MARKED DOWN FOR NOT RESPONDING TO PINGS (13:42:33), IT RECEIVED MSGS FROM PM1
info.log:Apr 25 13:42:36 mcs1-pm5 ProcessMonitor[101111]: 36.678186 |0|0|0| I 18 CAL0000: MSG RECEIVED: Re-Init process request on: cpimport
info.log:Apr 25 13:42:36 mcs1-pm5 ProcessMonitor[101111]: 36.925981 |0|0|0| I 18 CAL0000: PROCREINITPROCESS: completed, no ack to ProcMgr
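The timing claims can be checked directly from the log timestamps. The helper below is a hypothetical sketch. It assumes the `info.log:` prefix and syslog-style timestamp seen in these excerpts, an English locale for the month name, and that the clocks on pm1 and pm5 are in sync. The log lines carry no year, so `year` is an arbitrary assumption and does not affect the differences.

```python
from datetime import datetime


def log_time(line, year=2018):
    """Parse the timestamp that follows the 'info.log:' prefix of a log line."""
    stamp = line.split("info.log:", 1)[1][:15]  # e.g. 'Apr 25 13:42:33'
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def seconds_between(earlier_line, later_line):
    """Elapsed seconds from one log line to another."""
    return (log_time(later_line) - log_time(earlier_line)).total_seconds()
```

Applied to the "module is down: pm5" line (13:42:33) and the "MSG RECEIVED" line on pm5 (13:42:36), this gives 3 seconds. From the same down event to the detachEC2Volume failure (13:47:02) it gives 269 seconds.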