[MCOL-1370] Network error incorrectly handled: Amazon DBRoot detach failed, but the dbroot was still reassigned Created: 2018-04-26  Updated: 2023-10-26  Resolved: 2018-06-13

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.1.4
Fix Version/s: 1.1.5

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Amazon EC2 with EBS storage


Sprint: 2018-11, 2018-12

 Description   

Issue reported by a customer. The analysis below shows that the system was left in a bad state: a detach failed, but the dbroot still got reassigned.

Ok, here is what I see. There was some network issue where pm5 wasn't responding to pings from pm1, so it went into failover state.
But pm5 was not down based on the logs, and it looked to be idle. So a network issue between pm1 and pm5 is the best guess.

But since pm5 was still up, and some of the ColumnStore processes were still running and had access to the DBRoot, the detach failed.
The failover code assumes the module is down and that the EBS volume can be detached and reattached to pm1. So that is a BUG: the detach failed, but DBRoot 5
still got assigned to pm1. We will open a JIRA on that issue.
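The failure mode described above can be sketched as follows. This is an illustrative assumption about the control flow, not the actual engine code; `detach_ec2_volume` and `assignments` are hypothetical names:

```python
# Illustrative sketch (not engine code) of the reported failure mode:
# the failover path ignores the result of the EBS detach and reassigns
# the DBRoot to pm1 unconditionally.

def failover_buggy(detach_ec2_volume, dbroot: int, assignments: dict) -> None:
    detach_ec2_volume(dbroot)      # return value ignored -- this is the bug
    assignments[dbroot] = "pm1"    # DBRoot reassigned even if the detach failed

# pm5 still holds the volume mounted, so the detach fails (returns False)...
assignments = {5: "pm5"}
failover_buggy(lambda dbroot: False, 5, assignments)
print(assignments)  # {5: 'pm1'} -- DBRoot 5 moved despite the failed detach
```

The result is exactly the bad state in the logs: the volume never detached from pm5, yet the metadata says DBRoot 5 belongs to pm1.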

I don't know if you have anything on your side to look at for network issues between pm1/pm5 around Apr 25 13:41:26.

PM1

LOGS SHOWING THAT PM5 WASN'T RESPONDING TO PINGS, WHICH INITIATED A MODULE-DOWN FAILOVER
info.log:Apr 25 13:41:26 mcs1-pm1 ProcessManager[62467]: 26.465309 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
info.log:Apr 25 13:41:26 mcs1-pm1 ProcessManager[62467]: 26.475801 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
info.log:Apr 25 13:41:43 mcs1-pm1 ProcessManager[62467]: 43.365051 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
info.log:Apr 25 13:41:43 mcs1-pm1 ProcessManager[62467]: 43.367980 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.310225 |0|0|0| W 17 CAL0000: NIC failed to respond to ping: i-09e7594ac8af0c07e
info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313222 |0|0|0| W 17 CAL0000: module failed to respond to pings: pm5
info.log:Apr 25 13:42:33 mcs1-pm1 ProcessManager[62467]: 33.313373 |0|0|0| C 17 CAL0000: module is down: pm5

FAILURE TO DETACH DBROOT 5 FROM PM5
info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952210 |0|0|0| E 08 CAL0000: ERROR: amazonReattach, detachEC2Volume failed on vol-09f84a3ec4b5f7dbb
info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952312 |0|0|0| E 08 CAL0000: ERROR: amazonReattach api failure
info.log:Apr 25 13:47:02 mcs1-pm1 oamcpp[62467]: 02.952374 |0|0|0| E 08 CAL0000: ERROR: manualMovePmDbroot failure: pm1:5:pm5

PM5

THIS SAYS THAT PM5 WAS BASICALLY IDLE AS FAR AS THE CS LOGS SHOW, NO ACTIVE CPIMPORTS
info.log:Apr 25 13:24:55 mcs1-pm5 cpimport.bin[13602]: 55.066750 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-689411; status-SUCCESS
info.log:Apr 25 13:24:55 mcs1-pm5 writeengineserver[57886]: 55.075839 |0|0|0| I 32 CAL0000: 6607 : cpimport exit on success

THIS SHOWS THAT 3 SECONDS AFTER THE MODULE WASN'T RESPONDING TO PINGS, IT RECEIVED MSGS FROM PM1
info.log:Apr 25 13:42:36 mcs1-pm5 ProcessMonitor[101111]: 36.678186 |0|0|0| I 18 CAL0000: MSG RECEIVED: Re-Init process request on: cpimport
info.log:Apr 25 13:42:36 mcs1-pm5 ProcessMonitor[101111]: 36.925981 |0|0|0| I 18 CAL0000: PROCREINITPROCESS: completed, no ack to ProcMgr



 Comments   
Comment by David Hill (Inactive) [ 2018-05-04 ]

The root cause of this issue was a problem in the network, not a problem on the pm5 module itself. So the module fault-tolerance logic needs to be improved to detect the difference, and also to handle the case where a storage device (EBS in this case) fails to unmount from the detected bad node: it should not be left mounted on another module, like pm1 in this case.

Here is some additional information reported by the customer:

> Yes – it looks like just after ColumnStore noticed connectivity loss, the kernel noticed a module failure for the elastic networking adapter on pm5:

> Apr 25 13:42:53 mcs1-pm5.us-west-2.prod.sasia.local kernel: ena 0000:00:03.0 eth0: Keep alive watchdog timeout.
> That's your culprit. It successfully recovered the ENA device a few minutes later, but networking was down during the interval while the kernel brought the machine back online.

Comment by David Thompson (Inactive) [ 2018-05-22 ]

I think a specific action we can take is to fix the bug that results in the failover node pm1 thinking it should have dbroot5 when it doesn't. It would seem correct behavior to me that if the volume can't be mounted that we raise a critical error / alarm stating that and leaving dbroot5 with pm5 from a metadata point of view. If it is just a network adapter / network outage then it will recover when that is resolved.

I don't know that if in general pm1 can distinguish a network problem from an instance problem.

Comment by David Hill (Inactive) [ 2018-05-22 ]

Agree with DT's last input.

1. We can definitely fix the logic where the dbroot gets assigned to pm1 even though the unmount or detach failed, leaving the system in a bad state.

2. ProcMgr just pings the instance; when it gets 3 ping failures in a row, it takes action. With this logic, it can't determine whether the node was actually down or there was a network issue to that node. Maybe the failed detach could be taken as a signal, telling ProcMgr that since the detach failed, the node is still active and a process still has the disk mounted; in that case, don't process the failover as if the node were down. But that would be just guessing, at best.
Another approach I can check and test: query the status of the instance. Maybe it will report a different status when the instance is down versus when there is a network error.
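The ping-based detection described in point 2 can be sketched like this. The class and threshold names are illustrative, not from the engine; the point is that the counter alone cannot distinguish a dead node from a broken network path:

```python
# Sketch (assumed names) of ProcMgr's liveness check: a module is declared
# down after three consecutive ping failures, with no way to tell a dead
# instance apart from an unreachable-but-alive one.

FAILURE_THRESHOLD = 3

class PingMonitor:
    def __init__(self):
        self.consecutive_failures = 0

    def record_ping(self, succeeded: bool) -> str:
        """Return the module state implied by the latest ping result."""
        if succeeded:
            self.consecutive_failures = 0
            return "ACTIVE"
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            # A network fault looks identical to a crashed node here.
            return "DOWN"
        return "SUSPECT"

monitor = PingMonitor()
states = [monitor.record_ping(ok) for ok in (False, False, False)]
print(states)  # ['SUSPECT', 'SUSPECT', 'DOWN']
```

In the pm5 incident, a flapping ENA adapter produced exactly this sequence while the node itself stayed up.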

Comment by David Hill (Inactive) [ 2018-05-30 ]

First change for this issue:

1. Change the auto-failover in ProcMgr: try to detach the EBS volume from the reported-down module. If the detach fails, leave that DBRoot assigned to that module and mark the module AUTO_OFFLINE. ProcMgr will then wait until the module comes back online and bring it back into the system.
During this period of AUTO_OFFLINE, queries and other commands might fail until the module comes back online.

So this fixes one of the issues: the DBRoot being reassigned to another PM but then failing to detach and mount successfully, leaving the system in a bad state that cannot recover automatically.
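A minimal sketch of that first change, assuming hypothetical names (`handle_module_down`, `detach_volume`, `assignments`); the real ProcMgr logic lives in the C++ engine:

```python
def handle_module_down(detach_volume, module: str, dbroot: int,
                       assignments: dict) -> str:
    """Only reassign the module's DBRoot to pm1 if the EBS detach succeeds.

    If the detach fails, the node may still be up with the volume mounted,
    so keep the DBRoot where it is and mark the module AUTO_OFFLINE;
    ProcMgr then waits for it to come back online.
    """
    if detach_volume(dbroot):
        assignments[dbroot] = "pm1"  # safe to reattach on the failover target
        return "AUTO_DISABLED"
    return "AUTO_OFFLINE"            # DBRoot stays with the original module

# Network fault: pm5 is alive and holds the mount, so the detach fails.
assignments = {5: "pm5"}
state = handle_module_down(lambda dbroot: False, "pm5", 5, assignments)
print(state, assignments)  # AUTO_OFFLINE {5: 'pm5'}
```

Compared with the buggy path, the detach result now gates the reassignment, so a failed detach can no longer leave the same DBRoot claimed by two modules.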

Comment by David Hill (Inactive) [ 2018-05-31 ]

Used to test: this causes a ping failure while leaving the module up and running.

https://access.redhat.com/articles/7134

Comment by David Hill (Inactive) [ 2018-06-01 ]

https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/488

Comment by David Hill (Inactive) [ 2018-06-04 ]

This issue was specific to an Amazon system setup with EBS external storage, so that is how this needs to be tested.

Test Cases - system with 2 or more pms

THIS SIMULATES A DOWN MODULE
1. Via the AWS console, stop pm2 or higher. Make sure that the dbroot for pm2 gets correctly remounted to pm1 and the system gets back into an ACTIVE state with that PM in an AUTO_DISABLED state. Then start that PM. Make sure that the dbroot gets moved back to that PM and the system gets back to ACTIVE, with the PM back in an ACTIVE state as well.

THIS SIMULATES A NETWORK ISSUE BETWEEN PM1 AND ANOTHER PM
2. From pm2 or higher, run this command as root (writing to /proc/sys requires root). This will cause pings from pm1 to that PM to fail:

  1. echo "1" > /proc/sys/net/ipv4/icmp_echo_ignore_all
    The PM will be detected as down and will be placed in AUTO_OFFLINE, but the dbroot will fail to be detached, so the DBRoot will not move to pm1 and is left attached to the offline PM. That PM will be placed in an AUTO_DISABLED state and the system will be Active. But since that PM isn't reachable, DB functionality will be limited or may not work at all, which is what we want.
    Now run the following, which will cause pings to start working again; the OFFLINE PM will be brought back into ACTIVE with the dbroot still attached to it:
  2. echo "0" > /proc/sys/net/ipv4/icmp_echo_ignore_all
Comment by Daniel Lee (Inactive) [ 2018-06-08 ]

Build tested: 1.1.5-1 ami mcs-1.1.5 (ami-2571365d)

With a 1um2pm stack, I stopped pm2, and dbroot 2 did get remounted on pm1. But when I started pm2, the stack remained in this state:

mcsadmin> getprocessstatus
getprocessstatus Fri Jun 8 15:45:16 2018

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Fri Jun 8 15:25:22 2018 2268
ServerMonitor um1 ACTIVE Fri Jun 8 15:25:35 2018 2578
DBRMWorkerNode um1 ACTIVE Fri Jun 8 15:25:35 2018 2591
ExeMgr um1 ACTIVE Fri Jun 8 15:25:47 2018 4115
DDLProc um1 ACTIVE Fri Jun 8 15:25:51 2018 4128
DMLProc um1 ACTIVE Fri Jun 8 15:25:56 2018 4138
mysqld um1 ACTIVE Fri Jun 8 15:25:45 2018

ProcessMonitor pm1 ACTIVE Fri Jun 8 15:24:35 2018 4314
ProcessManager pm1 ACTIVE Fri Jun 8 15:24:41 2018 4440
DBRMControllerNode pm1 ACTIVE Fri Jun 8 15:25:27 2018 5383
ServerMonitor pm1 ACTIVE Fri Jun 8 15:25:29 2018 5403
DBRMWorkerNode pm1 ACTIVE Fri Jun 8 15:25:29 2018 5441
DecomSvr pm1 ACTIVE Fri Jun 8 15:25:33 2018 5585
PrimProc pm1 ACTIVE Fri Jun 8 15:25:37 2018 5664
WriteEngineServer pm1 ACTIVE Fri Jun 8 15:36:24 2018 19921

ProcessMonitor pm2 AUTO_OFFLINE Fri Jun 8 15:33:28 2018
ProcessManager pm2 AUTO_OFFLINE Fri Jun 8 15:33:48 2018
DBRMControllerNode pm2 AUTO_OFFLINE Fri Jun 8 15:33:28 2018
ServerMonitor pm2 AUTO_OFFLINE Fri Jun 8 15:33:28 2018
DBRMWorkerNode pm2 AUTO_OFFLINE Fri Jun 8 15:33:28 2018
DecomSvr pm2 AUTO_OFFLINE Fri Jun 8 15:33:28 2018
PrimProc pm2 AUTO_OFFLINE Fri Jun 8 15:33:28 2018
WriteEngineServer pm2 AUTO_OFFLINE Fri Jun 8 15:33:28 2018
mcsadmin> getstorage
getstorageconfig Fri Jun 8 15:45:19 2018

System Storage Configuration

Performance Module (DBRoot) Storage Type = external
User Module Storage Type = external
System Assigned DBRoot Count = 2
DBRoot IDs assigned to 'pm1' = 1, 2
DBRoot IDs assigned to 'pm2' =

Amazon EC2 Volume Name/Device Name for 'um1': vol-06ecc9b98ae685a4d, /dev/xvdf

Amazon EC2 Volume Name/Device Name/Amazon Device Name for DBRoot1: vol-03bb47240de1462a7, /dev/sdg, /dev/xvdg
Amazon EC2 Volume Name/Device Name/Amazon Device Name for DBRoot2: vol-039d1e8e4cc68c215, /dev/sdh, /dev/xvdh

mcsadmin>

[mariadb-user@ip-172-31-26-89 columnstore]$ sudo cat crit.log
Jun 8 15:33:14 ip-172-31-26-89 ProcessManager[4440]: 14.846827 |0|0|0| C 17 CAL0000: module is down: pm2
Jun 8 15:38:07 ip-172-31-26-89 ProcessManager[4440]: 07.430373 |0|0|0| C 17 CAL0000: Module failed to auto start: pm2

err.log
Jun 8 15:33:14 ip-172-31-26-89 ProcessManager[4440]: 14.846827 |0|0|0| C 17 CAL0000: module is down: pm2
Jun 8 15:33:19 ip-172-31-26-89 ProcessManager[4440]: 19.872474 |0|0|0| E 17 CAL0000: line: 6246 sendMsgProcMon ping failure
Jun 8 15:33:35 ip-172-31-26-89 ProcessManager[4440]: 35.205208 |0|0|0| E 17 CAL0000: line: 6246 sendMsgProcMon ping failure
Jun 8 15:36:45 ip-172-31-26-89 oamcpp[4440]: 45.001595 |0|0|0| E 08 CAL0000: ERROR: mount failed on dbroot2
Jun 8 15:37:45 ip-172-31-26-89 oamcpp[4440]: 45.070764 |0|0|0| E 08 CAL0000: ERROR: amazonDetach, umount failed on 2
Jun 8 15:38:07 ip-172-31-26-89 ProcessManager[4440]: 07.430373 |0|0|0| C 17 CAL0000: Module failed to auto start: pm2

According to the creation date of the AMI, it was created after the last comment, so I assume it has the latest change.

Comment by David Hill (Inactive) [ 2018-06-11 ]

Started up the system and ProcMon didn't start on um1 and pm2. It not starting on pm2
is why the failover didn't work.

PROBLEM WAS THE RC.LOCAL SERVICE WAS DISABLED ON UM1/PM2. MADE THEM ACTIVE ON DANIEL'S SYSTEM
AND IT FIXED THE FAILOVER ISSUE.

LOOKS LIKE A CHANGE TO THE AMI ITSELF NEEDS TO BE MADE TO MAKE SURE THE SERVICE IS
ACTIVE

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 INITIAL
ServerMonitor um1 INITIAL
DBRMWorkerNode um1 INITIAL
ExeMgr um1 INITIAL
DDLProc um1 INITIAL
DMLProc um1 INITIAL
mysqld um1 INITIAL

ProcessMonitor pm1 ACTIVE Mon Jun 11 13:43:58 2018 925
ProcessManager pm1 ACTIVE Mon Jun 11 13:44:04 2018 1115
DBRMControllerNode pm1 INITIAL
ServerMonitor pm1 INITIAL
DBRMWorkerNode pm1 INITIAL
DecomSvr pm1 INITIAL
PrimProc pm1 INITIAL
WriteEngineServer pm1 INITIAL

ProcessMonitor pm2 INITIAL
ProcessManager pm2 INITIAL
DBRMControllerNode pm2 INITIAL
ServerMonitor pm2 INITIAL
DBRMWorkerNode pm2 INITIAL
DecomSvr pm2 INITIAL
PrimProc pm2 INITIAL
WriteEngineServer pm2 INITIAL

systemctl status rc-local.service
● rc-local.service - /etc/rc.d/rc.local Compatibility
Loaded: loaded (/usr/lib/systemd/system/rc-local.service; static; vendor preset: disabled)
Active: inactive (dead)
[root@ip-172-31-20-232 ~]# systemctl start rc-local
[root@ip-172-31-20-232 ~]# systemctl enable rc-local
[root@ip-172-31-20-232 ~]# systemctl status rc-local.service
● rc-local.service - /etc/rc.d/rc.local Compatibility
Loaded: loaded (/usr/lib/systemd/system/rc-local.service; static; vendor preset: disabled)
Active: active (exited) since Mon 2018-06-11 13:56:52 UTC; 5s ago

Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal systemd[1]: Starting /etc/rc.d/rc.loca...
Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal rc.local[952]: % Total % Received %...
Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal rc.local[952]: Dload Upload Total ...
Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal rc.local[952]: [155B blob data]
Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal sudo[959]: root : TTY=unknown ; PW...
Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal rc.local[952]: Starting MariaDB Column...
Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal systemd[1]: Started /etc/rc.d/rc.local...
Hint: Some lines were ellipsized, use -l to show in full.
[root@ip-172-31-20-232 ~]#

After updating rc-local on um1 and pm2 and rebooting, ProcMon started on both nodes:

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Mon Jun 11 14:00:19 2018

Module um1 ACTIVE Mon Jun 11 14:00:15 2018
Module pm1 ACTIVE Mon Jun 11 14:00:06 2018
Module pm2 MAN_DISABLED

Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Mon Jun 11 13:59:50 2018 923
ServerMonitor um1 ACTIVE Mon Jun 11 14:00:01 2018 1258
DBRMWorkerNode um1 ACTIVE Mon Jun 11 14:00:02 2018 1284
ExeMgr um1 ACTIVE Mon Jun 11 14:00:08 2018 1674
DDLProc um1 ACTIVE Mon Jun 11 14:00:12 2018 1685
DMLProc um1 ACTIVE Mon Jun 11 14:00:18 2018 1694
mysqld um1 ACTIVE Mon Jun 11 14:00:06 2018

ProcessMonitor pm1 ACTIVE Mon Jun 11 13:59:32 2018 926
ProcessManager pm1 ACTIVE Mon Jun 11 13:59:38 2018 1206
DBRMControllerNode pm1 ACTIVE Mon Jun 11 13:59:58 2018 1719
ServerMonitor pm1 ACTIVE Mon Jun 11 14:00:00 2018 1754
DBRMWorkerNode pm1 ACTIVE Mon Jun 11 14:00:00 2018 1806
DecomSvr pm1 ACTIVE Mon Jun 11 14:00:04 2018 1951
PrimProc pm1 ACTIVE Mon Jun 11 14:00:06 2018 2011
WriteEngineServer pm1 ACTIVE Mon Jun 11 14:00:07 2018 2060

ProcessMonitor pm2 ACTIVE Mon Jun 11 14:00:00 2018 922
ProcessManager pm2 MAN_OFFLINE Mon Jun 11 13:59:51 2018
DBRMControllerNode pm2 MAN_OFFLINE Mon Jun 11 13:59:51 2018
ServerMonitor pm2 MAN_OFFLINE Mon Jun 11 13:59:51 2018
DBRMWorkerNode pm2 MAN_OFFLINE Mon Jun 11 13:59:51 2018
DecomSvr pm2 MAN_OFFLINE Mon Jun 11 13:59:51 2018
PrimProc pm2 MAN_OFFLINE Mon Jun 11 13:59:51 2018
WriteEngineServer pm2 MAN_OFFLINE Mon Jun 11 13:59:51 2018

Active Alarm Counts: Critical = 2, Major = 0, Minor = 0, Warning = 0, Info = 0

Retesting pm2 failover - after stopping pm2:

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Mon Jun 11 14:06:59 2018

Module um1 ACTIVE Mon Jun 11 14:05:20 2018
Module pm1 ACTIVE Mon Jun 11 14:05:03 2018
Module pm2 AUTO_DISABLED/DEGRADED Mon Jun 11 14:06:39 2018

Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Mon Jun 11 13:59:50 2018 923
ServerMonitor um1 ACTIVE Mon Jun 11 14:04:58 2018 2262
DBRMWorkerNode um1 ACTIVE Mon Jun 11 14:04:59 2018 2288
ExeMgr um1 ACTIVE Mon Jun 11 14:05:13 2018 4001
DDLProc um1 ACTIVE Mon Jun 11 14:05:17 2018 4014
DMLProc um1 ACTIVE Mon Jun 11 14:05:21 2018 4024
mysqld um1 ACTIVE Mon Jun 11 14:05:10 2018

ProcessMonitor pm1 ACTIVE Mon Jun 11 13:59:32 2018 926
ProcessManager pm1 ACTIVE Mon Jun 11 13:59:38 2018 1206
DBRMControllerNode pm1 ACTIVE Mon Jun 11 14:04:54 2018 7855
ServerMonitor pm1 ACTIVE Mon Jun 11 14:04:57 2018 7888
DBRMWorkerNode pm1 ACTIVE Mon Jun 11 14:04:57 2018 7932
DecomSvr pm1 ACTIVE Mon Jun 11 14:05:01 2018 8073
PrimProc pm1 ACTIVE Mon Jun 11 14:05:03 2018 8180
WriteEngineServer pm1 ACTIVE Mon Jun 11 14:05:04 2018 8251

ProcessMonitor pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018
ProcessManager pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018
DBRMControllerNode pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018
ServerMonitor pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018
DBRMWorkerNode pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018
DecomSvr pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018
PrimProc pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018
WriteEngineServer pm2 AUTO_OFFLINE Mon Jun 11 14:06:39 2018

Active Alarm Counts: Critical = 2, Major = 1, Minor = 0, Warning = 0, Info = 0

After starting pm2:

System and Module statuses

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Mon Jun 11 14:09:03 2018

Module um1 ACTIVE Mon Jun 11 14:05:20 2018
Module pm1 ACTIVE Mon Jun 11 14:05:03 2018
Module pm2 ACTIVE Mon Jun 11 14:08:19 2018

Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Mon Jun 11 13:59:50 2018 923
ServerMonitor um1 ACTIVE Mon Jun 11 14:04:58 2018 2262
DBRMWorkerNode um1 ACTIVE Mon Jun 11 14:08:26 2018 4081
ExeMgr um1 ACTIVE Mon Jun 11 14:08:45 2018 4105
DDLProc um1 ACTIVE Mon Jun 11 14:08:57 2018 4402
DMLProc um1 ACTIVE Mon Jun 11 14:09:03 2018 4416
mysqld um1 ACTIVE Mon Jun 11 14:08:51 2018

ProcessMonitor pm1 ACTIVE Mon Jun 11 13:59:32 2018 926
ProcessManager pm1 ACTIVE Mon Jun 11 13:59:38 2018 1206
DBRMControllerNode pm1 ACTIVE Mon Jun 11 14:08:23 2018 12027
ServerMonitor pm1 ACTIVE Mon Jun 11 14:04:57 2018 7888
DBRMWorkerNode pm1 ACTIVE Mon Jun 11 14:08:30 2018 12130
DecomSvr pm1 ACTIVE Mon Jun 11 14:05:01 2018 8073
PrimProc pm1 ACTIVE Mon Jun 11 14:08:39 2018 12236
WriteEngineServer pm1 ACTIVE Mon Jun 11 14:08:53 2018 12414

ProcessMonitor pm2 ACTIVE Mon Jun 11 14:07:58 2018 928
ProcessManager pm2 HOT_STANDBY Mon Jun 11 14:08:08 2018 1093
DBRMControllerNode pm2 COLD_STANDBY Mon Jun 11 14:08:08 2018
ServerMonitor pm2 ACTIVE Mon Jun 11 14:08:12 2018 1179
DBRMWorkerNode pm2 ACTIVE Mon Jun 11 14:08:35 2018 1282
DecomSvr pm2 ACTIVE Mon Jun 11 14:08:16 2018 1218
PrimProc pm2 ACTIVE Mon Jun 11 14:08:40 2018 1300
WriteEngineServer pm2 ACTIVE Mon Jun 11 14:08:54 2018 1332

Active Alarm Counts: Critical = 2, Major = 0, Minor = 0, Warning = 0, Info = 0

Comment by David Hill (Inactive) [ 2018-06-11 ]

The rc-local service wasn't active on um1/pm2, so that is why ProcMon didn't restart after a failure in Daniel's test. Looking at a fix in the AMI.

Comment by David Hill (Inactive) [ 2018-06-11 ]

Created a new test 1.1.5 AMI with the rc-local service activated on all nodes. Retested pm2 failover and it all worked.

Daniel can retest; it did work with a micro instance.

Comment by David Hill (Inactive) [ 2018-06-11 ]

A new AMI for 1.1.5 is created and was working for me. Daniel is testing now.

If it works for Daniel, I'll assign the JIRA back to QA.

Comment by Daniel Lee (Inactive) [ 2018-06-13 ]

Build test: 1.1.5-1 ami (ami-1a541762)

Test scenario #1 (failover due to pm2 being stopped):
After failover (pm2 instance started), the ColumnStore service on pm2 did not start, so failover failed.

Test scenario #2 (NIC on pm2 is down):
The stack recovered as expected. The only downside is that getSystemStatus shows pm2 as AUTO_DISABLED/DEGRADED, while getProcessStatus shows pm2 as normal, with all processes and PIDs displayed.

Comment by Daniel Lee (Inactive) [ 2018-06-13 ]

The failover issue was not related to 1.1.5-1. It was an AMI issue. It is being tracked by MCOL-1467.

Generated at Thu Feb 08 02:28:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.