[MCOL-434] DecomSvr fails to cleanly start when a Amazon Instance is stopped/started Created: 2016-12-04  Updated: 2016-12-09  Resolved: 2016-12-09

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.0.6
Fix Version/s: 1.0.6

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: David Hill (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

AWS system


Sprint: 2016-24

 Description   

DecomSvr fails to start the initially after a instance is stop/started
After a number of retries, it did go ACTIVE

Issue found when MCOL-422 was fixed

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Sun Dec 4 21:15:59 2016
Module pm1 ACTIVE Sun Dec 4 21:15:56 2016
MariaDB Columnstore Process statuses
Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 ACTIVE Sun Dec 4 21:15:00 2016 1245
ProcessManager pm1 ACTIVE Sun Dec 4 21:15:06 2016 4005
DBRMControllerNode pm1 ACTIVE Sun Dec 4 21:15:14 2016 5479
ServerMonitor pm1 ACTIVE Sun Dec 4 21:15:16 2016 5870
DBRMWorkerNode pm1 ACTIVE Sun Dec 4 21:15:16 2016 6142
DecomSvr pm1 AUTO_INIT Sun Dec 4 21:17:07 2016 16905
PrimProc pm1 ACTIVE Sun Dec 4 21:15:25 2016 9215
ExeMgr pm1 ACTIVE Sun Dec 4 21:15:29 2016 9438
WriteEngineServer pm1 ACTIVE Sun Dec 4 21:15:33 2016 9518
DDLProc pm1 ACTIVE Sun Dec 4 21:15:53 2016 15795
DMLProc pm1 ACTIVE Sun Dec 4 21:15:58 2016 15903
mysqld pm1 ACTIVE Sun Dec 4 21:15:58 2016 4712

STARTING Process: DecomSvr
Dec 4 21:17:37 ip-172-30-0-34 ProcessMonitor[1245]: 37.384443 |0|0|0| D 18 CAL0000: Process location: //home/mariadb-user/mariadb/columnstore/bin/DecomSvr
Dec 4 21:17:37 ip-172-30-0-34 ProcessMonitor[1245]: 37.386610 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DecomSvr
Dec 4 21:17:37 ip-172-30-0-34 ProcessMonitor[1245]: 37.390396 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DecomSvr*
Dec 4 21:17:37 ip-172-30-0-34 ProcessMonitor[1245]: 37.390447 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 3 PID = 0
Dec 4 21:17:37 ip-172-30-0-34 ProcessMonitor[1245]: 37.393405 |0|0|0| D 18 CAL0000: Send SET Alarm ID 13 on device DecomSvr
Dec 4 21:17:37 ip-172-30-0-34 ProcessMonitor[1245]: 37.400608 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 27 on device DBRM
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.409193 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 4 PID = 17400
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.416406 |0|0|0| E 18 CAL0000: EXCEPTION ERROR on setProcessStatus: Caught unknown exception!
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.416470 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 21 PID = 17400
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.422558 |0|0|0| E 18 CAL0000: EXCEPTION ERROR on setProcessStatus: Caught unknown exception!
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.422605 |0|0|0| D 18 CAL0000: DecomSvr PID is 17400
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.422716 |0|0|0| D 18 CAL0000: Inform Process Mgr that process was restarted: DecomSvr
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.422736 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 21 on device DecomSvr
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.423338 |0|0|0| I 18 CAL0000: Calpont Process DecomSvr restarted successfully!!
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.428955 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 13 on device DecomSvr
Dec 4 21:17:39 ip-172-30-0-34 ProcessMonitor[1245]: 39.437494 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DecomSvr
Dec 4 21:17:39 ip-172-30-0-34 ProcessManager[4005]: 39.482441 |0|0|0| I 17 CAL0000: MSG RECEIVED: Process Restarted on pm1/DecomSvr
Dec 4 21:18:04 ip-172-30-0-34 ProcessMonitor[1245]: 04.531056 |0|0|0| C 18 CAL0000: DecomSvr/17400 failed to init in 20 seconds, force killing it so it can restart
Dec 4 21:18:04 ip-172-30-0-34 ProcessMonitor[1245]: 04.531098 |0|0|0| D 18 CAL0000: Send SET Alarm ID 25 on device DecomSvr
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.562675 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: DecomSvr, old PID = 17400
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.562797 |0|0|0| D 18 CAL0000: STOPPING Process: DecomSvr
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.562851 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 1 PID = 0
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.562905 |0|0|0| D 18 CAL0000: Send SET Alarm ID 13 on device DecomSvr
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.569512 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DecomSvr*
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.569562 |0|0|0| D 18 CAL0000: STARTING Process: DecomSvr
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.569586 |0|0|0| D 18 CAL0000: Process location: //home/mariadb-user/mariadb/columnstore/bin/DecomSvr
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.571560 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DecomSvr
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.575237 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DecomSvr*
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.575289 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 3 PID = 0
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.577098 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DecomSvr State = AUTO_OFFLINE
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.577138 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DecomSvr State = AUTO_OFFLINE PID = 0
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.587201 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DecomSvr State = AUTO_INIT
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.587243 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DecomSvr State = AUTO_INIT PID = 0
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.590043 |0|0|0| D 18 CAL0000: Send SET Alarm ID 13 on device DecomSvr
Dec 4 21:18:09 ip-172-30-0-34 ProcessMonitor[1245]: 09.596244 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 27 on device DBRM
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.594404 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 4 PID = 18021
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.601588 |0|0|0| E 18 CAL0000: EXCEPTION ERROR on setProcessStatus: Caught unknown exception!
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.601659 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 21 PID = 18021
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.607957 |0|0|0| E 18 CAL0000: EXCEPTION ERROR on setProcessStatus: Caught unknown exception!
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.608004 |0|0|0| D 18 CAL0000: DecomSvr PID is 18021
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.608091 |0|0|0| D 18 CAL0000: Inform Process Mgr that process was restarted: DecomSvr
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.608119 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 13 on device DecomSvr
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.608767 |0|0|0| I 18 CAL0000: Calpont Process DecomSvr restarted successfully!!
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.614615 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DecomSvr
Dec 4 21:18:11 ip-172-30-0-34 ProcessMonitor[1245]: 11.621288 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 21 on device DecomSvr
Dec 4 21:18:11 ip-172-30-0-34 ProcessManager[4005]: 11.668342 |0|0|0| I 17 CAL0000: MSG RECEIVED: Process Restarted on pm1/DecomSvr
Dec 4 21:18:36 ip-172-30-0-34 ProcessMonitor[1245]: 36.716680 |0|0|0| C 18 CAL0000: DecomSvr/18021 failed to init in 20 seconds, force killing it so it can restart
Dec 4 21:18:36 ip-172-30-0-34 ProcessMonitor[1245]: 36.716722 |0|0|0| D 18 CAL0000: Send SET Alarm ID 25 on device DecomSvr
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.747118 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: DecomSvr, old PID = 18021
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.747227 |0|0|0| D 18 CAL0000: STOPPING Process: DecomSvr
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.747272 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 1 PID = 0
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.747315 |0|0|0| D 18 CAL0000: Send SET Alarm ID 13 on device DecomSvr
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.754309 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DecomSvr*
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.754361 |0|0|0| D 18 CAL0000: STARTING Process: DecomSvr
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.754400 |0|0|0| D 18 CAL0000: Process location: //home/mariadb-user/mariadb/columnstore/bin/DecomSvr
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.756584 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DecomSvr
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.760057 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DecomSvr*
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.760108 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 3 PID = 0
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.762082 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DecomSvr State = AUTO_OFFLINE
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.762127 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DecomSvr State = AUTO_OFFLINE PID = 0
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.763604 |0|0|0| D 18 CAL0000: Send SET Alarm ID 13 on device DecomSvr
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.771972 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DecomSvr State = AUTO_INIT
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.772017 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DecomSvr State = AUTO_INIT PID = 0
Dec 4 21:18:41 ip-172-30-0-34 ProcessMonitor[1245]: 41.780036 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 27 on device DBRM
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.779175 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 4 PID = 18691
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.792351 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DecomSvr State = ACTIVE
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.792429 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DecomSvr State = ACTIVE PID = 18691
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.792446 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 21 PID = 18691
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.804486 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DecomSvr State = PID_UPDATE
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.804553 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DecomSvr State = PID_UPDATE PID = 18691
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.804569 |0|0|0| D 18 CAL0000: DecomSvr PID is 18691
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.804665 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 21 on device DecomSvr
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.804668 |0|0|0| D 18 CAL0000: Inform Process Mgr that process was restarted: DecomSvr
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.805249 |0|0|0| I 18 CAL0000: Calpont Process DecomSvr restarted successfully!!
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.811381 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 13 on device DecomSvr
Dec 4 21:18:43 ip-172-30-0-34 ProcessMonitor[1245]: 43.820111 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DecomSvr
Dec 4 21:18:43 ip-172-30-0-34 ProcessManager[4005]: 43.862092 |0|0|0| I 17 CAL0000: MSG RECEIVED: Process Restarted on pm1/DecomSvr
[root@ip-172-30-0-34 columnstore]#



 Comments   
Comment by David Hill (Inactive) [ 2016-12-07 ]

fixed with changes in MCOL-435 repo... Below is the list of files that were changed.
I'm not sure if any one thing fixed this issue, but with the changes made and some are just AMI enchancments like the ones made to postConfigure, the problems with multi-node starts and with startup after reboots have been fixed.

1. Fixed a code issue in procmgr from an older checkin
2. Fixed some sudo issues in install scripts where sudo was being called when root user, which was causing issues on the cs-test system
3. Added in some additional log to help determine the cause of the get/set status issues, oamapi and procmon was changed in this area along

M oam/install_scripts/columnstore
M oam/install_scripts/post-install
M oam/install_scripts/post-mysql-install
M oam/install_scripts/pre-uninstall
M oam/install_scripts/syslogSetup.sh
M oam/oamcpp/liboamcpp.cpp
M oam/oamcpp/liboamcpp.h
M oamapps/mcsadmin/mcsadmin.cpp
M oamapps/postConfigure/installer.cpp
M oamapps/postConfigure/postConfigure.cpp
M procmgr/processmanager.cpp
M procmon/main.cpp
M procmon/processmonitor.cpp

Comment by David Hill (Inactive) [ 2016-12-07 ]

please review, might need to discuss some changes

Comment by David Hill (Inactive) [ 2016-12-09 ]

testing on amazon ec2 with latest AMI, MariaDB-ColumnStore-1.0

Generated at Thu Feb 08 02:21:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.