[MCOL-435] Amazon AMI multi-node system didn't successfully restart after a stop/start Created: 2016-12-04  Updated: 2016-12-09  Resolved: 2016-12-09

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 1.0.6
Fix Version/s: 1.0.6

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: David Hill (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

AWS multi-node system


Sprint: 2016-24

 Description   

While fixing MCOL-422, discovered a problem where the system doesn't come back ACTIVE after a stop/start of the instances on a multi-node system.

Component     Status                      Last Status Change
------------  --------------------------  ------------------------
System        FAILED                      Sun Dec 4 23:43:37 2016

Module um1    MAN_OFFLINE                 Sun Dec 4 23:43:36 2016
Module pm1    ACTIVE                      Sun Dec 4 23:43:41 2016
Module pm2    FAILED                      Sun Dec 4 23:43:35 2016

Active Parent OAM Performance Module is 'pm1'

MariaDB Columnstore Process statuses

Process             Module  Status           Last Status Change        Process ID
------------------  ------  ---------------  ------------------------  ----------
ProcessMonitor      um1     ACTIVE           Sun Dec 4 23:43:16 2016   1306
ServerMonitor       um1     ACTIVE           Sun Dec 4 23:43:32 2016   8207
DBRMWorkerNode      um1     ACTIVE           Sun Dec 4 23:43:36 2016   8769
ExeMgr              um1     INITIAL
DDLProc             um1     INITIAL
DMLProc             um1     INITIAL
mysqld              um1     ACTIVE           Sun Dec 4 23:43:28 2016

ProcessMonitor      pm1     ACTIVE           Sun Dec 4 23:43:05 2016   1437
ProcessManager      pm1     ACTIVE           Sun Dec 4 23:43:10 2016   4093
DBRMControllerNode  pm1     ACTIVE           Sun Dec 4 23:43:33 2016   9262
ServerMonitor       pm1     ACTIVE           Sun Dec 4 23:43:35 2016   9431
DBRMWorkerNode      pm1     ACTIVE           Sun Dec 4 23:43:35 2016   10128
DecomSvr            pm1     ACTIVE           Sun Dec 4 23:43:39 2016   14411
PrimProc            pm1     ACTIVE           Sun Dec 4 23:43:41 2016   14528
WriteEngineServer   pm1     ACTIVE           Sun Dec 4 23:43:42 2016   14608

ProcessMonitor      pm2     ACTIVE           Sun Dec 4 23:45:15 2016   11675
ProcessManager      pm2     HOT_STANDBY      Sun Dec 4 23:44:56 2016   11389
DBRMControllerNode  pm2     AUTO_OFFLINE     Sun Dec 4 23:46:00 2016
ServerMonitor       pm2     INITIAL
DBRMWorkerNode      pm2     INITIAL
DecomSvr            pm2     INITIAL
PrimProc            pm2     INITIAL
WriteEngineServer   pm2     INITIAL



 Comments   
Comment by David Hill (Inactive) [ 2016-12-04 ]

One issue is this:

DBRMControllerNode pm2 AUTO_OFFLINE

This process should be getting started on PM2.
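
For anyone triaging a similar listing, here is a rough sketch of how the process status rows from the Description could be scanned for processes that never reached a running state. It is only an illustration: the function name, the regular expression, and the choice of which statuses count as healthy (ACTIVE and HOT_STANDBY) are assumptions made here, not part of ColumnStore or mcsadmin.

```python
import re

# Assumption for illustration: only these two statuses are treated as healthy.
HEALTHY = {"ACTIVE", "HOT_STANDBY"}

# Matches rows such as "DBRMControllerNode pm2 AUTO_OFFLINE Sun Dec 4 ...".
# The um<N>/pm<N> module naming is taken from the listing in this ticket.
ROW_RE = re.compile(r"^(\S+)\s+((?:um|pm)\d+)\s+([A-Z_]+)\b")


def unhealthy_processes(listing):
    """Return (process, module, status) for each process row that is not healthy."""
    found = []
    for line in listing.splitlines():
        match = ROW_RE.match(line.strip())
        if not match:
            continue  # header, ruler, or blank line
        process, module, status = match.groups()
        if process == "Module":
            continue  # "Module pm2 FAILED ..." rows belong to the system table
        if status not in HEALTHY:
            found.append((process, module, status))
    return found
```

Run against the pm2 rows above, this would report DBRMControllerNode as AUTO_OFFLINE along with the processes still in INITIAL.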

Comment by David Hill (Inactive) [ 2016-12-05 ]

MCOL-436 - alarms being logged on all nodes might be related to this issue

Also determined that ProcMon on PM2 was continually restarting. This had been a problem in the past, related to ProcMon trying to write to the log directory to update the alarm logs. The log directory is correctly permissioned, which fixed the previous issue, but this might somehow be tied to the log and alarm handling again.

Comment by David Hill (Inactive) [ 2016-12-07 ]

Fixed with changes in the MCOL-435 repo. Below is the list of files that were changed.
I'm not sure whether any single change fixed this issue, and some of the changes are just AMI enhancements, such as the ones made to postConfigure. With all of the changes in place, the problems with multi-node starts and with startup after reboots have been fixed.

1. Fixed a code issue in procmgr from an older checkin
2. Fixed some sudo issues in the install scripts, where sudo was being called even when running as the root user, which was causing issues on the cs-test system
3. Added some additional logging to help determine the cause of the get/set status issues; oamapi and procmon were also changed in this area

M oam/install_scripts/columnstore
M oam/install_scripts/post-install
M oam/install_scripts/post-mysql-install
M oam/install_scripts/pre-uninstall
M oam/install_scripts/syslogSetup.sh
M oam/oamcpp/liboamcpp.cpp
M oam/oamcpp/liboamcpp.h
M oamapps/mcsadmin/mcsadmin.cpp
M oamapps/postConfigure/installer.cpp
M oamapps/postConfigure/postConfigure.cpp
M procmgr/processmanager.cpp
M procmon/main.cpp
M procmon/processmonitor.cpp
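
As an illustration of the sudo fix in item 2, the idea is to prefix a command with sudo only when the caller is not already root. The actual install scripts are shell scripts and their changes are not reproduced here; the sketch below is in Python purely to show the check, and the helper name and example command are hypothetical.

```python
import os


def with_sudo_if_needed(command, euid=None):
    """Return the command unchanged for root, otherwise prefixed with sudo.

    `command` is an argument list. `euid` defaults to the effective user ID
    of the current process and can be passed explicitly for testing.
    """
    if euid is None:
        euid = os.geteuid()
    if euid == 0:
        # Already root: calling sudo is unnecessary and can fail on
        # systems where sudo is missing or not configured for root.
        return list(command)
    return ["sudo"] + list(command)
```

For example, with_sudo_if_needed(["chmod", "755", "/some/path"], euid=0) leaves the command as is, while any non-zero euid yields the same command with "sudo" in front.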

Comment by David Hill (Inactive) [ 2016-12-07 ]

please review, might need to discuss some of the changes

Comment by David Hill (Inactive) [ 2016-12-09 ]

tested on build from 12/09, passed restart test

Generated at Thu Feb 08 02:21:03 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.