[MCOL-1977] Amazon AMI with EBS - pm failover failed - ExeMgr in BUSY_INIT for 2 miuntes Created: 2018-11-28  Updated: 2021-01-16  Resolved: 2021-01-16

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.2.2
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Environment:

Amazon AMI with EBS 1um and 2 pm with EBS



 Description   

Started with 1um 2pm all Active and stopped PM2 instance.
Notice ExeMgr was in a BUST_INIT state on UM1 for a long time, over 2 minutes. This is not normal
And DDL/DMLProc remained MAN_OFFLINE. DDL/DML didnt start up because ExeMgr in BUSY_INIT too long. code gave up too soon on bring up DDL/DML

System BUSY_INIT Wed Nov 28 16:11:40 2018

Module um1 ACTIVE Wed Nov 28 16:12:39 2018
Module pm1 ACTIVE Wed Nov 28 16:08:01 2018
Module pm2 AUTO_DISABLED/DEGRADED Wed Nov 28 16:11:51 2018

Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore set for Distributed Install

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Wed Nov 28 16:07:37 2018 17077
ServerMonitor um1 ACTIVE Wed Nov 28 16:07:58 2018 17683
DBRMWorkerNode um1 ACTIVE Wed Nov 28 16:12:10 2018 20984
ExeMgr um1 ACTIVE Wed Nov 28 16:14:39 2018 21107
DDLProc um1 MAN_OFFLINE Wed Nov 28 16:11:55 2018
DMLProc um1 MAN_OFFLINE Wed Nov 28 16:11:56 2018
mysqld um1 ACTIVE Wed Nov 28 16:12:39 2018 21403

ProcessMonitor pm1 ACTIVE Wed Nov 28 16:07:16 2018 23923
ProcessManager pm1 ACTIVE Wed Nov 28 16:07:22 2018 24379
DBRMControllerNode pm1 ACTIVE Wed Nov 28 16:12:05 2018 31560
ServerMonitor pm1 ACTIVE Wed Nov 28 16:07:58 2018 25265
DBRMWorkerNode pm1 ACTIVE Wed Nov 28 16:12:14 2018 31640
PrimProc pm1 ACTIVE Wed Nov 28 16:12:20 2018 31690
WriteEngineServer pm1 ACTIVE Wed Nov 28 16:12:26 2018 31751

ProcessMonitor pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
ProcessManager pm2 AUTO_OFFLINE Wed Nov 28 16:13:04 2018
DBRMControllerNode pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
ServerMonitor pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
DBRMWorkerNode pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
PrimProc pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
WriteEngineServer pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018

Active Alarm Counts: Critical = 3, Major = 1, Minor = 2, Warning = 0, Info = 0

DDL/DML didnt start up because ExeMgr in BUSY_INIT – log files from um1

Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.256138 |0|0|0| D 18 CAL0000: STARTING Process: DDLProc
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.256167 |0|0|0| D 18 CAL0000: Process location: /home/mysql/mariadb/columnstore/bin/DDLProc
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.257881 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm1 is 4
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.261321 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm2 is 1
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.268606 |0|0|0| D 18 CAL0000: Dependent process of DBRMWorkerNode/um1 is 4
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.269818 |0|0|0| D 18 CAL0000: Dependent process of ExeMgr/um1 is 19
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.272698 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.272776 |0|0|0| I 18 CAL0000: START: ACK back to ProcMgr, return status = 8
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.281977 |0|0|0| I 18 CAL0000: MSG RECEIVED: Start process request on: DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.284498 |0|0|0| D 18 CAL0000: checkSpecialProcessState on : DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285704 |0|0|0| D 18 CAL0000: checkSpecialProcessState status return : 2
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285823 |0|0|0| D 18 CAL0000: STARTING Process: DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285857 |0|0|0| D 18 CAL0000: Process location: /home/mysql/mariadb/columnstore/bin/DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.287570 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm1 is 4
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.290779 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm2 is 1
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.295106 |0|0|0| D 18 CAL0000: Dependent process of DBRMWorkerNode/um1 is 4
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.296215 |0|0|0| D 18 CAL0000: Dependent process of DDLProc/um1 is 0
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.296256 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral

The storage is correct on this test

getstorageconfig Wed Nov 28 16:24:15 2018

System Storage Configuration

Performance Module (DBRoot) Storage Type = external
User Module Storage Type = internal
System Assigned DBRoot Count = 2
DBRoot IDs assigned to 'pm1' = 1, 2
DBRoot IDs assigned to 'pm2' =

This was in UM1 logs, which is from ExeMgr. After this log, ExeMgr went Active. So it was trying to communicate with PM2 PrimProc, which wasnt there.

Nov 28 16:14:39 ip-172-31-35-86 joblist[21107]: 39.523080 |0|0|0| W 05 CAL0000: /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 298 Could not connect to PMS2: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 4 inet: 172.31.44.128 port: 8620

Columnstore.xml is correct, it shows 1 pm and its IP address

<PrimitiveServers>
<Count>1</Count>

<PMS1>
<IPAddr>172.31.41.24</IPAddr>
<Port>8620</Port>
</PMS1>
<PMS2>
<IPAddr>172.31.41.24</IPAddr>
<Port>8620</Port>



 Comments   
Comment by David Hill (Inactive) [ 2018-11-28 ]

workaround to get system active

mcsadmin> restartsystem y
restartsystem Wed Nov 28 16:30:57 2018
NOTE: These module(s) are DISABLED: pm2

System being restarted now ...
Successful restart of System

mcsadmin>

Comment by David Hill (Inactive) [ 2018-11-28 ]

I did reproduce issue on retest - ddl/dmlproc in man_offline

System BUSY_INIT Wed Nov 28 17:31:54 2018

Module um1 ACTIVE Wed Nov 28 17:32:53 2018
Module pm1 ACTIVE Wed Nov 28 17:29:52 2018
Module pm2 AUTO_DISABLED/DEGRADED Wed Nov 28 17:32:05 2018

Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore set for Distributed Install

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Wed Nov 28 17:29:27 2018 22710
ServerMonitor um1 ACTIVE Wed Nov 28 17:29:49 2018 23316
DBRMWorkerNode um1 ACTIVE Wed Nov 28 17:32:25 2018 25803
ExeMgr um1 ACTIVE Wed Nov 28 17:34:56 2018 25926
DDLProc um1 MAN_OFFLINE Wed Nov 28 17:32:10 2018
DMLProc um1 MAN_OFFLINE Wed Nov 28 17:32:11 2018
mysqld um1 ACTIVE Wed Nov 28 17:32:53 2018 26222

ProcessMonitor pm1 ACTIVE Wed Nov 28 17:29:06 2018 10189
ProcessManager pm1 ACTIVE Wed Nov 28 17:29:13 2018 10644
DBRMControllerNode pm1 ACTIVE Wed Nov 28 17:32:19 2018 15158
ServerMonitor pm1 ACTIVE Wed Nov 28 17:29:49 2018 11577
DBRMWorkerNode pm1 ACTIVE Wed Nov 28 17:32:29 2018 15238
PrimProc pm1 ACTIVE Wed Nov 28 17:32:34 2018 15290
WriteEngineServer pm1 ACTIVE Wed Nov 28 17:32:40 2018 15357

ProcessMonitor pm2 AUTO_OFFLINE Wed Nov 28 17:32:05 2018
ProcessManager pm2 AUTO_OFFLINE Wed Nov 28 17:33:19 2018
DBRMControllerNode pm2 AUTO_OFFLINE Wed Nov 28 17:32:05 2018
ServerMonitor pm2 AUTO_OFFLINE Wed Nov 28 17:32:05 2018
DBRMWorkerNode pm2 AUTO_OFFLINE Wed Nov 28 17:32:05 2018
PrimProc pm2 AUTO_OFFLINE Wed Nov 28 17:32:05 2018
WriteEngineServer pm2 AUTO_OFFLINE Wed Nov 28 17:32:05 2018

Active Alarm Counts: Critical = 3, Major = 1, Minor = 2, Warning = 0, Info = 0
^C
[mysql@ip-172-31-41-24 bin]$

Generated at Thu Feb 08 02:32:47 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.