Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Won't Do
-
1.2.2
-
None
-
Amazon AMI with EBS 1um and 2 pm with EBS
Description
Started with 1um 2pm all Active and stopped PM2 instance.
Notice ExeMgr was in a BUST_INIT state on UM1 for a long time, over 2 minutes. This is not normal
And DDL/DMLProc remained MAN_OFFLINE. DDL/DML didnt start up because ExeMgr in BUSY_INIT too long. code gave up too soon on bring up DDL/DML
System BUSY_INIT Wed Nov 28 16:11:40 2018
Module um1 ACTIVE Wed Nov 28 16:12:39 2018
Module pm1 ACTIVE Wed Nov 28 16:08:01 2018
Module pm2 AUTO_DISABLED/DEGRADED Wed Nov 28 16:11:51 2018
Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore set for Distributed Install
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Wed Nov 28 16:07:37 2018 17077
ServerMonitor um1 ACTIVE Wed Nov 28 16:07:58 2018 17683
DBRMWorkerNode um1 ACTIVE Wed Nov 28 16:12:10 2018 20984
ExeMgr um1 ACTIVE Wed Nov 28 16:14:39 2018 21107
DDLProc um1 MAN_OFFLINE Wed Nov 28 16:11:55 2018
DMLProc um1 MAN_OFFLINE Wed Nov 28 16:11:56 2018
mysqld um1 ACTIVE Wed Nov 28 16:12:39 2018 21403
ProcessMonitor pm1 ACTIVE Wed Nov 28 16:07:16 2018 23923
ProcessManager pm1 ACTIVE Wed Nov 28 16:07:22 2018 24379
DBRMControllerNode pm1 ACTIVE Wed Nov 28 16:12:05 2018 31560
ServerMonitor pm1 ACTIVE Wed Nov 28 16:07:58 2018 25265
DBRMWorkerNode pm1 ACTIVE Wed Nov 28 16:12:14 2018 31640
PrimProc pm1 ACTIVE Wed Nov 28 16:12:20 2018 31690
WriteEngineServer pm1 ACTIVE Wed Nov 28 16:12:26 2018 31751
ProcessMonitor pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
ProcessManager pm2 AUTO_OFFLINE Wed Nov 28 16:13:04 2018
DBRMControllerNode pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
ServerMonitor pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
DBRMWorkerNode pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
PrimProc pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
WriteEngineServer pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018
Active Alarm Counts: Critical = 3, Major = 1, Minor = 2, Warning = 0, Info = 0
DDL/DML didnt start up because ExeMgr in BUSY_INIT – log files from um1
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.256138 |0|0|0| D 18 CAL0000: STARTING Process: DDLProc
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.256167 |0|0|0| D 18 CAL0000: Process location: /home/mysql/mariadb/columnstore/bin/DDLProc
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.257881 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm1 is 4
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.261321 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm2 is 1
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.268606 |0|0|0| D 18 CAL0000: Dependent process of DBRMWorkerNode/um1 is 4
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.269818 |0|0|0| D 18 CAL0000: Dependent process of ExeMgr/um1 is 19
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.272698 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral
Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.272776 |0|0|0| I 18 CAL0000: START: ACK back to ProcMgr, return status = 8
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.281977 |0|0|0| I 18 CAL0000: MSG RECEIVED: Start process request on: DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.284498 |0|0|0| D 18 CAL0000: checkSpecialProcessState on : DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285704 |0|0|0| D 18 CAL0000: checkSpecialProcessState status return : 2
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285823 |0|0|0| D 18 CAL0000: STARTING Process: DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.285857 |0|0|0| D 18 CAL0000: Process location: /home/mysql/mariadb/columnstore/bin/DMLProc
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.287570 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm1 is 4
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.290779 |0|0|0| D 18 CAL0000: Dependent process of WriteEngineServer/pm2 is 1
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.295106 |0|0|0| D 18 CAL0000: Dependent process of DBRMWorkerNode/um1 is 4
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.296215 |0|0|0| D 18 CAL0000: Dependent process of DDLProc/um1 is 0
Nov 28 16:12:33 ip-172-31-35-86 ProcessMonitor[17077]: 33.296256 |0|0|0| D 18 CAL0000: Dependent Process is not in correct state, Failed Restoral
The storage is correct on this test
getstorageconfig Wed Nov 28 16:24:15 2018
System Storage Configuration
Performance Module (DBRoot) Storage Type = external
User Module Storage Type = internal
System Assigned DBRoot Count = 2
DBRoot IDs assigned to 'pm1' = 1, 2
DBRoot IDs assigned to 'pm2' =
This was in UM1 logs, which is from ExeMgr. After this log, ExeMgr went Active. So it was trying to communicate with PM2 PrimProc, which wasnt there.
Nov 28 16:14:39 ip-172-31-35-86 joblist[21107]: 39.523080 |0|0|0| W 05 CAL0000: /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 298 Could not connect to PMS2: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 4 inet: 172.31.44.128 port: 8620
Columnstore.xml is correct, it shows 1 pm and its IP address
<PrimitiveServers>
<Count>1</Count>
<PMS1>
<IPAddr>172.31.41.24</IPAddr>
<Port>8620</Port>
</PMS1>
<PMS2>
<IPAddr>172.31.41.24</IPAddr>
<Port>8620</Port>