[MCOL-1977] Amazon AMI with EBS - pm failover failed - ExeMgr in BUSY_INIT for 2 minutes Created: 2018-11-28 Updated: 2021-01-16 Resolved: 2021-01-16 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ExeMgr |
| Affects Version/s: | 1.2.2 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | David Hill (Inactive) | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Amazon AMI with EBS: 1 um and 2 pm with EBS |
||
| Description |
|
Started with 1 um and 2 pm, all Active, then stopped the PM2 instance.

System BUSY_INIT Wed Nov 28 16:11:40 2018
Module um1 ACTIVE Wed Nov 28 16:12:39 2018
Active Parent OAM Performance Module is 'pm1'

MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
ProcessMonitor pm1 ACTIVE Wed Nov 28 16:07:16 2018 23923
ProcessMonitor pm2 AUTO_OFFLINE Wed Nov 28 16:11:51 2018

Active Alarm Counts: Critical = 3, Major = 1, Minor = 2, Warning = 0, Info = 0

DDL/DML didn't start up because ExeMgr was in BUSY_INIT. Log files from um1:

Nov 28 16:12:32 ip-172-31-35-86 ProcessMonitor[17077]: 32.256138 |0|0|0| D 18 CAL0000: STARTING Process: DDLProc

The storage is correct on this test:

getstorageconfig Wed Nov 28 16:24:15 2018
System Storage Configuration
Performance Module (DBRoot) Storage Type = external

The following was in the UM1 logs and comes from ExeMgr. After this log entry, ExeMgr went Active. So it was trying to communicate with PM2 PrimProc, which wasn't there.

Nov 28 16:14:39 ip-172-31-35-86 joblist[21107]: 39.523080 |0|0|0| W 05 CAL0000: /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 298 Could not connect to PMS2: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 4 inet: 172.31.44.128 port: 8620

Columnstore.xml is correct; it shows 1 pm and its IP address:

<PrimitiveServers> <PMS1> |
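The "Connection timed out" in the joblist log is what a TCP connect typically reports when the target host silently drops packets, as a stopped instance would, rather than actively refusing the connection. The sketch below is only an illustration of that distinction, not part of ColumnStore: the `probe` helper and its default timeout are hypothetical, and the address and port in the comment are the ones from the log above.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a single TCP connect attempt (hypothetical helper).

    Returns "open", "refused", "timeout", or "error: <reason>".
    """
    try:
        # create_connection applies the timeout to the connect() call itself.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host is reachable but nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout, e.g. the host is down.
        return "timeout"
    except OSError as exc:
        # Anything else, such as "no route to host".
        return f"error: {exc}"


# Example (not executed here): probe("172.31.44.128", 8620)
```

Run from um1 against the stopped PM2 address, a result other than "open" would be in line with the ExeMgr log entry; it says nothing about why ExeMgr kept trying that address.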
| Comments |
| Comment by David Hill (Inactive) [ 2018-11-28 ] |
|
Workaround to get the system active:

mcsadmin> restartsystem y
System being restarted now ...
mcsadmin> |
| Comment by David Hill (Inactive) [ 2018-11-28 ] |
|
I did reproduce the issue on retest: ddl/dmlproc in man_offline.

System BUSY_INIT Wed Nov 28 17:31:54 2018
Module um1 ACTIVE Wed Nov 28 17:32:53 2018
Active Parent OAM Performance Module is 'pm1'

MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
ProcessMonitor pm1 ACTIVE Wed Nov 28 17:29:06 2018 10189
ProcessMonitor pm2 AUTO_OFFLINE Wed Nov 28 17:32:05 2018

Active Alarm Counts: Critical = 3, Major = 1, Minor = 2, Warning = 0, Info = 0 |