[MCOL-891] Get "Could not get a ExeMgr connection." - restart did not clear the error Created: 2017-08-24  Updated: 2017-11-27  Resolved: 2017-11-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.0.10
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Daniel Jackman (Inactive) Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Virtual Box Version 5.1.26 r117224 (Qt5.6.2) - Running on OS X
Linux ubuntu-03 4.9.13


Attachments: pm1_configReport.txt, pm1_hardwareReport.txt, pm1_logReport.tar.gz, pm1_resourceReport.txt, pm1_softwareReport.txt, pm2_configReport.txt, pm2_logReport.tar.gz, pm2_resourceReport.txt, pm2_softwareReport.txt, um1_configReport.txt, um1_dbmsReport.txt, um1_hardwareReport.txt, um1_logReport.tar.gz, um1_logReport.txt, um1_mysqllogReport.tar.gz, um1_resourceReport.txt, um1_softwareReport.txt, um2_configReport.txt, um2_hardwareReport.txt, um2_logReport.tar.gz, um2_resourceReport.txt, um2_softwareReport.txt

 Description   

Problem

Connecting to the cluster from a client returns the following error:

ERROR 1815 (HY000): Internal error: IDB-2004: Cannot connect to ExeMgr.

Restarting ColumnStore via mcsadmin did not clear this error.
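
As a first diagnostic, it can help to check whether the ExeMgr TCP port accepts connections at all from the host where mysqld runs. The following is a minimal Python sketch and not part of the original report: the function name `can_connect` is hypothetical, and the port 8601 used in the note below is an assumption (believed to be the usual ExeMgr port in Columnstore.xml; verify it against your own configuration).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `can_connect("um1", 8601)` returning False would be consistent with ExeMgr being MAN_OFFLINE on that module. The hostname `um1` is assumed here to resolve to the module of the same name.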

mcsadmin> getSystemStatus
getsystemstatus   Fri Aug 18 23:28:07 2017
 
System columnstore-1
 
System and Module statuses
 
Component     Status                       Last Status Change
------------  --------------------------   ------------------------
System        ACTIVE                       Fri Aug 18 23:26:37 2017
 
Module um1    MAN_OFFLINE                  Fri Aug 18 22:03:34 2017
Module um2    MAN_OFFLINE                  Fri Aug 18 22:03:37 2017
Module pm1    MAN_INIT                     Fri Aug 18 22:03:40 2017
Module pm2    MAN_OFFLINE                  Fri Aug 18 22:03:40 2017
 
Active Parent OAM Performance Module is 'pm1'
Primary Front-End MariaDB ColumnStore Module is 'um1'
Local Query Feature is enabled
MariaDB ColumnStore Replication Feature is enabled

mcsadmin> getProcessStatus
getprocessstatus   Fri Aug 18 23:28:28 2017
 
MariaDB ColumnStore Process statuses
 
Process             Module    Status            Last Status Change        Process ID
------------------  ------    ---------------   ------------------------  ----------
ProcessMonitor      um1       ACTIVE            Wed Aug 16 16:34:05 2017        1099
ServerMonitor       um1       FAILED            Fri Aug 18 23:19:42 2017        6793
DBRMWorkerNode      um1       ACTIVE            Fri Aug 18 23:19:22 2017        6825
ExeMgr              um1       MAN_OFFLINE       Fri Aug 18 23:18:58 2017
DDLProc             um1       MAN_OFFLINE       Fri Aug 18 23:18:58 2017
DMLProc             um1       MAN_OFFLINE       Fri Aug 18 23:18:58 2017
mysqld              um1       ACTIVE            Fri Aug 18 23:19:25 2017        6752
 
ProcessMonitor      um2       ACTIVE            Wed Aug 16 16:34:05 2017        1084
ServerMonitor       um2       FAILED            Fri Aug 18 23:19:41 2017       24821
DBRMWorkerNode      um2       ACTIVE            Fri Aug 18 23:19:29 2017       24853
ExeMgr              um2       MAN_OFFLINE       Fri Aug 18 23:19:01 2017
DDLProc             um2       MAN_OFFLINE       Fri Aug 18 23:19:01 2017
DMLProc             um2       MAN_OFFLINE       Fri Aug 18 23:19:01 2017
mysqld              um2       ACTIVE            Fri Aug 18 23:19:31 2017       24775
 
ProcessMonitor      pm1       ACTIVE            Wed Aug 16 16:33:55 2017        1267
ProcessManager      pm1       ACTIVE            Wed Aug 16 16:34:01 2017        1554
DBRMControllerNode  pm1       ACTIVE            Fri Aug 18 23:19:19 2017       23270
ServerMonitor       pm1       FAILED            Fri Aug 18 23:19:49 2017       23750
DBRMWorkerNode      pm1       ACTIVE            Fri Aug 18 23:19:21 2017       23353
DecomSvr            pm1       ACTIVE            Fri Aug 18 23:19:25 2017       23525
PrimProc            pm1       ACTIVE            Fri Aug 18 23:19:27 2017       23601
ExeMgr              pm1       ACTIVE            Fri Aug 18 23:19:48 2017       26317
WriteEngineServer   pm1       ACTIVE            Fri Aug 18 23:19:51 2017       26501
mysqld              pm1       ACTIVE            Fri Aug 18 23:19:48 2017       23024
 
ProcessMonitor      pm2       ACTIVE            Wed Aug 16 16:34:11 2017        1069
ProcessManager      pm2       HOT_STANDBY       Fri Aug 18 23:19:12 2017       19507
DBRMControllerNode  pm2       COLD_STANDBY      Fri Aug 18 23:19:30 2017
ServerMonitor       pm2       FAILED            Fri Aug 18 23:19:53 2017       19919
DBRMWorkerNode      pm2       ACTIVE            Fri Aug 18 23:19:38 2017       19974
DecomSvr            pm2       ACTIVE            Fri Aug 18 23:19:42 2017       19990
PrimProc            pm2       ACTIVE            Fri Aug 18 23:19:44 2017       20024
ExeMgr              pm2       ACTIVE            Fri Aug 18 23:19:48 2017       20053
WriteEngineServer   pm2       ACTIVE            Fri Aug 18 23:19:53 2017       20075
mysqld              pm2       ACTIVE            Fri Aug 18 23:19:33 2017       19770
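
With this many rows, the unhealthy processes are easy to overlook. The sketch below is an illustration, not a ColumnStore tool: it parses text in the getProcessStatus layout shown above and lists every process whose status is not ACTIVE, treating the standby states as healthy. The function names and the regular expression are assumptions derived only from the output in this ticket; other versions may format it differently.

```python
import re

# Matches data rows such as:
#   ExeMgr              um1       MAN_OFFLINE       Fri Aug 18 23:18:58 2017
# The trailing process ID is optional because offline processes have none.
ROW = re.compile(
    r"^(?P<process>\S+)\s+(?P<module>(?:um|pm)\d+)\s+(?P<status>[A-Z_]+)\s+"
    r"(?P<changed>\w{3} \w{3}\s+\d+ [\d:]{8} \d{4})(?:\s+(?P<pid>\d+))?\s*$"
)


def parse_process_status(text):
    """Return one dict per process row; headers and separators are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def not_active(rows, standby_ok=True):
    """Return the rows whose status is not considered healthy."""
    healthy = {"ACTIVE"}
    if standby_ok:
        healthy |= {"HOT_STANDBY", "COLD_STANDBY"}
    return [row for row in rows if row["status"] not in healthy]
```

Applied to the output above, `not_active(parse_process_status(text))` would return 10 rows: ServerMonitor (FAILED) on all four modules, plus ExeMgr, DDLProc, and DMLProc (MAN_OFFLINE) on um1 and um2.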

Reproduce

The environment was created as follows:
1. VirtualBox VM (2GB RAM assigned)
2. Docker containers for 2 x UM and 2 x PM running Ubuntu

The customer was up and running, with data loaded via cpimport.

We wanted to visualize the data, so we ran Metabase in another container via

docker run -d -p 3000:3000 --name metabase metabase/metabase

3. After the Metabase container has started, open it on port 3000 and connect it to the ColumnStore cluster

4. While Metabase was connecting to the ColumnStore cluster, the cluster started to report the above error. Something in Metabase's startup process or metadata gathering apparently caused ColumnStore to fail

We built the cluster again and reproduced this a second time.

Solution

  • Fail gracefully without leaving the cluster in an unusable state
  • Provide a workaround to clear the error

Workaround

None. We had to rebuild the ColumnStore cluster.



 Comments   
Comment by David Thompson (Inactive) [ 2017-09-21 ]

Unfortunately, there are no useful logs in here. I'd suspect either low memory or a setup issue with the containers: no logs were preserved, which suggests that syslog was not running.
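
To check the low-memory hypothesis on a rebuilt environment, one option is to sample the available memory inside the VM while Metabase connects. This is a minimal Python sketch, assuming a Linux kernel that exposes MemAvailable in /proc/meminfo; the function names are hypothetical.

```python
def mem_available_kib(meminfo_text):
    """Return the MemAvailable value (in kB as reported) or None if absent."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # Line format: "MemAvailable:   123456 kB"
            return int(line.split()[1])
    return None


def read_mem_available_kib(path="/proc/meminfo"):
    """Read the given meminfo file and return its MemAvailable value."""
    with open(path) as handle:
        return mem_available_kib(handle.read())
```

Calling `read_mem_available_kib()` every few seconds during the Metabase connection step would show whether available memory approaches zero before ExeMgr goes offline.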
