[MCOL-3341] PM ExeMgr doesn't restart on User Module failure with local query enabled Created: 2019-05-30  Updated: 2023-10-26  Resolved: 2020-04-15

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.2.3
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Ben Thompson (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

2 UMs and 2 PMs with local query enabled



 Description   

Reported by customer and reproduced:

On a system with multiple UMs and local query enabled, if UM1 goes down, all ExeMgrs are stopped and restarted as part of the recovery process. The ExeMgrs on the PMs fail to start, leaving the system in this state:

System BUSY_INIT Thu May 30 14:45:51 2019

Module um1 AUTO_DISABLED/DEGRADED Thu May 30 14:45:57 2019
Module um2 FAILED Thu May 30 14:48:20 2019
Module pm1 ACTIVE Thu May 30 14:48:02 2019
Module pm2 ACTIVE Thu May 30 14:48:03 2019

Active Parent OAM Performance Module is 'pm1'
Primary Front-End MariaDB ColumnStore Module is 'um2'
Local Query Feature is enabled
MariaDB ColumnStore Replication Feature is enabled
MariaDB ColumnStore set for Distributed Install

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 AUTO_OFFLINE Thu May 30 14:45:57 2019
ServerMonitor um1 AUTO_OFFLINE Thu May 30 14:45:57 2019
DBRMWorkerNode um1 AUTO_OFFLINE Thu May 30 14:45:57 2019
ExeMgr um1 AUTO_OFFLINE Thu May 30 14:45:57 2019
DDLProc um1 AUTO_OFFLINE Thu May 30 14:45:57 2019
DMLProc um1 AUTO_OFFLINE Thu May 30 14:45:57 2019
mysqld um1 AUTO_OFFLINE Thu May 30 14:45:57 2019

ProcessMonitor um2 ACTIVE Thu May 30 14:42:22 2019 7059
ServerMonitor um2 ACTIVE Thu May 30 14:42:48 2019 7497
DBRMWorkerNode um2 ACTIVE Thu May 30 14:47:19 2019 11086
ExeMgr um2 ACTIVE Thu May 30 14:47:50 2019 11270
DDLProc um2 COLD_STANDBY Thu May 30 14:46:48 2019
DMLProc um2 COLD_STANDBY Thu May 30 14:46:49 2019
mysqld um2 ACTIVE Thu May 30 14:48:24 2019 11521

ProcessMonitor pm1 ACTIVE Thu May 30 14:41:30 2019 9303
ProcessManager pm1 ACTIVE Thu May 30 14:41:36 2019 9427
DBRMControllerNode pm1 ACTIVE Thu May 30 14:47:16 2019 23967
ServerMonitor pm1 ACTIVE Thu May 30 14:42:42 2019 11653
DBRMWorkerNode pm1 ACTIVE Thu May 30 14:47:23 2019 24115
PrimProc pm1 ACTIVE Thu May 30 14:47:32 2019 24253
ExeMgr pm1 MAN_OFFLINE Thu May 30 14:45:59 2019
WriteEngineServer pm1 ACTIVE Thu May 30 14:47:45 2019 24491
mysqld pm1 ACTIVE Thu May 30 14:48:02 2019 24952

ProcessMonitor pm2 ACTIVE Thu May 30 14:42:32 2019 7669
ProcessManager pm2 HOT_STANDBY Thu May 30 14:42:33 2019 7765
DBRMControllerNode pm2 COLD_STANDBY Thu May 30 14:47:15 2019
ServerMonitor pm2 ACTIVE Thu May 30 14:42:53 2019 8137
DBRMWorkerNode pm2 ACTIVE Thu May 30 14:47:28 2019 10444
PrimProc pm2 ACTIVE Thu May 30 14:47:36 2019 10512
ExeMgr pm2 MAN_OFFLINE Thu May 30 14:45:59 2019
WriteEngineServer pm2 ACTIVE Thu May 30 14:47:46 2019 10589
mysqld pm2 ACTIVE Thu May 30 14:48:03 2019 10855

From the pm1 logs, when ExeMgr is trying to start back up:

May 30 14:46:47 ip-172-31-38-221 ProcessMonitor[9303]: 47.487022 |0|0|0| E 18 CAL0000: Process location: not found
May 30 14:47:52 ip-172-31-38-221 ProcessMonitor[9303]: 52.591412 |0|0|0| E 18 CAL0000: Process location: not found

I think the issue is that in a separate system install, the ExeMgr Process Configuration shows it running on the UM module type, which would explain the "Process location: not found" error above when the pm1 ProcessMonitor tries to start it. It looks like additional code is needed to handle the local query option.

Process #7 Configuration information
ProcessName = ExeMgr
ModuleType = um
ProcessLocation = /usr/local/mariadb/columnstore/bin/ExeMgr
BootLaunch = 2
LaunchID = 30
DepModuleName1 = pm*
DepProcessName1 = PrimProc
RunType = LOADSHARE
LogFile = off
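
To illustrate the suspected failure mode, here is a minimal Python sketch. It is not the actual OAM/ProcessMonitor code; the names (ProcessConfig, find_location_strict, find_location_local_query, LOCAL_QUERY_PROCESSES) are hypothetical and only model the idea that a lookup keyed on ModuleType finds no ExeMgr entry for a PM, and that a local-query-aware lookup could fall back to the UM entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProcessConfig:
    process_name: str
    module_type: str       # "um" or "pm", as in the ModuleType field
    process_location: str


# Mirrors the "Process #7" entry from the ticket; nothing else is assumed.
PROCESS_CONFIG = [
    ProcessConfig("ExeMgr", "um", "/usr/local/mariadb/columnstore/bin/ExeMgr"),
]

# Assumption: which UM processes also run on PMs when local query is enabled.
LOCAL_QUERY_PROCESSES = {"ExeMgr"}


def module_type_of(module_name: str) -> str:
    """Strip the trailing module number: 'pm1' -> 'pm', 'um2' -> 'um'."""
    return module_name.rstrip("0123456789")


def find_location_strict(process_name: str, module_name: str) -> Optional[str]:
    """Look up a process only under the module's own type."""
    mtype = module_type_of(module_name)
    for cfg in PROCESS_CONFIG:
        if cfg.process_name == process_name and cfg.module_type == mtype:
            return cfg.process_location
    return None  # corresponds to "Process location: not found"


def find_location_local_query(
    process_name: str, module_name: str, local_query_enabled: bool
) -> Optional[str]:
    """Hypothetical fix: on a PM with local query, fall back to the UM entry."""
    location = find_location_strict(process_name, module_name)
    if location is not None:
        return location
    if (
        local_query_enabled
        and module_type_of(module_name) == "pm"
        and process_name in LOCAL_QUERY_PROCESSES
    ):
        for cfg in PROCESS_CONFIG:
            if cfg.process_name == process_name and cfg.module_type == "um":
                return cfg.process_location
    return None
```

With this model, find_location_strict("ExeMgr", "pm1") returns None, matching the error seen on pm1, while find_location_local_query("ExeMgr", "pm1", True) returns the configured ExeMgr path.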



 Comments   
Comment by Todd Stoffel (Inactive) [ 2020-04-15 ]

OAM is being deprecated and replaced by an enhanced API and the MaxScale orchestration project.
