[MCOL-1965] 1um2pm stack failed to start after being shut down Created: 2018-11-26  Updated: 2023-10-26  Resolved: 2018-11-26

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.2.2
Fix Version/s: None

Type: Bug Priority: Blocker
Reporter: Daniel Lee (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None


 Description   

Build tested: 2018-11.25 nightly build, Lentos 7 RPM

1. Installed 1um2pm stack
2. create and loaded 1g DBT3 database
3. ran select count on lineitem
4. Shutdownsystem on pm1
5. startsystem <password> on pm1

mcsadmin> shutdownsystem
shutdownsystem Mon Nov 26 14:57:58 2018

This command stops the processing of applications on all Modules within the MariaDB ColumnStore System

Checking for active transactions
Do you want to proceed: (y or n) [n]: y

Stopping System...
Successful stop of System

Shutting Down System...
Successful shutdown of System

mcsadmin> startsystem vagrant
startsystem Mon Nov 26 14:58:59 2018

startSystem command, 'columnstore' service is down, sending command to
start the 'columnstore' service on all modules

System being started, please wait........

TIMEOUT: ProcMon not responding to getSystemStatus

        • startSystem Failed : check log files
          mcsadmin> getprocessstatus
          getprocessstatus Mon Nov 26 15:02:40 2018

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Mon Nov 26 14:59:33 2018 12948
ServerMonitor um1 ACTIVE Mon Nov 26 14:59:52 2018 13396
DBRMWorkerNode um1 FAILED Mon Nov 26 14:59:50 2018
ExeMgr um1 INITIAL
DDLProc um1 INITIAL
DMLProc um1 INITIAL
mysqld um1 ACTIVE Mon Nov 26 14:59:53 2018 13352

ProcessMonitor pm1 ACTIVE Mon Nov 26 14:59:16 2018 2889
ProcessManager pm1 ACTIVE Mon Nov 26 14:59:22 2018 3056
DBRMControllerNode pm1 ACTIVE Mon Nov 26 14:59:47 2018 3684
ServerMonitor pm1 ACTIVE Mon Nov 26 14:59:49 2018 3704
DBRMWorkerNode pm1 ACTIVE Mon Nov 26 14:59:49 2018 3743
PrimProc pm1 ACTIVE Mon Nov 26 14:59:53 2018 3877
WriteEngineServer pm1 ACTIVE Mon Nov 26 14:59:54 2018 3921

ProcessMonitor pm2 ACTIVE Mon Nov 26 14:59:40 2018 4373
ProcessManager pm2 HOT_STANDBY Mon Nov 26 14:59:41 2018 4419
DBRMControllerNode pm2 INITIAL
ServerMonitor pm2 INITIAL
DBRMWorkerNode pm2 INITIAL
PrimProc pm2 INITIAL
WriteEngineServer pm2 INITIAL
mcsadmin> quit

logs on pm1:

[root@localhost columnstore]# cat crit.log
Nov 26 14:59:57 localhost ProcessManager[3056]: 57.902720 |0|0|0| C 17 CAL0000: startMgrProcessThread Exit with a failure, error returned from startSystemThread
[root@localhost columnstore]# cat err.log
Nov 26 14:58:19 localhost controllernode[3278]: 19.063429 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::setSystemState() failed (network)
Nov 26 14:58:19 localhost controllernode[3278]: 19.064352 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::setSystemState() failed (network)
Nov 26 14:58:19 localhost controllernode[3278]: 19.971126 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Nov 26 14:59:57 localhost ProcessManager[3056]: 57.902720 |0|0|0| C 17 CAL0000: startMgrProcessThread Exit with a failure, error returned from startSystemThread



 Comments   
Comment by David Hill (Inactive) [ 2018-11-26 ]

the dbrm files are being downloaded into the incorrect directory. the main files are in /root instead of the dbrm directory causing the issue

Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.258585 |0|0|0| D 18 CAL0000: getDBRMdata called
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.275465 |0|0|0| D 18 CAL0000: files
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.275910 |0|0|0| D 18 CAL0000: 6
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.277133 |0|0|0| D 18 CAL0000: BRM_saves_current
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.277226 |0|0|0| D 18 CAL0000: 10
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.277481 |0|0|0| D 18 CAL0000: BRM_saves_em
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.277526 |0|0|0| D 18 CAL0000: 14828
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.277700 |0|0|0| D 18 CAL0000: BRM_saves_vbbm
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.277801 |0|0|0| D 18 CAL0000: 9516
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.278077 |0|0|0| D 18 CAL0000: BRM_saves_vss
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.278138 |0|0|0| D 18 CAL0000: 10376
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.278488 |0|0|0| D 18 CAL0000: /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves_current
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.278547 |0|0|0| D 18 CAL0000: 10
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.284556 |0|0|0| D 18 CAL0000: /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/oidbitmap
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.284715 |0|0|0| D 18 CAL0000: 2099206
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.289766 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 27 on device DBRM
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.322222 |0|0|0| D 18 CAL0000: BRM reset_locks script run
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.700546 |0|0|0| D 18 CAL0000: Clear Shared Memory script run
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.700812 |0|0|0| D 18 CAL0000: load_brm cmd = /usr/local/mariadb/columnstore/bin/load_brm /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves > /var/log/mariadb/columnstore/load_brm.log1 2>&1
Nov 26 14:59:50 localhost controllernode[13417]: 50.729524 |0|0|0| C 29 CAL0000: ExtentMap::load(): open: No such file or directory
Nov 26 14:59:50 localhost ProcessMonitor[12948]: 50.734250 |0|0|0| E 18 CAL0000: Error return DBRM load_brm

[root@localhost dbrm]# ll
total 2056
rw-rr-. 1 root root 10 Nov 26 14:59 BRM_saves_current
rw-rr-. 1 root root 0 Nov 26 14:59 BRM_saves_journal
rw-rr-. 1 root root 2099206 Nov 26 14:59 oidbitmap

[root@localhost dbrm]# cd
[root@localhost ~]# ll
rw-rr-. 1 root root 10 Nov 26 14:59 BRM_saves_current
rw-rr-. 1 root root 14828 Nov 26 14:59 BRM_saves_em
rw-rr-. 1 root root 9516 Nov 26 14:59 BRM_saves_vbbm
rw-rr-. 1 root root 10376 Nov 26 14:59 BRM_saves_vss

Comment by Patrick LeBlanc (Inactive) [ 2018-11-26 ]

This is likely a bug in my impl for MCOL-1558.

Comment by Daniel Lee (Inactive) [ 2018-11-26 ]

Build verified: build made by buildbot on 2018-11-26 for pull request 643.

Generated at Thu Feb 08 02:32:41 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.