[MCOL-1130] Can't restart System, When our server relocation? Created: 2017-12-22  Updated: 2018-05-07  Resolved: 2018-05-07

Status: Closed
Project: MariaDB ColumnStore
Component/s: MariaDB Server, ProcMgr
Affects Version/s: 1.0.7
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Allen Chan Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

CentOS 6.5 mariadb-columnstore-1.0.7-1-x86_64



 Description   

After server relocation,We can't restart system,and get error log as below:
mcsadmin> restartsystem
restartsystem Fri Dec 22 09:22:14 2017
Do you want to proceed: (y or n) [n]: y

System being restarted now ...

        • restartSystem Failed : API Failure return in restartSystem API

----------------
[root@HLServer110 columnstore]# cat err.log
Dec 22 09:22:23 HLServer110 ProcessManager[3268]: 23.450253 |0|0|0| E 17 CAL0000: line: 6255 Error running DBRM save_brm

----------------
[root@HLServer110 columnstore]# cat warning.log
Dec 22 09:22:19 HLServer110 ProcessManager[3268]: 19.033411 |0|0|0| W 17 CAL0000: pm2 module failed to stop!!

----------------
[root@HLServer110 columnstore]# cat info.log
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.015530 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...
Dec 22 09:22:18 HLServer110 ProcessMonitor[3177]: 18.023963 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request...
Dec 22 09:22:18 HLServer110 ProcessMonitor[3177]: 18.024402 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, STATUS_UPDATE only performed
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.034577 |0|0|0| I 17 CAL0000: RESTARTSYSTEM: Request Stop Module on pm2
Dec 22 09:22:19 HLServer110 ProcessManager[3268]: 19.033531 |0|0|0| I 17 CAL0000: RESTARTSYSTEM: ACK received from Process-Monitor, return status = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.034908 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request...
Dec 22 09:22:21 HLServer110 ProcessMonitor[3177]: 21.361166 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, return status = 0
Dec 22 09:22:24 HLServer110 ProcessManager[3268]: 24.343899 |0|0|0| I 17 CAL0000: RESTARTSYSTEM: ACK received from Process-Monitor for stopModule requests, return status = 1
Dec 22 09:22:24 HLServer110 ProcessManager[3268]: 24.344078 |0|0|0| I 17 CAL0000: RESTARTSYSTEM: ACK back to sender
Dec 22 09:22:24 HLServer110 ProcessManager[3268]: 24.344171 |0|0|0| I 17 CAL0000: Restart System Completed, status = 1

----------------
[root@HLServer110 columnstore]# tail -n 100 debug.log
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.016406 |0|0|0| D 17 CAL0000: Set System State = MAN_OFFLINE
Dec 22 09:22:18 HLServer110 ProcessMonitor[3177]: 18.017023 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = MAN_OFFLINE
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.022752 |0|0|0| D 17 CAL0000: pm1 module is stopped by request.
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.022852 |0|0|0| D 17 CAL0000: Set Module pm1 State = 2
Dec 22 09:22:18 HLServer110 ProcessMonitor[3177]: 18.023380 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_INIT
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.023511 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.026061 |0|0|0| D 17 CAL0000: pm2 module is stopped by request.
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.026153 |0|0|0| D 17 CAL0000: Set Module pm2 State = 2
Dec 22 09:22:18 HLServer110 ProcessMonitor[3177]: 18.026692 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm2 State = MAN_INIT
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.032138 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm2
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.034787 |0|0|0| D 17 CAL0000: pm2 module is stopped by request.
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.034865 |0|0|0| D 17 CAL0000: Set Module pm2 State = 2
Dec 22 09:22:18 HLServer110 ProcessMonitor[3177]: 18.035424 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm2 State = MAN_INIT
Dec 22 09:22:18 HLServer110 ProcessManager[3268]: 18.040640 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm2
Dec 22 09:22:19 HLServer110 ProcessManager[3268]: 19.033708 |0|0|0| D 17 CAL0000: pm1 module is stopped by request.
Dec 22 09:22:19 HLServer110 ProcessManager[3268]: 19.033782 |0|0|0| D 17 CAL0000: Set Module pm1 State = 2
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.034357 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_INIT
Dec 22 09:22:19 HLServer110 ProcessManager[3268]: 19.034482 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.036065 |0|0|0| D 18 CAL0000: STOPPING Process: DMLProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.036214 |0|0|0| D 18 CAL0000: StatusUpdate of Process DMLProc State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.036310 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DMLProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.038223 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DMLProc State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.038370 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DMLProc State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.041100 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DMLProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.057360 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DMLProc*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.057490 |0|0|0| D 18 CAL0000: STOPPING Process: DDLProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.057618 |0|0|0| D 18 CAL0000: StatusUpdate of Process DDLProc State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.057712 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DDLProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.059941 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DDLProc State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.060084 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DDLProc State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.061109 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DDLProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.079658 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DDLProc*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.079785 |0|0|0| D 18 CAL0000: STOPPING Process: WriteEngineServer
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.079902 |0|0|0| D 18 CAL0000: StatusUpdate of Process WriteEngineServer State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.079960 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device WriteEngineServer
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.082031 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/WriteEngineServer State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.082157 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/WriteEngineServer State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.082897 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device WriteEngineServer
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.100912 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: WriteEngineServ*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.101037 |0|0|0| D 18 CAL0000: STOPPING Process: ExeMgr
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.101147 |0|0|0| D 18 CAL0000: StatusUpdate of Process ExeMgr State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.101246 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device ExeMgr
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.102941 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/ExeMgr State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.103065 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/ExeMgr State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.104149 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device ExeMgr
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.122374 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: ExeMgr*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.122497 |0|0|0| D 18 CAL0000: STOPPING Process: PrimProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.122606 |0|0|0| D 18 CAL0000: StatusUpdate of Process PrimProc State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.122690 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device PrimProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.124526 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/PrimProc State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.124654 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/PrimProc State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.125791 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device PrimProc
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.143851 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: PrimProc*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.143976 |0|0|0| D 18 CAL0000: STOPPING Process: DecomSvr
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.144092 |0|0|0| D 18 CAL0000: StatusUpdate of Process DecomSvr State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.144154 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DecomSvr
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.145723 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DecomSvr State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.145879 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DecomSvr State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.147011 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DecomSvr
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.165526 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DecomSvr*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.165645 |0|0|0| D 18 CAL0000: STOPPING Process: DBRMWorkerNode
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.165756 |0|0|0| D 18 CAL0000: StatusUpdate of Process DBRMWorkerNode State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.165849 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DBRMWorkerNode
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.167865 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DBRMWorkerNode State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.167979 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DBRMWorkerNode State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.169042 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DBRMWorkerNode
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.187243 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: workernode*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.187363 |0|0|0| D 18 CAL0000: STOPPING Process: ServerMonitor
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.187475 |0|0|0| D 18 CAL0000: StatusUpdate of Process ServerMonitor State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.187559 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device ServerMonitor
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.189784 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/ServerMonitor State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.189941 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/ServerMonitor State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.190853 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device ServerMonitor
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.208791 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: ServerMonitor*
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.208913 |0|0|0| D 18 CAL0000: STOPPING Process: DBRMControllerNode
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.209031 |0|0|0| D 18 CAL0000: StatusUpdate of Process DBRMControllerNode State = 0 PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.209116 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DBRMControllerNode
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.211139 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DBRMControllerNode State = MAN_OFFLINE
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.211254 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DBRMControllerNode State = MAN_OFFLINE PID = 0
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.212667 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DBRMControllerNode
Dec 22 09:22:19 HLServer110 ProcessMonitor[3177]: 19.230747 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: controllernode*
Dec 22 09:22:21 HLServer110 ProcessMonitor[3177]: 21.268807 |0|0|0| D 18 CAL0000: BRM reset_locks script run
Dec 22 09:22:21 HLServer110 ProcessMonitor[3177]: 21.269607 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/mysqld State = MAN_OFFLINE
Dec 22 09:22:21 HLServer110 ProcessMonitor[3177]: 21.269735 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/mysqld State = MAN_OFFLINE PID = 0
Dec 22 09:22:21 HLServer110 ProcessMonitor[3177]: 21.359965 |0|0|0| D 18 CAL0000: Stop MySQL Process
Dec 22 09:22:21 HLServer110 ProcessMonitor[3177]: 21.360778 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_OFFLINE
Dec 22 09:22:21 HLServer110 ProcessMonitor[3177]: 21.360897 |0|0|0| D 18 CAL0000: unmountExtraDBroots called
Dec 22 09:22:21 HLServer110 ProcessManager[3268]: 21.361308 |0|0|0| D 17 CAL0000: pm1 module is successfully stopped.
Dec 22 09:22:21 HLServer110 ProcessManager[3268]: 21.365243 |0|0|0| D 17 CAL0000: Running reset_locks
Dec 22 09:22:23 HLServer110 ProcessManager[3268]: 23.401485 |0|0|0| D 17 CAL0000: Ran reset_locks
Dec 22 09:22:23 HLServer110 ProcessManager[3268]: 23.401677 |0|0|0| D 17 CAL0000: Running DBRM save_brm
Dec 22 09:22:24 HLServer110 ProcessManager[3268]: 24.343707 |0|0|0| D 17 CAL0000: Successfully ran DBRM clearShm

How can i recovery this question?



 Comments   
Comment by David Thompson (Inactive) [ 2017-12-22 ]

That API error can occur if the ProcMon/ProcMgr processes are not running (these are started by the systemd service).

Can you try shutdownSystem then startSystem?

What does systemctl status columnstore show?

Comment by Allen Chan [ 2017-12-24 ]

Yes, I try to reboot the CentOS system,and question already exists. I checked the ProcMon/ProcMgr processes service is exists. but there is not responsing. How can i check the status of ColumnStore is running normal step by step.

Comment by David Thompson (Inactive) [ 2017-12-24 ]

This is a good general guide: https://mariadb.com/kb/en/library/system-troubleshooting-mariadb-columnstore/

Also since this is a new environment can you run cluster tester:
https://mariadb.com/kb/en/library/mariadb-columnstore-cluster-test-tool/

This tests for common system issues that prevent it from running.

Otherwise as i said check the output of systemctl status columnstore is clear on each node. What does mcsadmin getProcessStatus show after trying to start the system?

Also check your logs in /var/log/mariadb/columnstore as usually something will be logged there. Finally if the system never came up cleanly after first install it's likely you'll need to rerun postConfigure. This writes logs to varying log files under /tmp so it's also worth checking there and re-running postConfigure.

Comment by David Thompson (Inactive) [ 2018-05-07 ]

Please advise if you still have any questions or issues.

Generated at Thu Feb 08 02:26:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.