[MCOL-2215]  MCS remained MAN_OFFLINE after OAM restartSystem Created: 2019-03-06  Updated: 2021-02-21  Resolved: 2021-02-21

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: 1.2.3
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Environment:

mcs single server, centOS7


Attachments: Text File cpimport.log     Text File mcs_debug.log    

 Description   

MCS remained MAN_OFFLINE after OAM restartSystem

that problem is observed during BulkLoad activities
done with cpimport ;
MCS was restarted due to Swap Space usage over Major threashold, perform OAM command restartSystem
but PrimProc,ExeMgr,WriteEngineServer,DDLProc,DMLProc remained MAN_OFFLINE

the problem is that mcs system does not recover automatically as it's expected and it's needed
it's needed recover mcs manually ; that will increase significantly mcs system downtime

note : if try to recover with mcsadmin restart system
are returned messages about table locks but locks are not returned and operation loop over
The following tables are locked:
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused

how to repeat :
run autopilot concurrency test module

concurrency.concurCpimport
concurrency.concurDDL
concurrency.concurDML
concurrency.queryAndDataMod  ==>MCS  become MAN_OFFLINE

mcsadmin getsystemi
getsysteminfo   Wed Mar  6 14:51:09 2019
 
System columnstore-1
 
System and Module statuses
 
Component     Status                       Last Status Change
------------  --------------------------   ------------------------
System        ACTIVE                       Tue Mar  5 20:53:32 2019
 
Module pm1    FAILED                       Tue Mar  5 20:53:26 2019
 
 
MariaDB ColumnStore Process statuses
 
Process             Module    Status            Last Status Change        Process ID
------------------  ------    ---------------   ------------------------  ----------
ProcessMonitor      pm1       ACTIVE            Fri Mar  1 14:14:24 2019        5616
ProcessManager      pm1       ACTIVE            Fri Mar  1 14:14:30 2019        5771
DBRMControllerNode  pm1       ACTIVE            Tue Mar  5 20:53:08 2019       19745
ServerMonitor       pm1       FAILED            Tue Mar  5 20:53:39 2019       21541
DBRMWorkerNode      pm1       ACTIVE            Tue Mar  5 20:53:10 2019       20022
PrimProc            pm1       MAN_OFFLINE       Tue Mar  5 20:52:54 2019
ExeMgr              pm1       MAN_OFFLINE       Tue Mar  5 20:52:54 2019
WriteEngineServer   pm1       MAN_OFFLINE       Tue Mar  5 20:52:54 2019
DDLProc             pm1       MAN_OFFLINE       Tue Mar  5 20:52:54 2019
DMLProc             pm1       MAN_OFFLINE       Tue Mar  5 20:52:53 2019
mysqld              pm1       ACTIVE            Tue Mar  5 20:53:06 2019       19397
 
Active Alarm Counts: Critical = 1, Major = 3, Minor = 6, Warning = 0, Info = 0
[root@cps queryAndDataMod]#

Mar  5 20:48:44 cps1 cpimport.bin[14506]: 44.948994 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-35369; status-SUCCESS
Mar  5 20:48:45 cps1 cpimport.bin[15877]: 45.021926 |0|0|0| I 34 CAL0086: Initiating BulkLoad: -P pm1-15877 -u7a3e6085-d0c2-4cc9-a401-3d4f7b9f018d mytest lineitem /root/autopilot__1_2_3/data/source/dbt3/10g/lineitem.tbl
Mar  5 20:48:45 cps1 cpimport.bin[15877]: 45.078612 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-35369; db-mytest
Mar  5 20:48:52 cps1 ServerMonitor[8828]: 52.779394 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of  79
Mar  5 20:49:52 cps1 ServerMonitor[8828]: 52.783352 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of  79
Mar  5 20:50:40 cps1 cpimport.bin[15877]: 40.895053 |0|0|0| I 34 CAL0083: BulkLoad: JobId-35369; finished loading table mytest.lineitem; 59986052 rows inserted
Mar  5 20:50:40 cps1 writeengine[15877]: 40.895091 |0|0|0| I 19 CAL0008: Bulkload |Job: /usr/local/mariadb/columnstore/data/bulk/tmpjob/35369_D20190305_T204845_S059001_Job_35369.xml |For table mytest.lineitem: 59986052 rows processed and 59986052 rows inserted.
Mar  5 20:50:41 cps1 cpimport.bin[15877]: 41.332351 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-35369; status-SUCCESS
Mar  5 20:50:41 cps1 cpimport.bin[17304]: 41.405682 |0|0|0| I 34 CAL0086: Initiating BulkLoad: -P pm1-17304 -u9d4f1717-3b9a-4008-b16f-77b03c5849a6 mytest lineitem /root/autopilot__1_2_3/data/source/dbt3/10g/lineitem.tbl
Mar  5 20:50:41 cps1 cpimport.bin[17304]: 41.460399 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-35369; db-mytest
Mar  5 20:50:52 cps1 ServerMonitor[8828]: 52.787134 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of  79
Mar  5 20:51:52 cps1 ServerMonitor[8828]: 52.790870 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of  79
Mar  5 20:52:34 cps1 cpimport.bin[17304]: 34.705426 |0|0|0| I 34 CAL0083: BulkLoad: JobId-35369; finished loading table mytest.lineitem; 59986052 rows inserted
Mar  5 20:52:34 cps1 writeengine[17304]: 34.705466 |0|0|0| I 19 CAL0008: Bulkload |Job: /usr/local/mariadb/columnstore/data/bulk/tmpjob/35369_D20190305_T205041_S441865_Job_35369.xml |For table mytest.lineitem: 59986052 rows processed and 59986052 rows inserted.
Mar  5 20:52:34 cps1 cpimport.bin[17304]: 34.732453 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-35369; status-SUCCESS
Mar  5 20:52:34 cps1 cpimport.bin[18659]: 34.805807 |0|0|0| I 34 CAL0086: Initiating BulkLoad: -P pm1-18659 -u6b6d16e2-c008-4e73-9cd4-c49ce02f98a4 mytest lineitem /root/autopilot__1_2_3/data/source/dbt3/10g/lineitem.tbl
Mar  5 20:52:34 cps1 cpimport.bin[18659]: 34.863442 |0|0|0| I 34 CAL0081: Start BulkLoad: JobId-35369; db-mytest
Mar  5 20:52:52 cps1 ServerMonitor[8828]: 52.796726 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  80
Mar  5 20:52:52 cps1 ServerMonitor[8828]: 52.798578 |0|0|0| I 09 CAL0000: Swap  usage at percentage of  80 , Alarm set:  11
Mar  5 20:52:52 cps1 ServerMonitor[8828]: 52.798699 |0|0|0| C 09 CAL0000: Swap Space usage over Major threashold, perform OAM command  restartSystem
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.921227 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.922234 |0|0|0| D 17 CAL0000: Set System State = MAN_OFFLINE
Mar  5 20:52:52 cps1 ProcessMonitor[5616]: 52.947620 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = MAN_OFFLINE
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.953978 |0|0|0| D 17 CAL0000: pm1 module is stopped by request.
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.954062 |0|0|0| D 17 CAL0000: Set Module pm1 State = 2
Mar  5 20:52:52 cps1 ProcessMonitor[5616]: 52.954602 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_INIT
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.956541 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
Mar  5 20:52:52 cps1 ProcessMonitor[5616]: 52.956818 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request...
Mar  5 20:52:52 cps1 ProcessMonitor[5616]: 52.957017 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, STATUS_UPDATE only performed
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.957440 |0|0|0| D 17 CAL0000: pm1 module is stopped by request.
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.957507 |0|0|0| D 17 CAL0000: Set Module pm1 State = 2
Mar  5 20:52:52 cps1 ProcessMonitor[5616]: 52.958007 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_INIT
Mar  5 20:52:52 cps1 ProcessManager[5771]: 52.958053 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
Mar  5 20:52:53 cps1 ProcessMonitor[5616]: 53.957394 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request...
Mar  5 20:52:53 cps1 ProcessMonitor[5616]: 53.959396 |0|0|0| D 18 CAL0000: STOPPING Process: DMLProc
Mar  5 20:52:53 cps1 ProcessMonitor[5616]: 53.959528 |0|0|0| D 18 CAL0000: StatusUpdate of Process DMLProc State = 0 PID = 0
Mar  5 20:52:53 cps1 ProcessMonitor[5616]: 53.959680 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DMLProc
Mar  5 20:52:53 cps1 ProcessMonitor[5616]: 53.963007 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DMLProc State = MAN_OFFLINE
Mar  5 20:52:53 cps1 ProcessMonitor[5616]: 53.963055 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DMLProc State = MAN_OFFLINE PID = 0
Mar  5 20:52:53 cps1 ProcessMonitor[5616]: 53.963548 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DMLProc
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.007261 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DMLProc\*
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.007381 |0|0|0| D 18 CAL0000: STOPPING Process: DDLProc
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.007497 |0|0|0| D 18 CAL0000: StatusUpdate of Process DDLProc State = 0 PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.009627 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DDLProc
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.011274 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DDLProc State = MAN_OFFLINE
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.011325 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DDLProc State = MAN_OFFLINE PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.011544 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DDLProc
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.020483 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: DDLProc\*
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.020570 |0|0|0| D 18 CAL0000: STOPPING Process: WriteEngineServer
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.020681 |0|0|0| D 18 CAL0000: StatusUpdate of Process WriteEngineServer State = 0 PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.020832 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device WriteEngineServer
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.024301 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/WriteEngineServer State = MAN_OFFLINE
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.024385 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/WriteEngineServer State = MAN_OFFLINE PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.024625 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device WriteEngineServer
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.034639 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: WriteEngineServ\*
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.034743 |0|0|0| D 18 CAL0000: STOPPING Process: ExeMgr
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.034860 |0|0|0| D 18 CAL0000: StatusUpdate of Process ExeMgr State = 0 PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.035014 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device ExeMgr
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.037896 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/ExeMgr State = MAN_OFFLINE
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.037942 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/ExeMgr State = MAN_OFFLINE PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.039418 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device ExeMgr
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.047127 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: ExeMgr\*
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.047236 |0|0|0| D 18 CAL0000: STOPPING Process: PrimProc
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.047349 |0|0|0| D 18 CAL0000: StatusUpdate of Process PrimProc State = 0 PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.047512 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device PrimProc
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.051587 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device PrimProc
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.052857 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/PrimProc State = MAN_OFFLINE
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.052905 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/PrimProc State = MAN_OFFLINE PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.060337 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: PrimProc\*
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.060444 |0|0|0| D 18 CAL0000: STOPPING Process: DBRMWorkerNode
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.060558 |0|0|0| D 18 CAL0000: StatusUpdate of Process DBRMWorkerNode State = 0 PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.060734 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DBRMWorkerNode
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.063894 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DBRMWorkerNode State = MAN_OFFLINE
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.063973 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DBRMWorkerNode State = MAN_OFFLINE PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.064995 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DBRMWorkerNode
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.114250 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: workernode\*
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.114391 |0|0|0| D 18 CAL0000: STOPPING Process: ServerMonitor
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.114513 |0|0|0| D 18 CAL0000: StatusUpdate of Process ServerMonitor State = 0 PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.114706 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device ServerMonitor
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.118545 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device ServerMonitor
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.119126 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/ServerMonitor State = MAN_OFFLINE
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.119176 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/ServerMonitor State = MAN_OFFLINE PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.128425 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: ServerMonitor\*
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.128535 |0|0|0| D 18 CAL0000: STOPPING Process: DBRMControllerNode
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.128654 |0|0|0| D 18 CAL0000: StatusUpdate of Process DBRMControllerNode State = 0 PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.128805 |0|0|0| D 18 CAL0000: Send CLEAR Alarm ID 25 on device DBRMControllerNode
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.131962 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/DBRMControllerNode State = MAN_OFFLINE
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.132024 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/DBRMControllerNode State = MAN_OFFLINE PID = 0
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.132910 |0|0|0| D 18 CAL0000: Send SET Alarm ID 21 on device DBRMControllerNode
Mar  5 20:52:54 cps1 ProcessMonitor[5616]: 54.142770 |0|0|0| D 18 CAL0000: Pkill Process just to make sure: controllernode\*
Mar  5 20:52:54 cps1 joblist[18659]: 54.261420 |0|0|0| C 05 CAL0000: /data/buildbot/bb-worker/centos7/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 432 DEC: lost connection to 127.0.0.1
Mar  5 20:52:56 cps1 ProcessMonitor[5616]: 56.233321 |0|0|0| D 18 CAL0000: BRM reset_locks script run
Mar  5 20:52:56 cps1 ProcessMonitor[5616]: 56.234507 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/mysqld State = MAN_OFFLINE
Mar  5 20:52:56 cps1 ProcessMonitor[5616]: 56.234546 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/mysqld State = MAN_OFFLINE PID = 0
Mar  5 20:52:56 cps1 ProcessMonitor[5616]: 56.565656 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/mysqld State = ACTIVE
Mar  5 20:52:56 cps1 ProcessMonitor[5616]: 56.565710 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/mysqld State = ACTIVE PID = 8645
Mar  5 20:52:58 cps1 ProcessMonitor[5616]: 58.457269 |0|0|0| D 18 CAL0000: Stop MySQL Process
Mar  5 20:52:58 cps1 ProcessMonitor[5616]: 58.457906 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_OFFLINE
Mar  5 20:52:58 cps1 ProcessMonitor[5616]: 58.458159 |0|0|0| D 18 CAL0000: unmountExtraDBroots called
Mar  5 20:52:58 cps1 ProcessMonitor[5616]: 58.459606 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, return status = 0
Mar  5 20:52:58 cps1 ProcessManager[5771]: 58.459688 |0|0|0| D 17 CAL0000: pm1 module is successfully stopped.
Mar  5 20:52:58 cps1 ProcessManager[5771]: 58.461268 |0|0|0| D 17 CAL0000: Running reset_locks
Mar  5 20:53:00 cps1 ProcessManager[5771]: 00.503752 |0|0|0| D 17 CAL0000: Ran reset_locks
Mar  5 20:53:00 cps1 ProcessManager[5771]: 00.503807 |0|0|0| D 17 CAL0000: Running DBRM save_brm
Mar  5 20:53:00 cps1 ProcessManager[5771]: 00.629356 |0|0|0| D 17 CAL0000: Successfully ran DBRM save_brm
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.020186 |0|0|0| D 17 CAL0000: Successfully ran DBRM clearShm
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.020269 |0|0|0| I 17 CAL0000: RESTARTSYSTEM: ACK received from Process-Monitor for stopModule requests, return status = 0
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.020367 |0|0|0| D 17 CAL0000: distributeConfigFile called for system file = Columnstore.xml
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.024175 |0|0|0| D 17 CAL0000: startSystemThread launched
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.025318 |0|0|0| D 17 CAL0000: Set System State = MAN_INIT
Mar  5 20:53:01 cps1 ProcessMonitor[5616]: 01.025810 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = MAN_INIT
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.037515 |0|0|0| D 17 CAL0000: Set System State = MAN_INIT
Mar  5 20:53:01 cps1 ProcessMonitor[5616]: 01.038060 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = MAN_INIT
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.039232 |0|0|0| D 17 CAL0000: getStandbyModule called
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.040409 |0|0|0| D 17 CAL0000: updateWorkerNodeconfig Started
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.221603 |0|0|0| D 17 CAL0000: updateWorkerNodeconfig completed
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.249727 |0|0|0| D 17 CAL0000: updatePMSconfig Started
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.258934 |0|0|0| D 17 CAL0000: updatePMSconfig completed
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.260201 |0|0|0| D 17 CAL0000: setPMProcIPs called for pm1
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.263919 |0|0|0| D 17 CAL0000: setPMProcIPs for DDLProc: no update needed
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.264935 |0|0|0| D 17 CAL0000: setPMProcIPs for DMLProc: no update needed, exiting function
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.265022 |0|0|0| D 17 CAL0000: distributeConfigFile called for system file = Columnstore.xml
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.269957 |0|0|0| D 17 CAL0000: Start Module pm1
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.272772 |0|0|0| D 17 CAL0000: Set Module pm1 State = 2
Mar  5 20:53:01 cps1 ProcessMonitor[5616]: 01.273284 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Module pm1 State = MAN_INIT
Mar  5 20:53:01 cps1 ProcessManager[5771]: 01.273681 |0|0|0| D 17 CAL0000: sendMsgProcMon: Process module pm1
Mar  5 20:53:01 cps1 ProcessMonitor[5616]: 01.274635 |0|0|0| I 18 CAL0000: MSG RECEIVED: Start All process request...
Mar  5 20:53:01 cps1 ProcessMonitor[5616]: 01.675140 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/mysqld State = MAN_OFFLINE
Mar  5 20:53:01 cps1 ProcessMonitor[5616]: 01.675188 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/mysqld State = MAN_OFFLINE PID = 0
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.588103 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm1/mysqld State = ACTIVE
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.588165 |0|0|0| D 18 CAL0000: statusControl: Set Process pm1/mysqld State = ACTIVE PID = 0
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.588186 |0|0|0| D 18 CAL0000: createDataDirs called
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.659938 |0|0|0| D 18 CAL0000: checkDataMount called
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.668396 |0|0|0| D 18 CAL0000: checkSpecialProcessState on : DBRMControllerNode
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.668438 |0|0|0| D 18 CAL0000: checkSpecialProcessState status return : 2
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.668495 |0|0|0| D 18 CAL0000: STARTING Process: DBRMControllerNode
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.668510 |0|0|0| D 18 CAL0000: Process location: /usr/local/mariadb/columnstore/bin/controllernode
Mar  5 20:53:03 cps1 ProcessMonitor[5616]: 03.669090 |0|0|0| D 18 CAL0000: Dependent process of ProcessManager/pm1 is 4
Mar  5 20:53:04 cps1 cpimport.bin[18659]: 04.218878 |0|0|0| E 34 CAL0087: BulkLoad Error: updateDctnryStore: error adding rows to store file for OID-35388; DBRoot-1; part-138; seg-1;  a BRM Allocate extent error. [BRM error status: network error]
Mar  5 20:53:04 cps1 cpimport.bin[18659]: 04.218990 |0|0|0| E 34 CAL0087: BulkLoad Error: parseDict: error parsing section1:  OID-35385;  a BRM Allocate extent error.
Mar  5 20:53:04 cps1 cpimport.bin[18659]: 04.219036 |0|0|0| E 34 CAL0087: BulkLoad Error: Bulkload Parse (thread 1) Failed for Table mytest.lineitem during parsing.  Terminating this job.
Mar  5 20:53:04 cps1 controllernode[18659]: 04.219263 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Mar  5 20:53:05 cps1 controllernode[18659]: 05.219767 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Mar  5 20:53:05 cps1 writeengine[18659]: 05.232724 |0|0|0| I 19 CAL0084: ClearTableLock: Starting bulk rollback for table mytest.lineitem (OID-35369); lock-8211; initiated by cpimport.bin.
Mar  5 20:53:05 cps1 writeengine[18659]: 05.233177 |0|0|0| E 19 CAL0085: ClearTableLock: Ending bulk rollback for table mytest.lineitem (OID-35369); lock-8211; initiated by cpimport.bin. (rollback failed; Bulk rollback for table mytest.lineitem (OID-35369) not performed; BRM error getting read-write state.).
Mar  5 20:53:05 cps1 writeengine[18659]: 05.233202 |0|0|0| I 19 CAL0085: ClearTableLock: Ending bulk rollback for table mytest.lineitem (OID-35369); lock-8211; initiated by cpimport.bin. (rollback failed; Bulk rollback for table mytest.lineitem (OID-35369) not performed; BRM error getting read-write state.).
Mar  5 20:53:05 cps1 cpimport.bin[18659]: 05.233259 |0|0|0| E 34 CAL0087: BulkLoad Error: Error rolling back table mytest.lineitem; Bulk rollback for table mytest.lineitem (OID-35369) not performed; BRM error getting read-write state.
Mar  5 20:53:05 cps1 cpimport.bin[18659]: 05.233567 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-35369; status-FAILED
Mar  5 20:53:05 cps1 cpimport.bin[19467]: 05.324721 |0|0|0| I 34 CAL0086: Initiating BulkLoad: -P pm1-19467 -ufc644995-f313-45dc-b1db-abd6284386f6 mytest lineitem /root/autopilot__1_2_3/data/source/dbt3/10g/lineitem.tbl
Mar  5 20:53:05 cps1 controllernode[19467]: 05.331589 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::getSystemState() failed (network)
Mar  5 20:53:05 cps1 cpimport.bin[19467]: 05.331630 |0|0|0| E 34 CAL0087: BulkLoad Error: System is not ready.  Verify that InfiniDB is up and ready before running cpimport.
Mar  5 20:53:05 cps1 cpimport.bin[19467]: 05.331647 |0|0|0| I 34 CAL0082: End BulkLoad: JobId-0; status-FAILED



 Comments   
Comment by Zdravelina Sokolovska (Inactive) [ 2019-03-13 ]

mcs system does not recover automatically as it's expected and it's needed
recover mcs manually.
In addition mcsadmin restart utility does not work and should restart entire OS in order to recover

[root@cps ~]# mcsadmin restart
restartsystem   Wed Mar 13 10:04:02 2019
The following tables are locked:
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
Could not connect to PMS1: Connection refused
LockID Name         Process            PID   Session  CreationTime           State    DBRoots
Could not connect to PMS1: Connection refused

Comment by Todd Stoffel (Inactive) [ 2021-02-21 ]

Obsolete

Generated at Thu Feb 08 02:34:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.