Details
-
Bug
-
Status: Closed (View Workflow)
-
Major
-
Resolution: Duplicate
-
1.0.9
-
None
Description
Our Mariadb Columnstore cluster is crashing every 4-6 hours.
We have 3 PMs with 16CPUs and 28GB RAM.
After some time running, the cluster stops working.
How can we avoid it to crash?
Suport Report here (created after the crash, after the server restart and running properly)
http://files.playax.com/problems/columnstoreSupportReport.playax-column-store.tar.gz
We see this messages in the err.log:
Jul 4 11:10:17 column-store-pm1 PrimProc[24147]: 17.559061 |0|0|0| C 28 CAL0053: PrimProc could not open file for OID 3021; /000.dir/000.dir/011.dir/205.dir/008.dir/FILE002.cdf:No such file or directory
and debug.log close to failure time:
Jul 3 04:12:26 column-store-pm1 ProcessManager[13826]: 26.639173 |0|0|0| D 17 CAL0000: reinitProcessType: ACK received from Process-Monitor, return status = 0
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.877399 |0|0|0| I 09 CAL0000: Local Memory above Critical Memory threshold with a percentage of 90 ; Swap 0
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.895569 |0|0|0| I 09 CAL0000: Local-Memory usage at percentage of 90 , Alarm set: 7
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.961621 |0|0|0| I 09 CAL0000: Memory Usage for Process: workernode : Memory Used 2531 : % Used 1
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962205 |0|0|0| I 09 CAL0000: Memory Usage for Process: ProcMgr : Memory Used 2675 : % Used 1
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962522 |0|0|0| I 09 CAL0000: Memory Usage for Process: controllernode : Memory Used 3050 : % Used 1
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962850 |0|0|0| I 09 CAL0000: Memory Usage for Process: WriteEngineServ : Memory Used 15606 : % Used 2
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.963163 |0|0|0| I 09 CAL0000: Memory Usage for Process: PrimProc : Memory Used 1292214 : % Used 88
Jul 3 04:14:36 column-store-pm1 joblist[16609]: 36.447803 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 10.240.0.26
Jul 3 04:14:36 column-store-pm1 IDBFile[16546]: 36.598692 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file
Jul 3 04:14:37 column-store-pm1 IDBFile[16546]: 37.599740 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file
Jul 3 04:14:38 column-store-pm1 IDBFile[16546]: 38.600118 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file
Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158491 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_OFFLINE
Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158576 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_OFFLINE PID = 0
Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_INIT
Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236334 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_INIT PID = 0
Jul 3 04:14:39 column-store-pm1 IDBFile[16546]: 39.600499 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file
Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270409 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = PID_UPDATE
Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270517 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = PID_UPDATE PID = 21516
Jul 3 04:14:40 column-store-pm1 ProcessManager[13826]: 40.272368 |0|0|0| I 17 CAL0000: MSG RECEIVED: Process Restarted on pm3/PrimProc
Jul 3 04:14:40 column-store-pm1 IDBFile[16546]: 40.600830 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file
Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = ACTIVE
Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507315 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = ACTIVE PID = 21516
D