[MCOL-801] Process Managers break after some time running Created: 2017-07-04  Updated: 2017-11-27  Resolved: 2017-11-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: ProcMgr
Affects Version/s: 1.0.9
Fix Version/s: 1.0.11

Type: Bug Priority: Major
Reporter: Playax Assignee: David Hill (Inactive)
Resolution: Duplicate Votes: 0
Labels: None


 Description   

Our MariaDB ColumnStore cluster is crashing every 4-6 hours.

We have 3 PMs with 16 CPUs and 28 GB RAM.

After some time running, the cluster stops working.

How can we prevent it from crashing?

Support Report here (created after the crash, once the server had restarted and was running properly):
http://files.playax.com/problems/columnstoreSupportReport.playax-column-store.tar.gz

We see these messages in err.log:

Jul 4 11:10:17 column-store-pm1 PrimProc[24147]: 17.559061 |0|0|0| C 28 CAL0053: PrimProc could not open file for OID 3021; /000.dir/000.dir/011.dir/205.dir/008.dir/FILE002.cdf:No such file or directory

and debug.log close to failure time:

Jul 3 04:12:26 column-store-pm1 ProcessManager[13826]: 26.639173 |0|0|0| D 17 CAL0000: reinitProcessType: ACK received from Process-Monitor, return status = 0
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.877399 |0|0|0| I 09 CAL0000: Local Memory above Critical Memory threshold with a percentage of 90 ; Swap 0
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.895569 |0|0|0| I 09 CAL0000: Local-Memory usage at percentage of 90 , Alarm set: 7
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.961621 |0|0|0| I 09 CAL0000: Memory Usage for Process: workernode : Memory Used 2531 : % Used 1
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962205 |0|0|0| I 09 CAL0000: Memory Usage for Process: ProcMgr : Memory Used 2675 : % Used 1
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962522 |0|0|0| I 09 CAL0000: Memory Usage for Process: controllernode : Memory Used 3050 : % Used 1
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962850 |0|0|0| I 09 CAL0000: Memory Usage for Process: WriteEngineServ : Memory Used 15606 : % Used 2
Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.963163 |0|0|0| I 09 CAL0000: Memory Usage for Process: PrimProc : Memory Used 1292214 : % Used 88
Jul 3 04:14:36 column-store-pm1 joblist[16609]: 36.447803 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 10.240.0.26

Jul 3 04:14:36 column-store-pm1 IDBFile[16546]: 36.598692 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

Jul 3 04:14:37 column-store-pm1 IDBFile[16546]: 37.599740 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

Jul 3 04:14:38 column-store-pm1 IDBFile[16546]: 38.600118 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158491 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_OFFLINE

Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158576 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_OFFLINE PID = 0

Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_INIT

Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236334 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_INIT PID = 0

Jul 3 04:14:39 column-store-pm1 IDBFile[16546]: 39.600499 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270409 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = PID_UPDATE

Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270517 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = PID_UPDATE PID = 21516

Jul 3 04:14:40 column-store-pm1 ProcessManager[13826]: 40.272368 |0|0|0| I 17 CAL0000: MSG RECEIVED: Process Restarted on pm3/PrimProc

Jul 3 04:14:40 column-store-pm1 IDBFile[16546]: 40.600830 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = ACTIVE

Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507315 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = ACTIVE PID = 21516




 Comments   
Comment by Playax [ 2017-07-04 ]

I've noticed that the PM servers' memory usage increases over time, and after some hours of operation they crash. Could it be a memory leak?
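To confirm a growth trend like this, the ServerMonitor "Memory Usage" lines already in info.log can be parsed over time. Below is a minimal sketch (the `parse_memory_lines` helper and the log path are assumptions, not part of ColumnStore) that extracts per-process memory figures from lines in the format shown in the logs above:

```python
import re

# Hypothetical helper: extract per-process memory figures from ServerMonitor
# lines such as those logged to info.log / debug.log, e.g.:
#   "... CAL0000: Memory Usage for Process: PrimProc : Memory Used 1292214 : % Used 88"
LINE_RE = re.compile(
    r"Memory Usage for Process: (?P<proc>\S+) : "
    r"Memory Used (?P<kb>\d+) : % Used (?P<pct>\d+)"
)

def parse_memory_lines(lines):
    """Yield (process_name, memory_used, percent_used) for each matching line."""
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            yield m.group("proc"), int(m.group("kb")), int(m.group("pct"))

sample = [
    "Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.963163 |0|0|0| I 09 "
    "CAL0000: Memory Usage for Process: PrimProc : Memory Used 1292214 : % Used 88",
]
for proc, used, pct in parse_memory_lines(sample):
    print(proc, used, pct)
```

Running this over several hours of info.log and plotting PrimProc's `% Used` column would show whether its footprint climbs steadily toward the 90% critical threshold before each crash.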

Comment by David Hill (Inactive) [ 2017-07-05 ]

Seeing a number of logs showing that PrimProc's memory usage is way over its NumBlocksPct = 70% setting. Also seeing that you don't have swap space set up on the servers. So the system is running out of memory, and with no swap, I believe this is what causes PrimProc to crash.

pm3_logs/log/mariadb/columnstore/info.log:Jul 4 15:05:10 column-store-pm3 ServerMonitor[11300]: 10.909398 |0|0|0| I 09 CAL0000: Memory Usage for Process: PrimProc : Memory Used 2767966 : % Used 97

info.log-20170703:Jul 3 02:50:58 column-store-pm3 ServerMonitor[25548]: 58.549829 |0|0|0| I 09 CAL0000: Local Memory above Critical Memory threshold with a percentage of 90 ; Swap 0

You could lower the NumBlocksPct setting from 70 to 50 to see if it stabilizes. This would be changed in the ColumnStore.xml on pm1, and a restartSystem would need to be performed.
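For illustration, a sketch of what that change might look like, assuming the 1.0.x layout where NumBlocksPct lives in the DBBC section of ColumnStore.xml (the path and surrounding structure may differ on your install):

```xml
<!-- e.g. /usr/local/mariadb/columnstore/etc/Columnstore.xml on pm1 (path is an assumption) -->
<DBBC>
  <!-- lowered from 70 to cap PrimProc's block cache at 50% of RAM -->
  <NumBlocksPct>50</NumBlocksPct>
</DBBC>
```

After editing, the system would be restarted from mcsadmin (`mcsadmin restartSystem`) so that PrimProc picks up the new cache limit.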

From the logs, it looks like you are mostly running queries and not a lot of cpimports... I don't know if there is anything unusual about your DB or queries.

Comment by Playax [ 2017-07-10 ]

I have created 4GB of swap. Let's see what happens...
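For reference, a typical way to add a 4 GB swap file on Linux looks like the following. This is a generic sketch, not taken from the ticket; the file path is an arbitrary choice and the commands need root:

```shell
# Create and enable a 4 GB swap file (run as root; /swapfile is an arbitrary path)
fallocate -l 4G /swapfile        # or: dd if=/dev/zero of=/swapfile bs=1M count=4096
chmod 600 /swapfile              # swap files must not be world-readable
mkswap /swapfile
swapon /swapfile
swapon --show                    # verify the new swap area is active
# To persist across reboots, add to /etc/fstab:
#   /swapfile none swap sw 0 0
```

Swap only delays an out-of-memory condition, so this complements rather than replaces lowering NumBlocksPct.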

Comment by David Thompson (Inactive) [ 2017-11-27 ]

I believe this was resolved under another bug for 1.0.11. You may still see less frequent crashes, but I'm not sure we have anything further actionable at this time.

Generated at Thu Feb 08 02:23:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.