[MCOL-801] Process Managers break after some time running Created: 2017-07-04 Updated: 2017-11-27 Resolved: 2017-11-27 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ProcMgr |
| Affects Version/s: | 1.0.9 |
| Fix Version/s: | 1.0.11 |
| Type: | Bug | Priority: | Major |
| Reporter: | Playax | Assignee: | David Hill (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Description |
|
Our Mariadb Columnstore cluster is crashing every 4-6 hours. We have 3 PMs with 16CPUs and 28GB RAM. After some time running, the cluster stops working. How can we avoid it to crash? Suport Report here (created after the crash, after the server restart and running properly) We see this messages in the err.log: Jul 4 11:10:17 column-store-pm1 PrimProc[24147]: 17.559061 |0|0|0| C 28 CAL0053: PrimProc could not open file for OID 3021; /000.dir/000.dir/011.dir/205.dir/008.dir/FILE002.cdf:No such file or directory and debug.log close to failure time: Jul 3 04:12:26 column-store-pm1 ProcessManager[13826]: 26.639173 |0|0|0| D 17 CAL0000: reinitProcessType: ACK received from Process-Monitor, return status = 0 Jul 3 04:14:36 column-store-pm1 IDBFile[16546]: 36.598692 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file Jul 3 04:14:37 column-store-pm1 IDBFile[16546]: 37.599740 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file Jul 3 04:14:38 column-store-pm1 IDBFile[16546]: 38.600118 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158491 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_OFFLINE Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158576 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_OFFLINE PID = 0 Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_INIT Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236334 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_INIT PID = 0 Jul 3 04:14:39 column-store-pm1 IDBFile[16546]: 39.600499 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270409 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = PID_UPDATE Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270517 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = PID_UPDATE PID = 21516 Jul 3 04:14:40 column-store-pm1 ProcessManager[13826]: 40.272368 |0|0|0| I 17 CAL0000: MSG RECEIVED: Process Restarted on pm3/PrimProc Jul 3 04:14:40 column-store-pm1 IDBFile[16546]: 40.600830 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = ACTIVE Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507315 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = ACTIVE PID = 21516 D |
| Comments |
| Comment by Playax [ 2017-07-04 ] |
|
I've noticed that PM servers memory usage increases over time, and after some hours of operation they crash. Could it be a memory leak? |
| Comment by David Hill (Inactive) [ 2017-07-05 ] |
|
Seeing a number of logs showing the PrimProc is over using the memory way over its setting NumBlocksPct = 70%. Also seeing where you don't have swap space setup on the servers.. So there is an issue of running out of memory and with no swap, believe this is causing the PrimProc to crash. pm3_logs/log/mariadb/columnstore/info.log:Jul 4 15:05:10 column-store-pm3 ServerMonitor[11300]: 10.909398 |0|0|0| I 09 CAL0000: Memory Usage for Process: PrimProc : Memory Used 2767966 : % Used 97 info.log-20170703:Jul 3 02:50:58 column-store-pm3 ServerMonitor[25548]: 58.549829 |0|0|0| I 09 CAL0000: Local Memory above Critical Memory threshold with a percentage of 90 ; Swap 0 You could lower this setting from 70 to 50 NumBlocksPct to see if it stabilizes.This would be changed in pm1 ColumnStore.xml and a restartsystem would need to be performed. from the logs, looks like you are doing queries and not a lot of cpimports... I don't know if there is anything unusual about your DB or queries. |
| Comment by Playax [ 2017-07-10 ] |
|
I have created 4GB of swap. Let's see what happens... |
| Comment by David Thompson (Inactive) [ 2017-11-27 ] |
|
I believe that this was resolved under another bug for 1.0.11 though i think you may still have less frequent crashes however i'm not sure if we have anything further actionable from this time. |