  1. MariaDB ColumnStore
  2. MCOL-801

Process Managers break after some time running

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.0.9
    • Fix Version/s: 1.0.11
    • Component/s: ProcMgr
    • Labels: None

    Description

      Our MariaDB ColumnStore cluster is crashing every 4-6 hours.

      We have 3 PMs with 16 CPUs and 28 GB RAM.

      After some time running, the cluster stops working.

      How can we prevent these crashes?

      Support report (created after the crash, once the server had been restarted and was running properly):
      http://files.playax.com/problems/columnstoreSupportReport.playax-column-store.tar.gz

      We see this message in the err.log:

      Jul 4 11:10:17 column-store-pm1 PrimProc[24147]: 17.559061 |0|0|0| C 28 CAL0053: PrimProc could not open file for OID 3021; /000.dir/000.dir/011.dir/205.dir/008.dir/FILE002.cdf:No such file or directory
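
To check which column file the message refers to, the path can be decoded back into an OID, partition, and segment. The sketch below assumes the layout in which the first four `NNN.dir` levels hold the four bytes of the 32-bit OID, the fifth level is the partition number, and `FILENNN.cdf` carries the segment number. The function name `decode_segment_path` is hypothetical, not part of ColumnStore.

```python
import re

# Assumed layout: A.dir/B.dir/C.dir/D.dir are the four bytes of the OID,
# the fifth directory is the partition, FILEnnn.cdf is the segment.
PATH_RE = re.compile(
    r"(\d{3})\.dir/(\d{3})\.dir/(\d{3})\.dir/(\d{3})\.dir/"
    r"(\d{3})\.dir/FILE(\d{3})\.cdf$"
)


def decode_segment_path(path):
    """Return (oid, partition, segment) for a column segment file path."""
    match = PATH_RE.search(path)
    if match is None:
        raise ValueError(f"not a segment file path: {path!r}")
    a, b, c, d, partition, segment = (int(g) for g in match.groups())
    # Reassemble the OID from its four byte-sized directory components.
    oid = (a << 24) | (b << 16) | (c << 8) | d
    return oid, partition, segment


print(decode_segment_path(
    "/000.dir/000.dir/011.dir/205.dir/008.dir/FILE002.cdf"
))  # -> (3021, 8, 2)
```

Under that assumption the path decodes to OID 3021, which matches the OID named in the CAL0053 message, so the missing file would be segment 2 of partition 8 for that column.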

      and debug.log close to failure time:

      Jul 3 04:12:26 column-store-pm1 ProcessManager[13826]: 26.639173 |0|0|0| D 17 CAL0000: reinitProcessType: ACK received from Process-Monitor, return status = 0
      Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.877399 |0|0|0| I 09 CAL0000: Local Memory above Critical Memory threshold with a percentage of 90 ; Swap 0
      Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.895569 |0|0|0| I 09 CAL0000: Local-Memory usage at percentage of 90 , Alarm set: 7
      Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.961621 |0|0|0| I 09 CAL0000: Memory Usage for Process: workernode : Memory Used 2531 : % Used 1
      Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962205 |0|0|0| I 09 CAL0000: Memory Usage for Process: ProcMgr : Memory Used 2675 : % Used 1
      Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962522 |0|0|0| I 09 CAL0000: Memory Usage for Process: controllernode : Memory Used 3050 : % Used 1
      Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962850 |0|0|0| I 09 CAL0000: Memory Usage for Process: WriteEngineServ : Memory Used 15606 : % Used 2
      Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.963163 |0|0|0| I 09 CAL0000: Memory Usage for Process: PrimProc : Memory Used 1292214 : % Used 88
      Jul 3 04:14:36 column-store-pm1 joblist[16609]: 36.447803 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 10.240.0.26

      Jul 3 04:14:36 column-store-pm1 IDBFile[16546]: 36.598692 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

      Jul 3 04:14:37 column-store-pm1 IDBFile[16546]: 37.599740 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

      Jul 3 04:14:38 column-store-pm1 IDBFile[16546]: 38.600118 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

      Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158491 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_OFFLINE

      Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.158576 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_OFFLINE PID = 0

      Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = AUTO_INIT

      Jul 3 04:14:39 column-store-pm1 ProcessMonitor[13620]: 39.236334 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = AUTO_INIT PID = 0

      Jul 3 04:14:39 column-store-pm1 IDBFile[16546]: 39.600499 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

      Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270409 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = PID_UPDATE

      Jul 3 04:14:40 column-store-pm1 ProcessMonitor[13620]: 40.270517 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = PID_UPDATE PID = 21516

      Jul 3 04:14:40 column-store-pm1 ProcessManager[13826]: 40.272368 |0|0|0| I 17 CAL0000: MSG RECEIVED: Process Restarted on pm3/PrimProc

      Jul 3 04:14:40 column-store-pm1 IDBFile[16546]: 40.600830 |0|0|0| D 35 CAL0002: Failed to open file: /000.dir/000.dir/011.dir/205.dir/006.dir/FILE001.cdf, exception: unable to open Unbuffered file

      Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507226 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set Process pm3/PrimProc State = ACTIVE

      Jul 3 04:14:41 column-store-pm1 ProcessMonitor[13620]: 41.507315 |0|0|0| D 18 CAL0000: statusControl: Set Process pm3/PrimProc State = ACTIVE PID = 21516
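
The ServerMonitor lines above report memory per process just before the connection is lost. A minimal sketch for pulling those figures out of debug.log follows; it assumes the message format shown in the excerpt, and the helper name `memory_by_process` is hypothetical. The unit of "Memory Used" is not stated in the log, so the value is kept as a raw number.

```python
import re

# Matches the ServerMonitor per-process lines as they appear in the excerpt.
MEM_RE = re.compile(
    r"Memory Usage for Process: (\S+) : Memory Used (\d+) : % Used (\d+)"
)


def memory_by_process(lines):
    """Map process name -> (memory_used, percent_used) from log lines."""
    usage = {}
    for line in lines:
        match = MEM_RE.search(line)
        if match:
            name, used, percent = match.groups()
            usage[name] = (int(used), int(percent))
    return usage


sample = [
    "Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.962850 |0|0|0| "
    "I 09 CAL0000: Memory Usage for Process: WriteEngineServ : "
    "Memory Used 15606 : % Used 2",
    "Jul 3 04:12:55 column-store-pm1 ServerMonitor[16301]: 55.963163 |0|0|0| "
    "I 09 CAL0000: Memory Usage for Process: PrimProc : "
    "Memory Used 1292214 : % Used 88",
]

usage = memory_by_process(sample)
# Pick the process with the highest reported percentage.
top = max(usage, key=lambda name: usage[name][1])
print(top, usage[top])  # -> PrimProc (1292214, 88)
```

On the lines in this report, PrimProc accounts for 88% of memory at 04:12:55, about two minutes before the "lost connection" message and the PrimProc restart on pm3.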

          People

            Assignee: David Hill (Inactive)
            Reporter: Playax