MariaDB ColumnStore / MCOL-3905

Restoring a failed node breaks DBRMWorkerNode on load_brm

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.5
    • Component/s: oam
    • Labels:
      None
    • Sprint:
      2020-5, 2020-6, 2020-7

      Description

      The load_brm program does not use the correct path on non-primary node startup. As a result, a node that was down or out of service fails to start when ColumnStore restarts.

      Steps to reproduce:
      Start with a three-PM system, combined UM/PM.
      Take PM3 out of service with an ungraceful shutdown.
      Wait for the system to normalize.
      Bring PM3 back online.
      Errors occur when PM3 attempts to download the BRM_saves files and run load_brm, because load_brm does not look in the correct path.

      Errors in the log files look like the following:

      Mar 26 15:01:37 testPM3 ProcessMonitor[1616]: 37.822073 |0|0|0| D 18 CAL0000: BRM reset_locks script run
      Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.260824 |0|0|0| D 18 CAL0000: Clear Shared Memory script run
      Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.260944 |0|0|0| D 18 CAL0000: load_brm cmd = load_brm /var/lib/columnstore/data1/systemFiles/dbrm/0a30099b-a5ae-40d7-a7ef-420a71886490/BRM_saves > /var/log/mariadb/columnstore/load_brm.log1 2>&1
      Mar 26 15:01:38 testPM3 IDBFile[4447]: 38.307307 |0|0|0| D 35 CAL0002: Failed to open file: /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_journal, exception: unable to open Buffered file
      Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.313567 |0|0|0| E 18 CAL0000: Error return DBRM load_brm
      Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.314009 |0|0|0| D 18 CAL0000: Send SET Alarm ID 27 on device DBRM
      Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.314762 |0|0|0| D 18 CAL0000: StatusUpdate of Process DBRMWorkerNode State = 7 PID = 0
      Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.317351 |0|0|0| I 18 CAL0000: STARTALL: ACK back to ProcMgr, return status = 1
      Mar 26 15:01:39 testPM3 ServerMonitor[4420]: 39.844808 |0|0|0| I 09 CAL0000: processInitComplete Successfully Called
      Mar 26 15:01:44 testPM3 ProcessMonitor[1616]: 44.620073 |0|0|0| I 18 CAL0000: MSG RECEIVED: Update Calpont Config file
      Mar 26 15:01:44 testPM3 ProcessMonitor[1616]: 44.620501 |0|0|0| I 18 CAL0000: UPDATECONFIGFILE: Completed
      [root@testPM3 ~]# cat /var/log/mariadb/columnstore/load_brm.log1 
      Error opening journal file /var/lib/columnstore/data1/systemFiles/dbrm/0a30099b-a5ae-40d7-a7ef-420a71886490/BRM_saves_journal
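
The two paths in the log above differ: ProcessMonitor passes load_brm a prefix under a UUID subdirectory, while IDBFile reports a failure on a journal path without that subdirectory. As a rough triage aid, the Python sketch below scans log lines for that pattern. The function name `find_path_mismatches` and both regular expressions are illustrative assumptions derived only from the message formats shown in this report; they are not part of ColumnStore.

```python
import re

# Patterns assumed from the ProcessMonitor and IDBFile messages quoted in this report.
LOAD_CMD = re.compile(r"load_brm cmd = load_brm (\S+)")
OPEN_FAIL = re.compile(r"Failed to open file: (\S+?),")


def find_path_mismatches(log_lines):
    """Return (prefix, failed_path) pairs where a file that failed to open
    does not live under the prefix most recently passed to load_brm."""
    prefix = None
    mismatches = []
    for line in log_lines:
        cmd = LOAD_CMD.search(line)
        if cmd:
            prefix = cmd.group(1)  # remember the latest load_brm prefix
            continue
        fail = OPEN_FAIL.search(line)
        if fail and prefix is not None and not fail.group(1).startswith(prefix):
            mismatches.append((prefix, fail.group(1)))
    return mismatches


# Two lines copied from the log excerpt in this report.
sample_lines = [
    "Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.260944 |0|0|0| D 18 CAL0000: "
    "load_brm cmd = load_brm /var/lib/columnstore/data1/systemFiles/dbrm/"
    "0a30099b-a5ae-40d7-a7ef-420a71886490/BRM_saves > "
    "/var/log/mariadb/columnstore/load_brm.log1 2>&1",
    "Mar 26 15:01:38 testPM3 IDBFile[4447]: 38.307307 |0|0|0| D 35 CAL0002: "
    "Failed to open file: /var/lib/columnstore/data1/systemFiles/dbrm/"
    "BRM_saves_journal, exception: unable to open Buffered file",
]

if __name__ == "__main__":
    for prefix, failed in find_path_mismatches(sample_lines):
        print("load_brm prefix:", prefix)
        print("failed path:    ", failed)
```

A non-empty result only indicates that the two messages disagree about the path; it does not by itself confirm this bug.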
      

      To recover, run the following on PM1:

      mcsadmin alterSystem-enableModule pm3
      mcsadmin restartsystem y
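
If the recovery needs to be scripted, the two commands above can be run in order, stopping at the first failure. The following is a minimal Python sketch, not an official tool: the helper names `recovery_commands` and `recover` are hypothetical, and it assumes `mcsadmin` is on the PATH of PM1 and exits non-zero on failure.

```python
import subprocess


def recovery_commands(module):
    """Build the two mcsadmin invocations from this report for the given module."""
    return [
        ["mcsadmin", "alterSystem-enableModule", module],
        ["mcsadmin", "restartsystem", "y"],
    ]


def recover(module, run=subprocess.run):
    """Run the recovery commands in order.

    With the default runner, check=True raises CalledProcessError on a
    non-zero exit, so the restart is skipped if enabling the module fails.
    The `run` parameter exists only so the sequence can be dry-run or tested.
    """
    for cmd in recovery_commands(module):
        run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    recover("pm3", run=lambda cmd, check: print(" ".join(cmd)))
```

Calling `recover("pm3")` without the `run` argument would execute the commands for real, so it should only be done on PM1.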
      

      This issue is related to the GlusterFS failover failures observed in 1.4; see MCOL-3842.

            People

            Assignee:
            dleeyh Daniel Lee
            Reporter:
            ben.thompson Ben Thompson
