[MCOL-3905] restoring a failed node breaks DBRMWorkerNode on load_brm Created: 2020-03-26  Updated: 2023-10-26  Resolved: 2020-08-26

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.4.0
Fix Version/s: 1.4.5

Type: Bug
Priority: Major
Reporter: Ben Thompson (Inactive)
Assignee: Daniel Lee (Inactive)
Resolution: Fixed
Votes: 0
Labels: None

Sprint: 2020-5, 2020-6, 2020-7

 Description   

The load_brm program does not use the correct path when a non-primary node starts up. As a result, a node that was down or out of service fails to start when ColumnStore restarts.

Example to reproduce:
3pm combined UM/PM
Take PM3 out of service with ungraceful shutdown.
Wait for system to normalize.
Bring PM3 back online.
Errors will occur when PM3 attempts to download the BRM_saves files and run load_brm, because load_brm is not looking in the correct path.

The errors in the log files look like the following:

Mar 26 15:01:37 testPM3 ProcessMonitor[1616]: 37.822073 |0|0|0| D 18 CAL0000: BRM reset_locks script run
Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.260824 |0|0|0| D 18 CAL0000: Clear Shared Memory script run
Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.260944 |0|0|0| D 18 CAL0000: load_brm cmd = load_brm /var/lib/columnstore/data1/systemFiles/dbrm/0a30099b-a5ae-40d7-a7ef-420a71886490/BRM_saves > /var/log/mariadb/columnstore/load_brm.log1 2>&1
Mar 26 15:01:38 testPM3 IDBFile[4447]: 38.307307 |0|0|0| D 35 CAL0002: Failed to open file: /var/lib/columnstore/data1/systemFiles/dbrm/BRM_saves_journal, exception: unable to open Buffered file
Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.313567 |0|0|0| E 18 CAL0000: Error return DBRM load_brm
Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.314009 |0|0|0| D 18 CAL0000: Send SET Alarm ID 27 on device DBRM
Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.314762 |0|0|0| D 18 CAL0000: StatusUpdate of Process DBRMWorkerNode State = 7 PID = 0
Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.317351 |0|0|0| I 18 CAL0000: STARTALL: ACK back to ProcMgr, return status = 1
Mar 26 15:01:39 testPM3 ServerMonitor[4420]: 39.844808 |0|0|0| I 09 CAL0000: processInitComplete Successfully Called
Mar 26 15:01:44 testPM3 ProcessMonitor[1616]: 44.620073 |0|0|0| I 18 CAL0000: MSG RECEIVED: Update Calpont Config file
Mar 26 15:01:44 testPM3 ProcessMonitor[1616]: 44.620501 |0|0|0| I 18 CAL0000: UPDATECONFIGFILE: Completed
[root@testPM3 ~]# cat /var/log/mariadb/columnstore/load_brm.log1 
Error opening journal file /var/lib/columnstore/data1/systemFiles/dbrm/0a30099b-a5ae-40d7-a7ef-420a71886490/BRM_saves_journal
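
The path mismatch in the excerpt above can be spotted mechanically. The following is a minimal sketch, not part of ColumnStore: the function name brm_path_mismatch and both regular expressions are assumptions written against the two log line shapes quoted in this report. It only reports that the directory passed to load_brm differs from the directory where the journal open failed; it does not decide which one is correct.

```python
import re
from pathlib import PurePosixPath

# Hypothetical patterns, written only against the log lines quoted in this report.
LOAD_CMD = re.compile(r"load_brm cmd = load_brm (\S+)")
OPEN_FAIL = re.compile(r"Failed to open file: (\S+?),")


def brm_path_mismatch(log_lines):
    """Return (requested_dir, attempted_dir) if they differ, else None."""
    requested = attempted = None
    for line in log_lines:
        m = LOAD_CMD.search(line)
        if m:
            # load_brm is given a prefix such as .../BRM_saves; keep its directory.
            requested = PurePosixPath(m.group(1)).parent
        m = OPEN_FAIL.search(line)
        if m:
            # Directory of the file that IDBFile reported it could not open.
            attempted = PurePosixPath(m.group(1)).parent
    if requested and attempted and requested != attempted:
        return str(requested), str(attempted)
    return None


# Two lines copied from the excerpt in this report.
SAMPLE = [
    "Mar 26 15:01:38 testPM3 ProcessMonitor[1616]: 38.260944 |0|0|0| D 18 CAL0000: "
    "load_brm cmd = load_brm /var/lib/columnstore/data1/systemFiles/dbrm/"
    "0a30099b-a5ae-40d7-a7ef-420a71886490/BRM_saves > "
    "/var/log/mariadb/columnstore/load_brm.log1 2>&1",
    "Mar 26 15:01:38 testPM3 IDBFile[4447]: 38.307307 |0|0|0| D 35 CAL0002: "
    "Failed to open file: /var/lib/columnstore/data1/systemFiles/dbrm/"
    "BRM_saves_journal, exception: unable to open Buffered file",
]

if __name__ == "__main__":
    print(brm_path_mismatch(SAMPLE))
```

Run against the two sample lines, the sketch returns the UUID-suffixed dbrm directory from the load_brm command and the plain dbrm directory from the failed open, which is the discrepancy this issue describes.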

To recover, run the following on PM1:

mcsadmin alterSystem-enableModule pm3
mcsadmin restartsystem y
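
If the recovery has to be repeated across test runs, the two commands above can be wrapped in a small script. This is only a sketch under assumptions: the helper names recovery_commands and run_recovery are hypothetical, the module name is a parameter defaulting to the pm3 used in this report, and the script defaults to a dry run that prints the commands instead of executing them. It is meant to be run on PM1, as stated above.

```python
import subprocess


def recovery_commands(module="pm3"):
    """Build the two mcsadmin invocations quoted in this report."""
    return [
        ["mcsadmin", "alterSystem-enableModule", module],
        ["mcsadmin", "restartsystem", "y"],
    ]


def run_recovery(module="pm3", dry_run=True):
    """Print the commands (dry run) or execute them in order, stopping on failure."""
    for cmd in recovery_commands(module):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if mcsadmin exits non-zero.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_recovery("pm3", dry_run=True)
```

Keeping the dry run as the default avoids restarting the whole system by accident; pass dry_run=False only once the printed commands look right.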

This issue is related to the glusterfs failover failures observed in 1.4; see MCOL-3842.



 Comments   
Comment by Daniel Lee (Inactive) [ 2020-08-26 ]

Build verified: 1.4.5-1 (drone b452)

I tried to reproduce the issue in 1.4.0-1 and 1.4.2-1. Both versions seem to have issues setting up the 3pm stack.

Generated at Thu Feb 08 02:46:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.