[MCOL-3945] load_brm will hang on dbroot1 failover Created: 2020-04-14  Updated: 2023-10-26  Resolved: 2020-06-22

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: None
Fix Version/s: 1.2.6, 1.4.4

Type: Bug Priority: Critical
Reporter: Ben Thompson (Inactive) Assignee: Ben Thompson (Inactive)
Resolution: Fixed Votes: 0
Labels: None


 Description   

On failover, saveBRM runs before the dbroot is exchanged. When the OAM parent module fails, saveBRM can therefore run before the brm_saves_journal file exists on the new primary module, which can cause load_brm to hang.

To reproduce, set up a multi-node glusterfs installation and perform a large table import. After the import completes, kill PM1 and wait for PM2 to take over the primary role. The logging shows the save_brm command run first, then dbroot1 moved to PM2, and then load_brm called.

The fix is to move dbroot1 first and then run saveBRM; this should allow load_brm to run successfully.
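
The ordering problem can be illustrated with a small model. This is only a sketch and not ColumnStore code: `Module`, `save_brm`, and `failover` are hypothetical names, and it assumes that saveBRM can only succeed once the new primary owns dbroot1 (where the journal file is taken to live) and that load_brm depends on that save having succeeded. A real hang is represented here simply as `False`.

```python
from dataclasses import dataclass, field


@dataclass
class Module:
    """Hypothetical stand-in for a PM node; not a ColumnStore type."""
    name: str
    dbroots: set = field(default_factory=set)


def save_brm(primary: Module) -> bool:
    # Assumption: the save only works if the primary already owns dbroot1,
    # because the journal file is taken to live on that dbroot.
    return 1 in primary.dbroots


def failover(old: Module, new: Module, move_dbroot_first: bool) -> list:
    """Run the failover steps in the chosen order and return (step, ok) pairs."""
    log = []

    def move_dbroot1():
        old.dbroots.discard(1)
        new.dbroots.add(1)
        log.append(("move_dbroot1", True))

    def run_save_brm():
        log.append(("save_brm", save_brm(new)))

    if move_dbroot_first:
        steps = [move_dbroot1, run_save_brm]   # proposed order
    else:
        steps = [run_save_brm, move_dbroot1]   # order seen in the logs

    for step in steps:
        step()

    # Assumption: load_brm succeeds only if the preceding save succeeded.
    log.append(("load_brm", dict(log)["save_brm"]))
    return log
```

Calling `failover(Module("PM1", {1}), Module("PM2", {2}), move_dbroot_first=True)` models the proposed order; passing `False` models the order described in the reproduction steps, where the save runs before PM2 owns dbroot1.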



 Comments   
Comment by Patrick LeBlanc (Inactive) [ 2020-04-14 ]

Looks ok. This will need to get into develop, and develop-1.{2,4} also.

Comment by Ben Thompson (Inactive) [ 2020-05-27 ]

Part of this fix was reverted with other failover changes in MCOL-3842. All of this was merged into 1.2.6 and 1.4.4 and will have been retested under MCOL-3842. Moving to test for 1.2.6 if necessary.

Comment by Ben Thompson (Inactive) [ 2020-05-27 ]

This was all merged in 1.2.6 with MCOL-3842. Retesting of that MCOL in 1.2 should be sufficient for closing this, if already completed.

Comment by Daniel Lee (Inactive) [ 2020-06-01 ]

Build tested: 1.4.4-1 (Jenkins 20200601)

Failover (PM1 to PM2) after a 10g lineitem import worked fine.

According to the debug.log on PM2, save_brm is still being executed first, then dbroot moved.

Generated at Thu Feb 08 02:46:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.