[MCOL-1610] DataRedundancy failover recovery leaves dbroot in limbo if gluster mount fails Created: 2018-07-30  Updated: 2023-10-26  Resolved: 2019-01-21

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.1.5
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Ben Thompson (Inactive) Assignee: Ben Thompson (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sprint: 2018-15, 2018-16, 2018-17, 2018-18, 2018-19, 2018-20, 2018-21, 2019-01

 Description   

Example scenario similar to customer observed issue:

4PM / 1UM system with data redundancy

PM2 fails and dbroot2 is moved to PM3

mcsadmin getstorageconfig
getstorageconfig Fri Jul 27 09:31:33 2018

System Storage Configuration

Performance Module (DBRoot) Storage Type = DataRedundancy
System Assigned DBRoot Count = 4
DBRoot IDs assigned to 'pm1' = 1
DBRoot IDs assigned to 'pm2' =
DBRoot IDs assigned to 'pm3' = 2, 3
DBRoot IDs assigned to 'pm4' = 4

PM2 reconnects to PM1, and during the failover recovery of dbroot2 from PM3 back to PM2, the mount of gluster/dbroot2 onto data2 on PM2 fails in some way.

This leaves dbroot2 mounted on neither PM3 nor PM2 when DBRM attempts a reload/resume, while the system still expects it to be mounted on PM3.

The fix needs to modify the recovery procedure so that the dbroot is always left mounted somewhere when a step in the recovery process fails.

A simple way to reproduce the behavior is to disconnect PM2 and break the file permissions on the glusterfs mount for data2 so that the mount fails on recovery.
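The failure mode and the intended invariant can be sketched as a toy model (a sketch only; `Cluster`, `move_dbroot_naive`, `move_dbroot_fixed`, and the boolean stand-in for the gluster mount are illustrative, not ColumnStore code):

```python
class Cluster:
    """Toy model of dbroot-to-PM mounting during failover recovery."""

    def __init__(self):
        # dbroot2 has already failed over from pm2 to pm3
        self.mounted_on = {"dbroot2": "pm3"}

    def move_dbroot_naive(self, dbroot, target, mount_succeeds):
        """1.1.5 behavior: unmount from the source, then mount on the
        target. A mount failure strands the dbroot, mounted nowhere,
        while DBRM still believes the old assignment is valid."""
        self.mounted_on[dbroot] = None          # unmounted from pm3
        if mount_succeeds:
            self.mounted_on[dbroot] = target    # mounted on pm2
        # on failure: no fallback, dbroot is left mounted nowhere

    def move_dbroot_fixed(self, dbroot, target, mount_succeeds):
        """Required invariant: remember the source and fall back to it,
        so the dbroot is always left mounted somewhere."""
        source = self.mounted_on[dbroot]
        self.mounted_on[dbroot] = target if mount_succeeds else source
```

A real fix must also handle the case where the unmount from the source has already happened before the failed mount attempt, by remounting on the source.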



 Comments   
Comment by Ben Thompson (Inactive) [ 2018-08-08 ]

A simple way to force the issue on a 4 PM / 1 UM setup is to follow these steps:

  1. Disconnect pm2 once the system has recovered and dbroot2 has been assigned to pm3.
  2. Log in to pm2 directly, umount data2, and delete the data2 directory (data2 is empty because the gluster brick is not mounted).
  3. Run a few cpimport commands successfully.
  4. Reconnect the network for pm2. The attempt to move dbroot2 back to pm2 will fail because the mount point does not exist. On a 1.1.5 system, dbroot2 will not be connected to any PM, yet when the system resumes it will assume it still has access to dbroot2.

The fix is that on this failure dbroot2 is reconnected to pm3, and pm2 is placed in a disabled state with no dbroots assigned to it. This ensures the system returns in a usable state. To manually recover from an issue like this, the user needs to follow this procedure:

  1. Determine the cause of the mount failure on pm2 and resolve it (in this case, mkdir data2).
  2. mcsadmin alterSystem-enableModule pm2
  3. mcsadmin shutdownSystem
  4. mcsadmin movePmDbrootConfig pm3 2 pm2
  5. mcsadmin startSystem
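The fixed recovery behavior described above can be sketched as state-transition logic (names are hypothetical; the real recovery runs through ProcMgr and mcsadmin, not this code):

```python
def recover_dbroot(assignment, disabled, dbroot, target, mount_succeeds):
    """Sketch of the fixed failover recovery.

    On mount failure the dbroot stays attached to its current PM and the
    recovering module is disabled with no dbroots assigned to it, so the
    system comes back in a usable state instead of losing the dbroot.
    """
    source = assignment[dbroot]
    if mount_succeeds:
        assignment[dbroot] = target         # normal move back to pm2
    else:
        assignment[dbroot] = source         # dbroot2 stays on pm3
        disabled.add(target)                # pm2 disabled, no dbroots
    return assignment, disabled
```

In this model, the manual procedure above (enableModule, then movePmDbrootConfig while the system is shut down) corresponds to removing the module from the disabled set and reassigning the dbroot by hand.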
Comment by Daniel Lee (Inactive) [ 2018-08-22 ]

Trying to reproduce the issue in 1.1.5-1.

Comment by Daniel Lee (Inactive) [ 2018-08-22 ]

It looks like I closed the ticket by accident. Reopening it.

Generated at Thu Feb 08 02:30:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.