[MCOL-916] Gluster failover: Stack did not recover completely after PM1 reboot Created: 2017-09-13 Updated: 2023-10-26 Resolved: 2017-10-27 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ? |
| Affects Version/s: | 1.1.0 |
| Fix Version/s: | 1.1.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Daniel Lee (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Sprint: | 2017-18, 2017-19, 2017-20, 2017-21 |
| Description |
|
Build tested: 1.1.0-1 beta
[root@localhost columnstore]# cat crit.log
But I checked all 8 nodes and found that the procmons are all running. Maybe at one point a procmon was not running.
Tried shutdownsystem from PM2 (the active PM after failover). The command failed:
Aug 29 17:00:50 localhost controllernode[11150]: 50.867853 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::clearSystemState() failed (network) |
| Comments |
| Comment by David Thompson (Inactive) [ 2017-09-13 ] |
|
This should be addressed by the related MCOLs from David Hill. |
| Comment by Daniel Lee (Inactive) [ 2017-09-14 ] |
|
Build tested: 1.1.0-1
With a 1um2pm cluster stack: rebooted pm2 = ok
I will try a 1um4pm stack next. |
| Comment by Daniel Lee (Inactive) [ 2017-09-15 ] |
|
Tested with 1um4pm, same behavior. Thinking that, with VMs, PM1 recovered "too" fast and PM2 did not have time for failover to kick in, I suspended PM1 for a few minutes and then resumed it. PM1 recovered, but only PM1 could run getprocessstatus, not any of the other nodes. Stopsystem also failed. |
| Comment by David Hill (Inactive) [ 2017-09-18 ] |
|
On an Ubuntu 16 1um 2pm root gluster install, I stopped the pm1 instance (didn't reboot). The failover is in a good state as far as the status of the system processes and storage. There was an issue in the rollback: DMLProc was in BUSY_INIT for a long time, and based on the logs it looks like rollback errors were hit.
------------------------------------------------------------
Component Status Last Status Change
Module um1 ACTIVE Mon Sep 18 19:44:18 2017
Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
ProcessMonitor pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
ProcessMonitor pm2 ACTIVE Mon Sep 18 19:43:41 2017 1371
Active Alarm Counts: Critical = 0, Major = 1, Minor = 0, Warning = 0, Info = 0
System Storage Configuration
Performance Module (DBRoot) Storage Type = DataRedundancy
Data Redundant Configuration
Copies Per DBroot = 2
Sep 18 19:51:44 ubuntu16-um1 DMLProc[4086]: 44.573121 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll. |
| Comment by David Hill (Inactive) [ 2017-09-18 ] |
|
pm1 came back into the system after I started it...
System and Module statuses
Component Status Last Status Change
Module um1 ACTIVE Mon Sep 18 19:44:18 2017
Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
ProcessMonitor pm1 ACTIVE Mon Sep 18 19:59:28 2017 1415
ProcessMonitor pm2 ACTIVE Mon Sep 18 19:43:41 2017 1371
Active Alarm Counts: Critical = 0, Major = 0, Minor = 0, Warning = 0, Info = 0
System Storage Configuration
Performance Module (DBRoot) Storage Type = DataRedundancy |
| Comment by Daniel Lee (Inactive) [ 2017-09-18 ] |
|
Additional info on the failover issue
After suspending PM1 (in Vagrant), the active OAM module failed over to PM2. Now PM2 has both dbroot1 and dbroot2 assigned. PM3 became the HOTSTANDBY module. After resuming PM1, PM1 became disabled because dbroot2 could not be assigned to PM1 since PM2 never had a copy of dbroot 2.
[root@localhost columnstore]# ma
MariaDB ColumnStore Admin Console
Active Alarm Counts: Critical = 1, Major = 0, Minor = 0, Warning = 0, Info = 0
Critical Active Alarms:
AlarmID = 14
mcsadmin> getstor
System Storage Configuration
Performance Module (DBRoot) Storage Type = DataRedundancy
Data Redundant Configuration
Copies Per DBroot = 3
In this 3-data-copy configuration, the only PM that could be failed over to is PM3. My suggestion is that when we place the copies among the PMs, we need logic that guarantees at least one node is able to swap dbroots with PM1. HOTSTANDBY needs to be set on the correct module accordingly. |
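The suggestion above (set HOTSTANDBY on a module that can actually swap dbroots with the active parent) can be sketched as a small check. This is only an illustration in Python, not ColumnStore code: the function name `pick_hot_standby`, the `copies` mapping of dbroot name to the set of PMs holding a copy, and the sample 4-PM, 2-copy placement are all hypothetical and are not taken from this ticket.

```python
from typing import Dict, List, Optional, Set


def pick_hot_standby(
    active_parent: str,
    copies: Dict[str, Set[str]],
    candidates: List[str],
) -> Optional[str]:
    """Return the first candidate PM that holds a copy of at least one
    dbroot the active parent also holds a copy of, or None if no
    candidate qualifies."""
    # dbroots for which the active parent holds a copy
    parent_dbroots = [d for d, pms in copies.items() if active_parent in pms]
    for pm in candidates:
        if pm == active_parent:
            continue
        # A usable standby must hold a copy of one of those dbroots.
        if any(pm in copies[d] for d in parent_dbroots):
            return pm
    return None


# Hypothetical placement (assumption, not from the ticket).
copies = {
    "dbroot1": {"pm1", "pm2"},
    "dbroot2": {"pm2", "pm3"},
    "dbroot3": {"pm3", "pm4"},
    "dbroot4": {"pm4", "pm1"},
}

# pm3 is listed first but shares no dbroot copy with pm1, so it is skipped.
standby = pick_hot_standby("pm1", copies, ["pm3", "pm2", "pm4"])
print(standby)
```

In this made-up placement, pm3 is passed over because it holds no copy of a dbroot that pm1 holds, which is the kind of mismatch described in the comment above.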
| Comment by Ben Thompson (Inactive) [ 2017-10-26 ] |
|
In the event that the hot standby does not share a dbroot with the active parent module, when the old active parent comes back online it will resume active parent mode and the new active parent should resume standby mode.
Example 3PM / 2-copy system:
DBRoot1 has copies on PM1 and PM2
PM1 shuts down |
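One reading of the rule described above, written as a minimal Python sketch. It is not the actual ProcMgr logic: the names `shares_dbroot` and `roles_after_recovery`, the `copies` mapping, and the sample placement are hypothetical. The comment only covers the case where the two modules share no dbroot, so the sketch returns `None` for every other case rather than guessing.

```python
from typing import Dict, Optional, Set


def shares_dbroot(a: str, b: str, copies: Dict[str, Set[str]]) -> bool:
    """True if modules a and b both hold a copy of at least one dbroot."""
    return any(a in pms and b in pms for pms in copies.values())


def roles_after_recovery(
    old_parent: str,
    new_parent: str,
    copies: Dict[str, Set[str]],
) -> Optional[Dict[str, str]]:
    """Roles once the old active parent is back online.

    Only the no-shared-dbroot case is described in the comment, so any
    other case returns None (behaviour not specified here).
    """
    if shares_dbroot(old_parent, new_parent, copies):
        return None
    # Old parent resumes active parent mode; the module that took over
    # goes back to standby.
    return {"active_parent": old_parent, "hot_standby": new_parent}


# Hypothetical 4-PM, 2-copy placement (assumption, not from the ticket).
copies = {
    "dbroot1": {"pm1", "pm2"},
    "dbroot2": {"pm2", "pm3"},
    "dbroot3": {"pm3", "pm4"},
    "dbroot4": {"pm4", "pm1"},
}

roles = roles_after_recovery("pm1", "pm3", copies)
print(roles)
```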
| Comment by David Hill (Inactive) [ 2017-10-26 ] |
|
reviewed by David Hill |
| Comment by Daniel Lee (Inactive) [ 2017-10-27 ] |
|
Build verified: 1.1.1-1 rpm package. When PM1 recovered, the active OAM module did switch back to PM1 from PM2, and PM2 was the hot standby module again. Also tested shutdownsystem and startsystem. |