[MCOL-916] Gluster failover: Stack did not recover completely after PM1 reboot Created: 2017-09-13  Updated: 2023-10-26  Resolved: 2017-10-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.1.0
Fix Version/s: 1.1.1

Type: Bug Priority: Critical
Reporter: Daniel Lee (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 1
Labels: None

Sprint: 2017-18, 2017-19, 2017-20, 2017-21

 Description   

Build tested: 1.1.0-1 beta

[root@localhost columnstore]# cat crit.log
Aug 29 14:38:48 localhost ProcessMonitor[6881]: 48.096282 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ProcessManager, old PID = 6981
Aug 29 15:33:54 localhost ProcessMonitor[6881]: 54.703490 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ProcessManager, old PID = 7332
Aug 29 15:35:23 localhost ProcessManager[11150]: 23.027245 |0|0|0| C 17 CAL0000: startMgrProcessThread Exit with a failure, not all ProcMons ACTIVE

But I checked all 8 nodes and found that the ProcMons are all running.

Maybe at one point a ProcMon was not running.

Tried shutdownsystem from PM2 (the active PM after failover). The command failed:

Aug 29 17:00:50 localhost controllernode[11150]: 50.867853 |0|0|0| E 29 CAL0000: DBRM: error: SessionManager::clearSystemState() failed (network)
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.459731 |0|0|0| E 18 CAL0000: glusterUnassign failed.
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.496036 |0|0|0| E 18 CAL0000: Error unassigning gluster dbroot# 1
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.523484 |0|0|0| E 18 CAL0000: glusterUnassign failed.
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.531013 |0|0|0| E 18 CAL0000: Error unassigning gluster dbroot# 3
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.571047 |0|0|0| E 18 CAL0000: glusterUnassign failed.
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.597465 |0|0|0| E 18 CAL0000: Error unassigning gluster dbroot# 4
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.622959 |0|0|0| E 18 CAL0000: glusterUnassign failed.
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.640528 |0|0|0| E 18 CAL0000: Error unassigning gluster dbroot# 5
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.661192 |0|0|0| E 18 CAL0000: glusterUnassign failed.
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.670325 |0|0|0| E 18 CAL0000: Error unassigning gluster dbroot# 6
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.698744 |0|0|0| E 18 CAL0000: glusterUnassign failed.
Aug 29 17:00:55 localhost ProcessMonitor[6881]: 55.711682 |0|0|0| E 18 CAL0000: Error unassigning gluster dbroot# 7



 Comments   
Comment by David Thompson (Inactive) [ 2017-09-13 ]

This should be addressed by the related MCOLs from David Hill.

Comment by Daniel Lee (Inactive) [ 2017-09-14 ]

Build tested: 1.1.0-1

With a 1um2pm cluster stack:

rebooted pm2 = OK
rebooted pm1 = recovered OK, but did not fail over to pm2

I will try a 1um4pm stack next.

Comment by Daniel Lee (Inactive) [ 2017-09-15 ]

Tested with 1um4pm, same behavior.

Thinking that, because these are VMs, PM1 recovered "too" fast and PM2 did not have time for the failover to kick in, I suspended PM1 for a few minutes and then resumed it. PM1 recovered, but only PM1 could run getprocessstatus, not any of the other nodes. stopsystem also failed.

Comment by David Hill (Inactive) [ 2017-09-18 ]

On an Ubuntu 16 1um/2pm root gluster install, I stopped the pm1 instance (didn't reboot). The failover is in a good state as far as the status of the system processes and storage goes. There was an issue in the rollback: DMLProc was in BUSY_INIT for a long time, and it looks like rollback errors were hit, based on the logs.

------------------------------------------------------------

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Mon Sep 18 19:53:52 2017

Module um1 ACTIVE Mon Sep 18 19:44:18 2017
Module pm1 AUTO_DISABLED/DEGRADED Mon Sep 18 19:48:42 2017
Module pm2 ACTIVE Mon Sep 18 19:51:26 2017

Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Mon Sep 18 19:43:33 2017 1385
ServerMonitor um1 ACTIVE Mon Sep 18 19:43:58 2017 1987
DBRMWorkerNode um1 ACTIVE Mon Sep 18 19:44:00 2017 2017
ExeMgr um1 ACTIVE Mon Sep 18 19:51:31 2017 4044
DDLProc um1 ACTIVE Mon Sep 18 19:51:38 2017 4065
DMLProc um1 ACTIVE Mon Sep 18 19:53:52 2017 4086
mysqld um1 ACTIVE Mon Sep 18 19:51:28 2017

ProcessMonitor pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
ProcessManager pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
DBRMControllerNode pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
ServerMonitor pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
DBRMWorkerNode pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
DecomSvr pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
PrimProc pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017
WriteEngineServer pm1 AUTO_OFFLINE Mon Sep 18 19:48:44 2017

ProcessMonitor pm2 ACTIVE Mon Sep 18 19:43:41 2017 1371
ProcessManager pm2 ACTIVE Mon Sep 18 19:50:30 2017 2235
DBRMControllerNode pm2 ACTIVE Mon Sep 18 19:51:18 2017 3029
ServerMonitor pm2 ACTIVE Mon Sep 18 19:51:20 2017 3050
DBRMWorkerNode pm2 ACTIVE Mon Sep 18 19:51:20 2017 3084
DecomSvr pm2 ACTIVE Mon Sep 18 19:51:24 2017 3122
PrimProc pm2 ACTIVE Mon Sep 18 19:51:26 2017 3152
WriteEngineServer pm2 ACTIVE Mon Sep 18 19:51:27 2017 3174

Active Alarm Counts: Critical = 0, Major = 1, Minor = 0, Warning = 0, Info = 0
mcsadmin> getst
getstorageconfig Mon Sep 18 19:57:55 2017

System Storage Configuration

Performance Module (DBRoot) Storage Type = DataRedundancy
System Assigned DBRoot Count = 2
DBRoot IDs assigned to 'pm1' =
DBRoot IDs assigned to 'pm2' = 1, 2

Data Redundant Configuration

Copies Per DBroot = 2
mcsadmin>

Sep 18 19:51:44 ubuntu16-um1 DMLProc[4086]: 44.573121 |0|0|0| I 20 CAL0002: DMLProc starts rollbackAll.
Sep 18 19:53:51 ubuntu16-um1 joblist[4086]: 51.808538 |0|0|0| E 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/writeengine/client/we_clients.cpp @ 268 Could not connect to pm1_WriteEngineServer: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 10 inet: 10.128.0.2 port: 8630
Sep 18 19:53:51 ubuntu16-um1 joblist[4086]: 51.808543 |0|0|0| E 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/writeengine/client/we_clients.cpp @ 268 Could not connect to pm1_WriteEngineServer: InetStreamSocket::connect: connect() error: Connection timed out to: InetStreamSocket: sd: 11 inet: 10.128.0.2 port: 8630
Sep 18 19:53:52 ubuntu16-um1 oamcpp[4086]: 52.817152 |0|0|0| E 08 CAL0000: OamCache::checkReload shows state for pm1 as AUTO_DISABLED
Sep 18 19:53:52 ubuntu16-um1 DMLProc[4086]: 52.819025 |0|0|0| I 20 CAL0002: DMLProc will rollback 0 tables.
Sep 18 19:53:52 ubuntu16-um1 DMLProc[4086]: 52.829735 |0|0|0| I 20 CAL0002: DMLProc finished rollbackAll.

Comment by David Hill (Inactive) [ 2017-09-18 ]

pm1 came back into the system after I started it...

System and Module statuses

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Mon Sep 18 20:00:01 2017

Module um1 ACTIVE Mon Sep 18 19:44:18 2017
Module pm1 ACTIVE Mon Sep 18 19:59:45 2017
Module pm2 ACTIVE Mon Sep 18 19:51:26 2017

Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Mon Sep 18 19:43:33 2017 1385
ServerMonitor um1 ACTIVE Mon Sep 18 19:43:58 2017 1987
DBRMWorkerNode um1 ACTIVE Mon Sep 18 19:44:00 2017 2017
ExeMgr um1 ACTIVE Mon Sep 18 19:59:49 2017 4193
DDLProc um1 ACTIVE Mon Sep 18 19:59:55 2017 4214
DMLProc um1 ACTIVE Mon Sep 18 20:00:01 2017 4232
mysqld um1 ACTIVE Mon Sep 18 19:51:28 2017

ProcessMonitor pm1 ACTIVE Mon Sep 18 19:59:28 2017 1415
ProcessManager pm1 HOT_STANDBY Mon Sep 18 19:59:35 2017 2055
DBRMControllerNode pm1 COLD_STANDBY Mon Sep 18 19:59:35 2017
ServerMonitor pm1 ACTIVE Mon Sep 18 19:59:38 2017 2125
DBRMWorkerNode pm1 ACTIVE Mon Sep 18 19:59:40 2017 2153
DecomSvr pm1 ACTIVE Mon Sep 18 19:59:43 2017 2170
PrimProc pm1 ACTIVE Mon Sep 18 19:59:46 2017 2181
WriteEngineServer pm1 ACTIVE Mon Sep 18 19:59:47 2017 2192

ProcessMonitor pm2 ACTIVE Mon Sep 18 19:43:41 2017 1371
ProcessManager pm2 ACTIVE Mon Sep 18 19:50:30 2017 2235
DBRMControllerNode pm2 ACTIVE Mon Sep 18 19:51:18 2017 3029
ServerMonitor pm2 ACTIVE Mon Sep 18 19:51:20 2017 3050
DBRMWorkerNode pm2 ACTIVE Mon Sep 18 19:51:20 2017 3084
DecomSvr pm2 ACTIVE Mon Sep 18 19:51:24 2017 3122
PrimProc pm2 ACTIVE Mon Sep 18 19:51:26 2017 3152
WriteEngineServer pm2 ACTIVE Mon Sep 18 19:59:19 2017 6200

Active Alarm Counts: Critical = 0, Major = 0, Minor = 0, Warning = 0, Info = 0
mcsadmin> getst
getstorageconfig Mon Sep 18 20:00:43 2017

System Storage Configuration

Performance Module (DBRoot) Storage Type = DataRedundancy
System Assigned DBRoot Count = 2
DBRoot IDs assigned to 'pm1' = 2
DBRoot IDs assigned to 'pm2' = 1

Comment by Daniel Lee (Inactive) [ 2017-09-18 ]

Additional info on the failover issue

After suspending PM1 (in Vagrant), the Active OAM module failed over to PM2. Now PM2 has both dbroot1 and dbroot2 assigned. PM3 became the HOT_STANDBY module.

After resuming PM1, PM1 became disabled because dbroot2 could not be assigned to it, since PM1 never had a copy of dbroot2.

[root@localhost columnstore]# ma

MariaDB ColumnStore Admin Console
enter 'help' for list of commands
enter 'exit' to exit the MariaDB ColumnStore Command Console
use up/down arrows to recall commands

Active Alarm Counts: Critical = 1, Major = 0, Minor = 0, Warning = 0, Info = 0

Critical Active Alarms:

AlarmID = 14
Brief Description = MODULE_DOWN_AUTO
Alarm Severity = CRITICAL
Time Issued = Mon Sep 18 20:50:13 2017
Reporting Module = pm2
Reporting Process = ProcessManager
Reported Device = pm1

mcsadmin> getstor
getstorageconfig Mon Sep 18 21:08:03 2017

System Storage Configuration

Performance Module (DBRoot) Storage Type = DataRedundancy
System Assigned DBRoot Count = 4
DBRoot IDs assigned to 'pm1' =
DBRoot IDs assigned to 'pm2' = 1, 2
DBRoot IDs assigned to 'pm3' = 3
DBRoot IDs assigned to 'pm4' = 4

Data Redundant Configuration

Copies Per DBroot = 3
DBRoot #1 has copies on PMs = 1 2 3
DBRoot #2 has copies on PMs = 2 3 4
DBRoot #3 has copies on PMs = 1 3 4
DBRoot #4 has copies on PMs = 1 2 4

In this 3-data-copy configuration, the only PM that can be failed over to is PM3.

My suggestion is that when we place the copies among the PMs, we need logic ensuring that there is at least one node able to swap dbroots with PM1. HOT_STANDBY needs to be set on the correct module accordingly; a sketch of such a check follows.
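
A minimal sketch of such a placement check, assuming the copy layout is available in the shape of the getstorageconfig output above. This is illustrative pseudologic, not the actual ProcessManager code; canSwapDbroots and the map layouts are hypothetical:

// Hypothetical sketch: given the "copies per dbroot" map from
// getstorageconfig, find which PMs could serve as hot standby for a
// given PM: the standby must hold a copy of every dbroot assigned to
// that PM, and that PM must hold a copy of every dbroot assigned to
// the standby, so the two can swap dbroots after recovery.
#include <iostream>
#include <map>
#include <set>

using PmId = int;
using DbRootId = int;

bool canSwapDbroots(PmId active, PmId standby,
                    const std::map<PmId, std::set<DbRootId>>& assigned,
                    const std::map<DbRootId, std::set<PmId>>& copies)
{
    for (DbRootId r : assigned.at(active))
        if (copies.at(r).count(standby) == 0) return false;
    for (DbRootId r : assigned.at(standby))
        if (copies.at(r).count(active) == 0) return false;
    return true;
}

int main()
{
    // The 4-PM / 3-copy layout from the getstorageconfig output above.
    std::map<DbRootId, std::set<PmId>> copies = {
        {1, {1, 2, 3}}, {2, {2, 3, 4}}, {3, {1, 3, 4}}, {4, {1, 2, 4}}};
    std::map<PmId, std::set<DbRootId>> assigned = {
        {1, {1}}, {2, {2}}, {3, {3}}, {4, {4}}};

    for (PmId pm = 2; pm <= 4; ++pm)
        if (canSwapDbroots(1, pm, assigned, copies))
            std::cout << "pm" << pm << " can stand by for pm1\n";
    // Prints only pm3: PM2's dbroot2 has no copy on PM1, and PM4 holds
    // no copy of dbroot1, so PM3 is the only valid standby for PM1.
}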

Comment by Ben Thompson (Inactive) [ 2017-10-26 ]

In the event that the hot standby does not share a dbroot with the active parent module, when the old active parent comes back online it will resume active parent mode, and the new active parent should resume standby mode (see the sketch after the example below).

Example 3PM / 2-copy system:

DBRoot1 has copies on PM1 and PM2
DBRoot2 has copies on PM2 and PM3

PM1 shuts down
PM2 has DBROOT 1 and 2 and becomes active parent
PM1 restarts
PM1 cannot take DBROOT 2 when returning
PM1 retakes DBROOT1 and resumes active parent
PM2 resumes hot standby mode
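
A minimal sketch of that decision, assuming it is driven purely by replica placement. The function name oldParentResumesActive is hypothetical; this is not the actual ProcessManager code:

// Hypothetical sketch: the returning old active parent resumes active
// parent exactly when it holds no replica of any dbroot the new active
// parent originally owned, i.e. the two PMs cannot swap dbroots.
#include <iostream>
#include <set>

bool oldParentResumesActive(const std::set<int>& newParentOwnDbroots,
                            const std::set<int>& oldParentCopies)
{
    for (int r : newParentOwnDbroots)
        if (oldParentCopies.count(r))
            return false;  // a swap is possible; the new parent stays active
    return true;           // no shared dbroot: restore the old roles
}

int main()
{
    // The 3PM / 2-copy example above: PM2 owns DBRoot2 (copies on PM2
    // and PM3), and PM1 replicates only DBRoot1, so PM1 resumes active
    // parent on return.
    std::cout << oldParentResumesActive({2}, {1}) << '\n';      // 1

    // The 2PM / 2-copy test earlier in this ticket: both PMs replicate
    // both dbroots, so the PMs swap dbroots and pm2 stays active parent.
    std::cout << oldParentResumesActive({2}, {1, 2}) << '\n';   // 0
}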

Comment by David Hill (Inactive) [ 2017-10-26 ]

Reviewed by David Hill.

Comment by Daniel Lee (Inactive) [ 2017-10-27 ]

Build verified: 1.1.1-1 rpm package.

When PM1 recovered, the active OAM module did switch back to PM1 from PM2, and PM2 became the hot standby module again. Also tested shutdownsystem and startsystem.
