[MCOL-1138] pm1 failover testing - didn't leave a HOT_STANDBY ProcMgr on remaining node Created: 2018-01-05  Updated: 2023-10-26  Resolved: 2018-01-25

Status: Closed
Project: MariaDB ColumnStore
Component/s: ?
Affects Version/s: 1.1.2
Fix Version/s: 1.1.3

Type: Bug Priority: Major
Reporter: David Hill (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Non-root Amazon AMI with EBS, 3 PM combo system


Sprint: 2018-02

 Description   

Started with pm1 as the active/master node after install. Stopped the pm1 instance; pm3 took over as master, but the pm2 ProcMgr didn't go to HOT_STANDBY.

[mariadb-user@ip-172-30-0-204 ~]$ ma getsystemi
getsysteminfo Fri Jan 5 15:47:17 2018

System 1.1.2

System and Module statuses

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Fri Jan 5 15:43:52 2018

Module pm1 ACTIVE Fri Jan 5 15:43:48 2018
Module pm2 ACTIVE Fri Jan 5 15:43:44 2018
Module pm3 ACTIVE Fri Jan 5 15:43:43 2018

Active Parent OAM Performance Module is 'pm1'
Primary Front-End MariaDB ColumnStore Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 ACTIVE Fri Jan 5 15:42:23 2018 1283
ProcessManager pm1 ACTIVE Fri Jan 5 15:42:29 2018 1440
DBRMControllerNode pm1 ACTIVE Fri Jan 5 15:43:18 2018 2897
ServerMonitor pm1 ACTIVE Fri Jan 5 15:43:20 2018 2956
DBRMWorkerNode pm1 ACTIVE Fri Jan 5 15:43:20 2018 2996
DecomSvr pm1 ACTIVE Fri Jan 5 15:43:24 2018 3159
PrimProc pm1 ACTIVE Fri Jan 5 15:43:27 2018 3262
ExeMgr pm1 ACTIVE Fri Jan 5 15:43:37 2018 5003
WriteEngineServer pm1 ACTIVE Fri Jan 5 15:43:41 2018 5143
DDLProc pm1 ACTIVE Fri Jan 5 15:43:45 2018 5333
DMLProc pm1 ACTIVE Fri Jan 5 15:43:49 2018 5494
mysqld pm1 ACTIVE Fri Jan 5 15:43:41 2018 2696

ProcessMonitor pm2 ACTIVE Fri Jan 5 15:43:07 2018 15334
ProcessManager pm2 COLD_STANDBY Fri Jan 5 15:43:36 2018
DBRMControllerNode pm2 COLD_STANDBY Fri Jan 5 15:43:36 2018
ServerMonitor pm2 ACTIVE Fri Jan 5 15:43:22 2018 15820
DBRMWorkerNode pm2 ACTIVE Fri Jan 5 15:43:23 2018 15846
DecomSvr pm2 ACTIVE Fri Jan 5 15:43:26 2018 15877
PrimProc pm2 ACTIVE Fri Jan 5 15:43:30 2018 15885
ExeMgr pm2 ACTIVE Fri Jan 5 15:43:39 2018 16794
WriteEngineServer pm2 ACTIVE Fri Jan 5 15:43:43 2018 16815
DDLProc pm2 COLD_STANDBY Fri Jan 5 15:43:44 2018
DMLProc pm2 COLD_STANDBY Fri Jan 5 15:43:44 2018
mysqld pm2 ACTIVE Fri Jan 5 15:43:45 2018 15694

ProcessMonitor pm3 ACTIVE Fri Jan 5 15:43:08 2018 14322
ProcessManager pm3 HOT_STANDBY Fri Jan 5 15:43:12 2018 14457
DBRMControllerNode pm3 COLD_STANDBY Fri Jan 5 15:43:24 2018
ServerMonitor pm3 ACTIVE Fri Jan 5 15:43:27 2018 14823
DBRMWorkerNode pm3 ACTIVE Fri Jan 5 15:43:28 2018 14868
DecomSvr pm3 ACTIVE Fri Jan 5 15:43:31 2018 14882
PrimProc pm3 ACTIVE Fri Jan 5 15:43:34 2018 14890
ExeMgr pm3 ACTIVE Fri Jan 5 15:43:39 2018 14969
WriteEngineServer pm3 ACTIVE Fri Jan 5 15:43:43 2018 14990
DDLProc pm3 COLD_STANDBY Fri Jan 5 15:43:43 2018
DMLProc pm3 COLD_STANDBY Fri Jan 5 15:43:43 2018
mysqld pm3 ACTIVE Fri Jan 5 15:43:26 2018 14698

Active Alarm Counts: Critical = 0, Major = 0, Minor = 0, Warning = 0, Info = 0
[mariadb-user@ip-172-30-0-204 ~]$

STOP PM1

System 1.1.2

System and Module statuses

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Thu Jan 4 21:36:54 2018

Module pm1 AUTO_DISABLED/DEGRADED Thu Jan 4 21:35:01 2018
Module pm2 ACTIVE Thu Jan 4 21:36:12 2018
Module pm3 ACTIVE Thu Jan 4 21:35:38 2018

Active Parent OAM Performance Module is 'pm3'
Primary Front-End MariaDB ColumnStore Module is 'pm3'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
ProcessManager pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
DBRMControllerNode pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
ServerMonitor pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
DBRMWorkerNode pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
DecomSvr pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
PrimProc pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
ExeMgr pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
WriteEngineServer pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
DDLProc pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
DMLProc pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
mysqld pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018

ProcessMonitor pm2 ACTIVE Thu Jan 4 21:19:18 2018 3458
ProcessManager pm2 COLD_STANDBY Thu Jan 4 21:36:12 2018
DBRMControllerNode pm2 COLD_STANDBY Thu Jan 4 21:36:12 2018
ServerMonitor pm2 ACTIVE Thu Jan 4 21:19:33 2018 3951
DBRMWorkerNode pm2 ACTIVE Thu Jan 4 21:19:34 2018 3963
DecomSvr pm2 ACTIVE Thu Jan 4 21:19:37 2018 3995
PrimProc pm2 ACTIVE Thu Jan 4 21:19:40 2018 4003
ExeMgr pm2 ACTIVE Thu Jan 4 21:19:49 2018 4914
WriteEngineServer pm2 ACTIVE Thu Jan 4 21:19:53 2018 4935
DDLProc pm2 COLD_STANDBY Thu Jan 4 21:36:12 2018
DMLProc pm2 COLD_STANDBY Thu Jan 4 21:36:12 2018
mysqld pm2 ACTIVE Thu Jan 4 21:36:14 2018 3825

ProcessMonitor pm3 ACTIVE Thu Jan 4 21:19:19 2018 3457
ProcessManager pm3 ACTIVE Thu Jan 4 21:36:38 2018 3599
DBRMControllerNode pm3 ACTIVE Thu Jan 4 21:35:15 2018 7013
ServerMonitor pm3 ACTIVE Thu Jan 4 21:35:17 2018 7029
DBRMWorkerNode pm3 ACTIVE Thu Jan 4 21:35:17 2018 7050
DecomSvr pm3 ACTIVE Thu Jan 4 21:35:21 2018 7088
PrimProc pm3 ACTIVE Thu Jan 4 21:35:23 2018 7106
ExeMgr pm3 ACTIVE Thu Jan 4 21:35:27 2018 7177
WriteEngineServer pm3 ACTIVE Thu Jan 4 21:35:31 2018 7209
DDLProc pm3 ACTIVE Thu Jan 4 21:35:35 2018 7257
DMLProc pm3 ACTIVE Thu Jan 4 21:36:54 2018 7320
mysqld pm3 ACTIVE Thu Jan 4 21:36:26 2018 6868

Active Alarm Counts: Critical = 3, Major = 1, Minor = 0, Warning = 0, Info = 0
mcsadmin> getstorage



 Comments   
Comment by David Hill (Inactive) [ 2018-01-05 ]

pm2 status

MariaDB [(none)]> show master status\G;

*************************** 1. row ***************************
File: mysql-bin.000002
Position: 342
Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)

ERROR: No query specified

MariaDB [(none)]> show slave status\G;

*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.30.0.129
Master_User: idbrep
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000003
Read_Master_Log_Pos: 2834
Relay_Log_File: relay-bin.000002
Relay_Log_Pos: 710
Relay_Master_Log_File: mysql-bin.000003
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 2834
Relay_Log_Space: 1013
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 3
Master_SSL_Crl:
Master_SSL_Crlpath:
Using_Gtid: No
Gtid_IO_Pos:
Replicate_Do_Domain_Ids:
Replicate_Ignore_Domain_Ids:
Parallel_Mode: conservative
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
1 row in set (0.00 sec)

ERROR: No query specified

MariaDB [(none)]>
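
The replication state captured above can also be checked mechanically. The following is a minimal Python sketch, assuming the vertical-format output of `show slave status\G` has been saved as text. The helper names `parse_vertical` and `replica_healthy` are hypothetical and are not part of MariaDB or ColumnStore; the sample values are copied from the output in this comment.

```python
def parse_vertical(text):
    """Parse vertical-format 'Field: value' lines into a dict.

    Row banners and the 'N row in set' trailer contain no colon and are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        # Keep only lines that look like a single-token field name.
        if sep and key and " " not in key:
            fields[key] = value.strip()
    return fields


def replica_healthy(fields):
    """True if both replication threads run and no error number is recorded."""
    return (
        fields.get("Slave_IO_Running") == "Yes"
        and fields.get("Slave_SQL_Running") == "Yes"
        and fields.get("Last_Errno") == "0"
    )


# Excerpt of the pm2 output shown in this comment.
SAMPLE = """\
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.30.0.129
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Seconds_Behind_Master: 0
1 row in set (0.00 sec)
"""

if __name__ == "__main__":
    status = parse_vertical(SAMPLE)
    print(status["Master_Host"], replica_healthy(status))
```

The check is deliberately narrow: it looks only at the two thread flags and `Last_Errno`, which is enough to confirm that pm2 was still replicating after the failover.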

Comment by David Hill (Inactive) [ 2018-01-22 ]

Fixed 2 issues:

1. Fixed the issue where no HOT_STANDBY ProcMgr existed after a pm1 outage.
2. On a parent PM outage, recovery was running in both the parent-outage code and the module-outage code. Changed it so the outage is processed only in the parent-outage code.

HOW TO TEST:

1. On a 3 PM combo system with storage, remove pm1 and make sure you are left with a HOT_STANDBY ProcMgr.
2. On a pm1 outage, make sure the processes on the 2 remaining PMs are in a good state.
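
The first test step can be automated against saved `getprocessstatus` output. This is a minimal Python sketch, assuming the output is available as plain text with one process per line in the layout shown in this ticket. The function names `procmgr_states` and `failover_ok` are hypothetical, and the pass criterion (exactly one ACTIVE ProcessManager plus at least one HOT_STANDBY) is an interpretation of the test step, not a rule taken from the ColumnStore code.

```python
import re

# A process row: process name, module (pmN), status, then timestamp and optional PID.
ROW = re.compile(r"^(\w+)\s+(pm\d+)\s+([A-Z_/]+)\s+(.*)$")


def procmgr_states(status_text):
    """Return {module: status} for every ProcessManager row in the text."""
    states = {}
    for line in status_text.splitlines():
        match = ROW.match(line.strip())
        if match and match.group(1) == "ProcessManager":
            states[match.group(2)] = match.group(3)
    return states


def failover_ok(status_text):
    """True if exactly one ProcessManager is ACTIVE and at least one is HOT_STANDBY."""
    statuses = list(procmgr_states(status_text).values())
    return statuses.count("ACTIVE") == 1 and "HOT_STANDBY" in statuses


# ProcessManager rows after the pm1 outage on 1.1.2 (from the description).
SAMPLE_BROKEN = """\
ProcessManager pm1 AUTO_OFFLINE Thu Jan 4 21:35:51 2018
ProcessManager pm2 COLD_STANDBY Thu Jan 4 21:36:12 2018
ProcessManager pm3 ACTIVE Thu Jan 4 21:36:38 2018 3599
"""

# ProcessManager rows after the pm1 outage on 1.1.3-1 (from the verification comment).
SAMPLE_FIXED = """\
ProcessManager pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
ProcessManager pm2 ACTIVE Thu Jan 25 16:58:59 2018 3527
ProcessManager pm3 HOT_STANDBY Thu Jan 25 16:59:35 2018 5386
"""

if __name__ == "__main__":
    print("1.1.2:", failover_ok(SAMPLE_BROKEN))
    print("1.1.3-1:", failover_ok(SAMPLE_FIXED))
```

Rows for other processes and the table header are ignored, so the full command output can be passed in unchanged.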

Comment by David Hill (Inactive) [ 2018-01-22 ]

https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/379

Comment by Ben Thompson (Inactive) [ 2018-01-22 ]

Reviewed / Merged

Comment by Daniel Lee (Inactive) [ 2018-01-25 ]

Build verified: 1.1.3-1 created on 01/24/2018, ami-99b40be1

Verified this ticket.
Also encountered the issue described in MCOL-1034.

mcsadmin> getprocessstatus
getprocessstatus Thu Jan 25 17:13:02 2018

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
ProcessManager pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
DBRMControllerNode pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
ServerMonitor pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
DBRMWorkerNode pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
DecomSvr pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
PrimProc pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
ExeMgr pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
WriteEngineServer pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
DDLProc pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
DMLProc pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018
mysqld pm1 AUTO_OFFLINE Thu Jan 25 16:58:44 2018

ProcessMonitor pm2 ACTIVE Thu Jan 25 16:50:30 2018 3362
ProcessManager pm2 ACTIVE Thu Jan 25 16:58:59 2018 3527
DBRMControllerNode pm2 ACTIVE Thu Jan 25 16:58:55 2018 6839
ServerMonitor pm2 ACTIVE Thu Jan 25 16:58:57 2018 6874
DBRMWorkerNode pm2 ACTIVE Thu Jan 25 16:58:57 2018 6919
DecomSvr pm2 ACTIVE Thu Jan 25 16:59:01 2018 6962
PrimProc pm2 ACTIVE Thu Jan 25 16:59:03 2018 6981
ExeMgr pm2 ACTIVE Thu Jan 25 16:59:07 2018 7054
WriteEngineServer pm2 ACTIVE Thu Jan 25 16:59:11 2018 7103
DDLProc pm2 ACTIVE Thu Jan 25 16:59:15 2018 7165
DMLProc pm2 ACTIVE Thu Jan 25 16:59:32 2018 7233
mysqld pm2 ACTIVE Thu Jan 25 16:58:55 2018 6648

ProcessMonitor pm3 ACTIVE Thu Jan 25 16:50:31 2018 3370
ProcessManager pm3 HOT_STANDBY Thu Jan 25 16:59:35 2018 5386
DBRMControllerNode pm3 COLD_STANDBY Thu Jan 25 16:50:46 2018
ServerMonitor pm3 ACTIVE Thu Jan 25 16:50:49 2018 3881
DBRMWorkerNode pm3 ACTIVE Thu Jan 25 16:50:50 2018 3912
DecomSvr pm3 ACTIVE Thu Jan 25 16:50:53 2018 3925
PrimProc pm3 ACTIVE Thu Jan 25 16:50:56 2018 3933
ExeMgr pm3 ACTIVE Thu Jan 25 16:59:23 2018 5316
WriteEngineServer pm3 ACTIVE Thu Jan 25 16:59:28 2018 5354
DDLProc pm3 COLD_STANDBY Thu Jan 25 16:51:05 2018
DMLProc pm3 COLD_STANDBY Thu Jan 25 16:51:05 2018
mysqld pm3 ACTIVE Thu Jan 25 16:59:24 2018 5267

Generated at Thu Feb 08 02:26:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.