handling multi server columnstore failover (MCOL-1466)

[MCOL-1572] About our ParentOAM failure handling issue with AmazonAMI Created: 2018-07-20  Updated: 2021-04-05  Resolved: 2021-04-05

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: None
Fix Version/s: N/A

Type: Sub-Task Priority: Critical
Reporter: Developer Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None

Attachments: Text File 45865_post-configure-steps-followed.txt     File columnstoreSupportReport.mycolumnstore (1).tar.gz    
Issue Links:
PartOf
is part of MCOL-1466 handling multi server columnstore fai... Closed

 Description   

Hi David,
As per your suggestion, we have initiated all our Amazon instances with the "MariaDB-ColumnStore-1.1.5 - ami-a0c09edf" AMI. We have also added separate ext2 volumes for each PM module. We have a multi-server ColumnStore system (1 UM, 3 PM). When we configured the system for the first time, everything seemed fine. We then tested what happens when PM1, which is our parentOAM, fails (instance stopped).
We found that the system moved parentOAM to another PM and PM1 became disabled, but its dbroot was not moved. We also noticed that the database became read-only: it allows only SELECT operations, while "CREATE TABLE, UPDATE, INSERT, DELETE" stopped working. Why?
Please also help us answer the questions below.
1 > Can you please let us know what the system behaviour should be when parentOAM fails?
2 > Can you please check the attached "post-configure-steps-followed.txt" to see whether we followed the proper steps to configure the system?
3 > Can you also provide details about which EBS volume type (gp2, io1, sc1, st1, standard) is best suited to a large amount of data? We have some tables with more than 50 million records.
We have attached the columnstoreSupport report.
The current system status is as below.
Component    Status                     Last Status Change
------------ -------------------------- ------------------------
System       ACTIVE                     Wed Jul 11 10:08:23 2018

Module um1   ACTIVE                     Wed Jul 11 09:45:53 2018
Module pm1   AUTO_DISABLED/DEGRADED     Wed Jul 11 09:52:24 2018
Module pm2   DEGRADED                   Wed Jul 11 09:58:53 2018
Module pm3   ACTIVE                     Wed Jul 11 09:45:43 2018

Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process            Module Status          Last Status Change       Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor     um1    ACTIVE          Wed Jul 11 09:45:09 2018 15729
ServerMonitor      um1    ACTIVE          Wed Jul 11 09:45:28 2018 16138
DBRMWorkerNode     um1    MAN_OFFLINE     Wed Jul 11 09:53:18 2018
ExeMgr             um1    ACTIVE          Wed Jul 11 09:54:11 2018 20972
DDLProc            um1    MAN_OFFLINE     Wed Jul 11 09:54:30 2018
DMLProc            um1    MAN_OFFLINE     Wed Jul 11 09:54:42 2018
mysqld             um1    ACTIVE          Wed Jul 11 09:54:18 2018 21247

ProcessMonitor     pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018
ProcessManager     pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018
DBRMControllerNode pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018
ServerMonitor      pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018
DBRMWorkerNode     pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018
DecomSvr           pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018
PrimProc           pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018
WriteEngineServer  pm1    AUTO_OFFLINE    Wed Jul 11 09:52:34 2018

ProcessMonitor     pm2    ACTIVE          Wed Jul 11 09:45:11 2018 15035
ProcessManager     pm2    ACTIVE          Wed Jul 11 09:52:58 2018 15173
DBRMControllerNode pm2    AUTO_OFFLINE    Wed Jul 11 09:58:53 2018
ServerMonitor      pm2    ACTIVE          Wed Jul 11 09:52:44 2018 16531
DBRMWorkerNode     pm2    ACTIVE          Wed Jul 11 09:53:27 2018 17060
DecomSvr           pm2    ACTIVE          Wed Jul 11 09:52:48 2018 16616
PrimProc           pm2    ACTIVE          Wed Jul 11 09:54:06 2018 17493
WriteEngineServer  pm2    ACTIVE          Wed Jul 11 09:54:20 2018 17726

ProcessMonitor     pm3    ACTIVE          Wed Jul 11 09:45:12 2018 29271
ProcessManager     pm3    HOT_STANDBY     Wed Jul 11 09:54:47 2018 30244
DBRMControllerNode pm3    COLD_STANDBY    Wed Jul 11 09:52:50 2018
ServerMonitor      pm3    ACTIVE          Wed Jul 11 09:45:37 2018 29528
DBRMWorkerNode     pm3    MAN_OFFLINE     Wed Jul 11 09:53:57 2018
DecomSvr           pm3    ACTIVE          Wed Jul 11 09:45:41 2018 29570
PrimProc           pm3    ACTIVE          Wed Jul 11 09:54:07 2018 30144
WriteEngineServer  pm3    ACTIVE          Wed Jul 11 09:54:21 2018 30211

Active Alarm Counts: Critical = 2, Major = 3, Minor = 4, Warning = 0, Info = 0
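
One way to read a status dump like the one above is to filter it for processes whose status ends in _OFFLINE. The sketch below is only an illustration: the function name offline_processes and the sample constant are hypothetical, the sample lines are copied from this ticket, and the parser assumes each process line starts with three whitespace-separated fields (process, module, status). In this dump, DDLProc and DMLProc on um1 are MAN_OFFLINE, which may be related to DDL and DML statements failing while SELECT still works.

```python
# Sample lines copied from the status dump in this ticket.
STATUS_SAMPLE = """\
DBRMWorkerNode um1 MAN_OFFLINE Wed Jul 11 09:53:18 2018
ExeMgr um1 ACTIVE Wed Jul 11 09:54:11 2018 20972
DDLProc um1 MAN_OFFLINE Wed Jul 11 09:54:30 2018
DMLProc um1 MAN_OFFLINE Wed Jul 11 09:54:42 2018
DBRMControllerNode pm2 AUTO_OFFLINE Wed Jul 11 09:58:53 2018
ProcessManager pm3 HOT_STANDBY Wed Jul 11 09:54:47 2018 30244
"""


def offline_processes(status_text, module=None):
    """Return (process, module, status) tuples whose status ends in _OFFLINE.

    Hypothetical helper: assumes the first three whitespace-separated
    fields of each line are process name, module name, and status.
    """
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or short lines carry no process status
        process, mod, status = fields[:3]
        if status.endswith("_OFFLINE") and (module is None or mod == module):
            found.append((process, mod, status))
    return found


if __name__ == "__main__":
    for process, mod, status in offline_processes(STATUS_SAMPLE, module="um1"):
        print(f"{mod}: {process} is {status}")
```

Passing module="um1" narrows the result to the User Module, where the DDL and DML front-end processes run in this deployment.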



 Comments   
Comment by David Hill (Inactive) [ 2018-07-25 ]

So the same issue showed up in the second scenario. The system was in a non-functioning state because there were missing DBRM files. This is the set of DBRM files from pm2 on dbroot 1. As shown in the other MCOL, it looks like just a startup set of files. The OID file is missing, which is preventing the system from starting up.

total 16
-rw-rw-rw- 1 mariadb-user mariadb-user   72 Jul 11 09:52 BRM_saves_current
-rw-rw-rw- 1 mariadb-user mariadb-user 3436 Jul 11 09:52 BRM_saves_em
-rw-rw-rw- 1 mariadb-user mariadb-user    0 Jul 11 09:52 BRM_saves_journal
-rw-rw-rw- 1 mariadb-user mariadb-user   12 Jul 11 09:52 BRM_saves_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user    8 Jul 11 09:52 BRM_saves_vss
-rw-rw-r-- 1 mariadb-user mariadb-user    0 Jul 11 09:52 SMTxnID

################# cat /home/mariadb-user/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves_current #################

Again, I'm not sure what is happening to these files during the pm1 to pm2 failover. I'm trying to reproduce the issue.
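
A check like the following could flag this condition before a restart is attempted. It is a minimal sketch, not part of ColumnStore: the function name missing_dbrm_files is hypothetical, and the expected file set is an assumption taken from the file names that appear in the directory listings in this ticket (the BRM_savesA_* and BRM_savesB_* files are left out for brevity).

```python
from pathlib import Path

# Assumption: file names copied from the dbrm listings in this ticket.
# This is not an official or complete list of DBRM files.
EXPECTED_DBRM_FILES = {
    "BRM_saves_current",
    "BRM_saves_em",
    "BRM_saves_journal",
    "BRM_saves_vbbm",
    "BRM_saves_vss",
    "oidbitmap",
    "SMTxnID",
}


def missing_dbrm_files(dbrm_dir, expected=EXPECTED_DBRM_FILES):
    """Return the sorted expected file names that are absent from dbrm_dir."""
    present = {p.name for p in Path(dbrm_dir).iterdir() if p.is_file()}
    return sorted(set(expected) - present)
```

Run against a directory holding only the six files listed above, this returns ['oidbitmap'], the OID file noted as missing. The directory to pass in would be the dbrm path on DBROOT 1, for example /home/mariadb-user/mariadb/columnstore/data1/systemFiles/dbrm.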

Also, here is some info on how PM DBROOT assignments work and how they are handled on FAILOVER.

After a normal install, PM1 is the Parent and it will have DBROOT 1 assigned to it. DBROOT 1 is "ALWAYS" assigned to the Parent Module. PM2 has DBROOT 2 and PM3 has DBROOT 3. In this case, PM2 is the HOT-STANDBY Parent.

When PM1 GOES DOWN, this is what happens:
1. HOT-STANDBY PARENT PM2 is made the Active Parent and DBROOT 1 is mounted and assigned to PM2. So with PM1 down, PM2 has DBROOT 1 and 2.

When PM1 RECOVERS:
1. PM2 stays as the Active Parent and DBROOT 1 stays there. DBROOT 2 is now assigned to PM1. This is normal.

So this failover process is all working as designed. But in your case, the DBRM files on DBROOT 1 are getting deleted or lost somehow, leaving the system with partial DBRM files and unable to start up.

So that is your issue. I will see if I can reproduce it.
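
The DBROOT reassignment described above can be modelled in a few lines. This is a toy sketch for illustration only, not the OAM code: the function names fail_parent and recover_module are hypothetical, and the rule that the Active Parent hands every DBROOT other than DBROOT 1 to the recovered module is an assumption generalized from the three-PM example in this comment.

```python
def fail_parent(assignments, parent, standby):
    """Parent goes down: the standby becomes Active Parent and takes
    over every DBROOT the failed parent held (including DBROOT 1)."""
    updated = {pm: list(roots) for pm, roots in assignments.items()}
    updated[standby] = sorted(updated[standby] + updated[parent])
    updated[parent] = []
    return updated, standby


def recover_module(assignments, recovered, parent):
    """Recovered module rejoins: the Active Parent keeps DBROOT 1 and
    hands its other DBROOTs to the recovered module (assumed rule)."""
    updated = {pm: list(roots) for pm, roots in assignments.items()}
    handed_over = [r for r in updated[parent] if r != 1]
    updated[parent] = [r for r in updated[parent] if r == 1]
    updated[recovered] = sorted(updated[recovered] + handed_over)
    return updated


if __name__ == "__main__":
    start = {"pm1": [1], "pm2": [2], "pm3": [3]}
    down, parent = fail_parent(start, "pm1", "pm2")
    print(down)    # {'pm1': [], 'pm2': [1, 2], 'pm3': [3]}
    print(recover_module(down, "pm1", parent))
    # {'pm1': [2], 'pm2': [1], 'pm3': [3]}
```

Both functions copy the mapping rather than mutating it, so the before and after states can be compared side by side.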

Comment by David Hill (Inactive) [ 2018-07-25 ]

Couldn't reproduce the issue. The process below shows the DBRM files before the outage, with pm1 down, and after recovery. I didn't see any DBRM file issue, and the system recovered with pm1 up and running again.

-- Disk BRM Data files --

total 16
-rw-rw-rw- 1 mariadb-user mariadb-user   72 Jul 11 09:52 BRM_saves_current
-rw-rw-rw- 1 mariadb-user mariadb-user 3436 Jul 11 09:52 BRM_saves_em
-rw-rw-rw- 1 mariadb-user mariadb-user    0 Jul 11 09:52 BRM_saves_journal
-rw-rw-rw- 1 mariadb-user mariadb-user   12 Jul 11 09:52 BRM_saves_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user    8 Jul 11 09:52 BRM_saves_vss
-rw-rw-r-- 1 mariadb-user mariadb-user    0 Jul 11 09:52 SMTxnID

################# cat /home/mariadb-user/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves_current #################

/home/mariadb-user/mariadb/columnstore/data1/systemFiles/dbrm/BRM_saves

---------------------------------

new system install - dbrm files from PM1 / DBROOT1

ll
total 2100
-rw-rw-rw- 1 mariadb-user mariadb-user    3436 Jul 25 20:04 BRM_savesA_em
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_savesA_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_savesA_vss
-rw-rw-rw- 1 mariadb-user mariadb-user    3372 Jul 25 20:04 BRM_savesB_em
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_savesB_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_savesB_vss
-rw-rw-r-- 1 mariadb-user mariadb-user      72 Jul 25 20:04 BRM_saves_current
-rw-rw-rw- 1 mariadb-user mariadb-user    3436 Jul 25 20:04 BRM_saves_em
-rw-rw-rw- 1 mariadb-user mariadb-user       0 Jul 25 20:04 BRM_saves_journal
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_saves_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_saves_vss
-rw-rw-rw- 1 mariadb-user mariadb-user 2099202 Jul 25 20:03 oidbitmap
-rw-rw-r-- 1 mariadb-user mariadb-user      12 Jul 25 20:04 SMTxnID
[mariadb-user@ip-172-31-37-86 dbrm]$ PWD
-bash: PWD: command not found
[mariadb-user@ip-172-31-37-86 dbrm]$ pwd
/home/mariadb-user/mariadb/columnstore/data1/systemFiles/dbrm
[mariadb-user@ip-172-31-37-86 dbrm]$

PM1 STOP INSTANCE - FROM PM2 DBROOT 1,2 ASSIGNED AND ALL DBRM FILES EXIST

Performance Module (DBRoot) Storage Type = external
User Module Storage Type = internal
System Assigned DBRoot Count = 3
DBRoot IDs assigned to 'pm1' =
DBRoot IDs assigned to 'pm2' = 1, 2
DBRoot IDs assigned to 'pm3' = 3

ll
total 2104
-rw-rw-rw- 1 mariadb-user mariadb-user    3436 Jul 25 20:04 BRM_savesA_em
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_savesA_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_savesA_vss
-rw-rw-rw- 1 mariadb-user mariadb-user    3372 Jul 25 20:04 BRM_savesB_em
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_savesB_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_savesB_vss
-rw-rw-r-- 1 mariadb-user mariadb-user      72 Jul 25 20:04 BRM_saves_current
-rw-rw-rw- 1 mariadb-user mariadb-user    3436 Jul 25 20:04 BRM_saves_em
-rw-rw-rw- 1 mariadb-user mariadb-user       5 Jul 25 20:13 BRM_saves_journal
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_saves_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_saves_vss
-rw-rw-rw- 1 mariadb-user mariadb-user 2099202 Jul 25 20:03 oidbitmap
-rw-rw-r-- 1 mariadb-user mariadb-user      12 Jul 25 20:13 SMTxnID
[mariadb-user@ip-172-31-46-54 dbrm]$ pwd
/home/mariadb-user/mariadb/columnstore/data1/systemFiles/dbrm
[mariadb-user@ip-172-31-46-54 dbrm]$

PM1 START INSTANCE

Performance Module (DBRoot) Storage Type = external
User Module Storage Type = internal
System Assigned DBRoot Count = 3
DBRoot IDs assigned to 'pm1' = 2
DBRoot IDs assigned to 'pm2' = 1
DBRoot IDs assigned to 'pm3' = 3

Amazon EC2 Volume Name/Device Name/Amazon Device Name for DBRoot1: vol-0dbf71303b5d79d46, /dev/sdg, /dev/xvdg
Amazon EC2 Volume Name/Device Name/Amazon Device Name for DBRoot2: vol-0b3208991ca6310ad, /dev/sdh, /dev/xvdh
Amazon EC2 Volume Name/Device Name/Amazon Device Name for DBRoot3: vol-0d11a1953baea06b2, /dev/sdi, /dev/xvdi

[mariadb-user@ip-172-31-46-54 dbrm]$ home
[mariadb-user@ip-172-31-46-54 columnstore]$ cd dbrm
-bash: cd: dbrm: No such file or directory
[mariadb-user@ip-172-31-46-54 columnstore]$ dbrm
[mariadb-user@ip-172-31-46-54 dbrm]$ ll
total 2104
-rw-rw-rw- 1 mariadb-user mariadb-user    3436 Jul 25 20:04 BRM_savesA_em
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_savesA_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_savesA_vss
-rw-rw-rw- 1 mariadb-user mariadb-user    3372 Jul 25 20:04 BRM_savesB_em
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_savesB_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_savesB_vss
-rw-rw-r-- 1 mariadb-user mariadb-user      72 Jul 25 20:04 BRM_saves_current
-rw-rw-rw- 1 mariadb-user mariadb-user    3436 Jul 25 20:04 BRM_saves_em
-rw-rw-rw- 1 mariadb-user mariadb-user      10 Jul 25 20:20 BRM_saves_journal
-rw-rw-rw- 1 mariadb-user mariadb-user      12 Jul 25 20:04 BRM_saves_vbbm
-rw-rw-rw- 1 mariadb-user mariadb-user       8 Jul 25 20:04 BRM_saves_vss
-rw-rw-rw- 1 mariadb-user mariadb-user 2099202 Jul 25 20:03 oidbitmap
-rw-rw-r-- 1 mariadb-user mariadb-user      12 Jul 25 20:20 SMTxnID
[mariadb-user@ip-172-31-46-54 dbrm]$ ma getsystemi
getsysteminfo Wed Jul 25 20:22:21 2018

System 1.1.5-ebs

System and Module statuses

Component    Status                     Last Status Change
------------ -------------------------- ------------------------
System       ACTIVE                     Wed Jul 25 20:20:34 2018

Module um1   ACTIVE                     Wed Jul 25 20:20:14 2018
Module pm1   ACTIVE                     Wed Jul 25 20:19:12 2018
Module pm2   ACTIVE                     Wed Jul 25 20:12:13 2018
Module pm3   ACTIVE                     Wed Jul 25 20:03:56 2018

Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process            Module Status          Last Status Change       Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor     um1    ACTIVE          Wed Jul 25 20:03:25 2018 2607
ServerMonitor      um1    ACTIVE          Wed Jul 25 20:03:41 2018 3012
DBRMWorkerNode     um1    ACTIVE          Wed Jul 25 20:19:40 2018 10058
ExeMgr             um1    ACTIVE          Wed Jul 25 20:20:04 2018 10192
DDLProc            um1    ACTIVE          Wed Jul 25 20:20:17 2018 10557
DMLProc            um1    ACTIVE          Wed Jul 25 20:20:23 2018 10595
mysqld             um1    ACTIVE          Wed Jul 25 20:20:14 2018 10468

ProcessMonitor     pm1    ACTIVE          Wed Jul 25 20:18:57 2018 1006
ProcessManager     pm1    COLD_STANDBY    Wed Jul 25 20:19:02 2018
DBRMControllerNode pm1    COLD_STANDBY    Wed Jul 25 20:19:02 2018
ServerMonitor      pm1    ACTIVE          Wed Jul 25 20:19:05 2018 1285
DBRMWorkerNode     pm1    ACTIVE          Wed Jul 25 20:19:45 2018 1413
DecomSvr           pm1    ACTIVE          Wed Jul 25 20:19:10 2018 1327
PrimProc           pm1    ACTIVE          Wed Jul 25 20:19:58 2018 1440
WriteEngineServer  pm1    ACTIVE          Wed Jul 25 20:20:12 2018 1479

ProcessMonitor     pm2    ACTIVE          Wed Jul 25 20:03:27 2018 2099
ProcessManager     pm2    ACTIVE          Wed Jul 25 20:11:59 2018 2238
DBRMControllerNode pm2    ACTIVE          Wed Jul 25 20:19:37 2018 11653
ServerMonitor      pm2    ACTIVE          Wed Jul 25 20:12:07 2018 3682
DBRMWorkerNode     pm2    ACTIVE          Wed Jul 25 20:19:49 2018 11859
DecomSvr           pm2    ACTIVE          Wed Jul 25 20:12:11 2018 3776
PrimProc           pm2    ACTIVE          Wed Jul 25 20:19:59 2018 12008
WriteEngineServer  pm2    ACTIVE          Wed Jul 25 20:20:13 2018 12276

ProcessMonitor     pm3    ACTIVE          Wed Jul 25 20:03:28 2018 2001
ProcessManager     pm3    HOT_STANDBY     Wed Jul 25 20:13:01 2018 2854
DBRMControllerNode pm3    COLD_STANDBY    Wed Jul 25 20:19:36 2018
ServerMonitor      pm3    ACTIVE          Wed Jul 25 20:03:50 2018 2239
DBRMWorkerNode     pm3    ACTIVE          Wed Jul 25 20:19:53 2018 3422
DecomSvr           pm3    ACTIVE          Wed Jul 25 20:03:54 2018 2269
PrimProc           pm3    ACTIVE          Wed Jul 25 20:20:00 2018 3445
WriteEngineServer  pm3    ACTIVE          Wed Jul 25 20:20:14 2018 3491

Active Alarm Counts: Critical = 2, Major = 0, Minor = 0, Warning = 0, Info = 0
[mariadb-user@ip-172-31-46-54 dbrm]$

Comment by Developer [ 2018-07-25 ]

Hi David,
Thanks for responding.

I still can't understand the reason for the case below. Please review it.

We found that the system moved parentOAM to another PM and PM1 became disabled, but its dbroot was not moved. We also noticed that the database became read-only: it allows only SELECT operations, while "CREATE TABLE, UPDATE, INSERT, DELETE" stopped working. Why?

Also, can you check these 2 other questions?
2 > Can you please check the attached "post-configure-steps-followed.txt" to see whether we followed the proper steps to configure the system?

3 > Can you also provide details about which EBS volume type (gp2, io1, sc1, st1, standard) is best suited to a large amount of data? We have some tables with more than 50 million records.

Thanks.

Comment by Developer [ 2018-07-26 ]

Also, I want to know how you are generating the fail-over on PM1 (ParentOAM). We are stopping the PM1 instance from the AWS Console to generate the fail-over. Is there any problem with that?

Comment by Todd Stoffel (Inactive) [ 2021-04-05 ]

OAM has been deprecated and all of these old bash scripts were removed as part of a cleanup sweep that was done recently.

Generated at Thu Feb 08 02:29:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.