[MCOL-1466] handling multi server columnstore failover Created: 2018-06-12  Updated: 2021-04-05  Resolved: 2021-04-05

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: 1.1.4
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Jewel Majumder Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Environment:

Linux


Attachments: Text File UM1_system_start_report.txt     File columnstoreSupportReport.mariadbcolumnstore.tar.gz     File columnstoreSupportReport.mycolumnstore.tar.gz     Text File post-configure-steps-followed.txt    
Issue Links:
PartOf
includes MCOL-1572 About our ParentOAM failure handling ... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
MCOL-1572 About our ParentOAM failure handling ... Sub-Task Closed  

 Description   

We have a multi-server ColumnStore system (1 UM, 3 PM). When we configured the system for the first time, everything seemed fine. But after some time, PM1 (our ParentOAM) failed (the instance stopped). As per failover handling, it transferred dbroot 1 to PM2 and set PM2 as ParentOAM. PM2 then had two dbroots attached, 1 and 2. Up to this point the system was working fine and queries were also running on UM1.

When we noticed PM1 had stopped, we started the instance from the AWS console, and after that everything got messy. It automatically attached dbroot 2 to PM1. We tried multiple times to shut down the system and start it again, but it failed on UM1, as shown in the attached "UM1_system_start_report.txt":

Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.515921 |0|0|0| E 18 CAL0000: Error: getDBRMdata failed
Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.516521 |0|0|0| I 18 CAL0000: STARTALL: ACK back to ProcMgr, return status = 8
Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.516577 |0|0|0| D 18 CAL0000: Send SET Alarm ID 27 on device DBRM
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.516996 |0|0|0| I 18 CAL0000: MSG RECEIVED: Start All process request...
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543707 |0|0|0| D 18 CAL0000: checkSpecialProcessState status return : 2
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543761 |0|0|0| D 18 CAL0000: STARTING Process: DBRMWorkerNode
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543784 |0|0|0| D 18 CAL0000: Process location: /usr/local/mariadb/columnstore/bin/workernode
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.545888 |0|0|0| D 18 CAL0000: getLocalDBRMID Worker Node ID = 2
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.547583 |0|0|0| D 18 CAL0000: getDBRMdata called
Jun 12 06:26:07 ip-172-31-7-171 messagequeue[7787]: 07.556991 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed

It just keeps looping on this error again and again. When I checked port 8604, I saw the entries below, but they were all removed when I shut down the system.

netstat -a | grep 8604
tcp 0 0 ip-172-31-7-5.ec2:47186 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46932 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47208 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46768 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47022 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46758 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47232 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47178 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47118 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47244 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47032 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47266 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46842 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46852 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47108 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47196 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:47220 ip-172-31-7-211.ec:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5.ec2:46922 ip-172-31-7-211.ec:8604 TIME_WAIT
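For what it's worth, TIME_WAIT is a normal kernel state after a TCP close and clears on its own; a large, growing count against one port suggests a reconnect loop rather than stuck connections. A quick way to count such sockets is an awk filter over the netstat output. This is a generic diagnostic sketch (not ColumnStore-specific), demonstrated here against a captured sample so it runs without a live system; the port 8604 matches the ProcMgr port in this install:

```shell
# Live usage would be:
#   netstat -an | awk -v port=8604 '$NF == "TIME_WAIT" && $5 ~ (":" port "$")' | wc -l
# Here we filter a captured sample instead of live netstat output.
sample='tcp 0 0 ip-172-31-7-5:47186 ip-172-31-7-211:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5:46932 ip-172-31-7-211:8604 TIME_WAIT
tcp 0 0 ip-172-31-7-5:52100 ip-172-31-7-211:3306 ESTABLISHED'
# $5 is the remote address:port column, $NF is the socket state.
count=$(printf '%s\n' "$sample" |
    awk -v port=8604 '$NF == "TIME_WAIT" && $5 ~ (":" port "$")' |
    wc -l | tr -d ' ')
echo "TIME_WAIT sockets to port 8604: $count"
```

A count that keeps climbing while the system loops on the start error is consistent with the ProcessMonitor retry loop in the log above.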

I tried several different solutions, such as disabling pm1 again, assigning dbroot 2 to pm2 again, etc., but nothing worked. So finally I tried to reconfigure the system, but that did not work either. I have attached the report files generated by /usr/local/mariadb/columnstore/bin/columnstoreSupport.

Please provide a solution for this error. I also need proper steps to recover the system when any PM fails and its dbroot is transferred to another PM (please cover both cases: 1) the failed PM is the ParentOAM, 2) the failed PM is not the ParentOAM).



 Comments   
Comment by David Hill (Inactive) [ 2018-06-12 ]

Here are a couple of things to try... I am not sure what state the pm1 module is in at this time, so let's try a few things.

If pm2 is still active, then from pm2:

  1. mcsadmin
    > stopsystem y // it might fail, but just continue
    > altersystem-enable pm1 // in case pm1 is in a disabled state
    > move or assign dbroot 2 to pm1 // dbroot 1 always stays with the active parent module, which is pm2 now
    > startsystem

If that works and you want pm1 back as the active parent with dbroot 1, run from pm2:

  1. mcsadmin
    > switchParentOAMModule pm1

In case all else fails in getting the dbroots reassigned to the correct modules and the system active, you can always rerun postConfigure.

If pm2 is active:

  1. mcsadmin shutdownsystem y
    From pm1
  1. /usr/local/mariadb/columnstore/bin/postConfigure

If it reports a module is disabled, enable it.
At the dbroots step, assign dbroot 1 to pm1
and dbroot 2 to pm2.

Then just continue with the install; that should get the configuration back to what you want and bring the system up.

Comment by Developer [ 2018-06-13 ]

Hi David,
I am working with Jewel on the MariaDB integration.

This is the current system state.

mcsadmin getsystemi
getsysteminfo Wed Jun 13 07:16:37 2018

System mariadbcolumnstore

System and Module statuses

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Tue Jun 12 22:46:10 2018

Module um1 FAILED Tue Jun 12 22:30:36 2018
Module pm1 DEGRADED Tue Jun 12 22:36:40 2018
Module pm2 MAN_OFFLINE Tue Jun 12 22:30:41 2018
Module pm3 MAN_OFFLINE Tue Jun 12 22:30:46 2018

Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 INITIAL
ServerMonitor um1 INITIAL
DBRMWorkerNode um1 INITIAL
ExeMgr um1 INITIAL
DDLProc um1 INITIAL
DMLProc um1 INITIAL
mysqld um1 INITIAL

ProcessMonitor pm1 ACTIVE Tue Jun 12 22:25:20 2018 1299
ProcessManager pm1 ACTIVE Tue Jun 12 22:25:26 2018 1543
DBRMControllerNode pm1 AUTO_OFFLINE Tue Jun 12 22:36:40 2018
ServerMonitor pm1 ACTIVE Tue Jun 12 22:30:38 2018 7420
DBRMWorkerNode pm1 ACTIVE Tue Jun 12 22:30:38 2018 7451
DecomSvr pm1 ACTIVE Tue Jun 12 22:30:42 2018 7502
PrimProc pm1 ACTIVE Tue Jun 12 22:30:44 2018 7549
WriteEngineServer pm1 ACTIVE Tue Jun 12 22:30:45 2018 7569

ProcessMonitor pm2 INITIAL
ProcessManager pm2 INITIAL
DBRMControllerNode pm2 INITIAL
ServerMonitor pm2 INITIAL
DBRMWorkerNode pm2 INITIAL
DecomSvr pm2 INITIAL
PrimProc pm2 INITIAL
WriteEngineServer pm2 INITIAL

ProcessMonitor pm3 INITIAL
ProcessManager pm3 INITIAL
DBRMControllerNode pm3 INITIAL
ServerMonitor pm3 INITIAL
DBRMWorkerNode pm3 INITIAL
DecomSvr pm3 INITIAL
PrimProc pm3 INITIAL
WriteEngineServer pm3 INITIAL

Active Alarm Counts: Critical = 3, Major = 5, Minor = 0, Warning = 0, Info = 0

We are facing an issue on UM1, which is why the system is not working. We have already posted the UM1 error in the previous comment and attached the log.

Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.515921 |0|0|0| E 18 CAL0000: Error: getDBRMdata failed
Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.516521 |0|0|0| I 18 CAL0000: STARTALL: ACK back to ProcMgr, return status = 8
Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.516577 |0|0|0| D 18 CAL0000: Send SET Alarm ID 27 on device DBRM
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.516996 |0|0|0| I 18 CAL0000: MSG RECEIVED: Start All process request...
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543707 |0|0|0| D 18 CAL0000: checkSpecialProcessState status return : 2
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543761 |0|0|0| D 18 CAL0000: STARTING Process: DBRMWorkerNode
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.543784 |0|0|0| D 18 CAL0000: Process location: /usr/local/mariadb/columnstore/bin/workernode
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.545888 |0|0|0| D 18 CAL0000: getLocalDBRMID Worker Node ID = 2
Jun 12 06:26:02 ip-172-31-7-171 ProcessMonitor[7787]: 02.547583 |0|0|0| D 18 CAL0000: getDBRMdata called
Jun 12 06:26:07 ip-172-31-7-171 messagequeue[7787]: 07.556991 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed

Please help us fix the UM1 issue.

Comment by David Hill (Inactive) [ 2018-06-13 ]

OK, so it looks like pm1 is back as the parent module. I hope and assume pm2 has dbroot 2 assigned to it and that is all straightened out.

There are 2 things you need to try from pm1:

  1. mcsadmin
    > shutdownsystem y
    > startsystem // if not using ssh keys, you need to provide the user password, so it would be "startsystem 'password'"

Again, if this doesn't work:

// then rerun the postConfigure install from pm1

  1. mcsadmin shutdownsystem y
  2. .../bin/postConfigure
Comment by Developer [ 2018-06-13 ]

Hi David,

I already tried postConfigure; that did not work. After postConfigure the system does not start and gives the same error (getDBRMdata called
Jun 12 06:26:07 ip-172-31-7-171 messagequeue[7787]: 07.556991 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed).

Please review the logs which I have attached.

Thanks.

Comment by David Hill (Inactive) [ 2018-06-13 ]

Actually, I don't see any attached files on this MCOL...

Comment by Developer [ 2018-06-14 ]

UM1_system_start_report.txt columnstoreSupportReport.mariadbcolumnstore.tar.gz

Sorry, that was our mistake. We had attached them to another MCOL.

Comment by Developer [ 2018-06-18 ]

Hi David,

Are you working on this issue? any updates?

Comment by David Hill (Inactive) [ 2018-06-19 ]

Sorry, I was busy the past 4 or 5 days getting out a new CS release, 1.1.5.

Taking a look at the logs; some notes on the issues that are causing problems:

1. You are using NFS to mount external storage. This isn't a recommended way to use external storage, though I know you sometimes have to use what is available. NFS can be unreliable: updates can be slower, and depending on how NFS caching works, disk corruption can occur more often than with EXT-mounted storage. I also see that all PMs are set up with the same mounts for the dbroots. That can also cause data corruption when multiple nodes are mounted to the same disk at the same time.
2. Interesting pm1 status. That generally points to a network issue, but in this case procmgr is pinging its own server, so I am not sure why it would be showing DEGRADED status.
I would be curious whether a ping test from pm1 to itself reports any data loss or failures.
3. This critical log from pm1 isn't good. It points to possible DBRM corruption, i.e. in files that reside on dbroot #1:
Jun 12 06:26:14 ip-172-31-0-155 ProcessManager[5075]: 14.642043 |0|0|0| C 17 CAL0000: getDBRMData: DBRM data files error, current file exist without OIDBitmapFile
Jun 12 06:26:18 ip-172-31-0-155 controllernode[6350]: 18.404190 |0|0|0| C 29 CAL0000: Extent Map not empty and /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/oidbitmap no
t found. Setting system to read-only
So there is a loss of DBRM data files; this is what you have on your system. There is no oidbitmap file, which is what the critical log is reporting. I can't say whether this is NFS related or something else, but these files can't be recreated. If you perform maintenance backups, then it can be recovered from there.

– Disk BRM Data files –

total 28
-rwxr-xr-t 1 root root 64 Jun 12 05:56 BRM_saves_current
-rwxr-xr-t 1 root root 4460 Jun 12 05:56 BRM_saves_em
-rwxr-xr-t 1 root root 0 Jun 11 12:54 BRM_saves_journal
-rwxr-xr-t 1 root root 60 Jun 12 05:56 BRM_saves_vbbm
-rwxr-xr-t 1 root root 8 Jun 12 05:56 BRM_saves_vss
-rw-rw-r-- 1 root root 0 Jun 11 12:47 SMTxnID

This is what a set of DBRM files should look like

ll
total 2100
-rw-rw-rw-. 1 root root 3436 Jun 18 08:49 BRM_savesA_em
-rw-rw-rw-. 1 root root 12 Jun 18 08:49 BRM_savesA_vbbm
-rw-rw-rw-. 1 root root 8 Jun 18 08:49 BRM_savesA_vss
-rw-rw-rw-. 1 root root 3372 Jun 18 08:49 BRM_savesB_em
-rw-rw-rw-. 1 root root 12 Jun 18 08:49 BRM_savesB_vbbm
-rw-rw-rw-. 1 root root 8 Jun 18 08:49 BRM_savesB_vss
-rw-rw-r--. 1 root root 64 Jun 18 08:49 BRM_saves_current
-rw-rw-rw-. 1 root root 3436 Jun 18 08:49 BRM_saves_em
-rw-rw-rw-. 1 root root 10 Jun 19 10:43 BRM_saves_journal
-rw-rw-rw-. 1 root root 12 Jun 18 08:49 BRM_saves_vbbm
-rw-rw-rw-. 1 root root 8 Jun 18 08:49 BRM_saves_vss
-rw-rw-rw-. 1 root root 2099202 Jun 18 08:48 oidbitmap
-rw-rw-r--. 1 root root 12 Jun 19 10:43 SMTxnID
[root@ip-172-31-31-216 dbrm]# pwd
/usr/local/mariadb/columnstore/data1/systemFiles/dbrm

So the system will not come up due to the missing DBRM files.
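A minimal sketch of a check for that specific bad state (a non-empty extent map save file with no oidbitmap beside it) could look like the following. The path is the default from this install, and the function name is ours, not part of ColumnStore:

```shell
# Sketch: detect the "extent map present but oidbitmap missing" DBRM
# state that controllernode reports before forcing read-only mode.
check_dbrm() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "skip: $dir not found"
        return 0
    fi
    # BRM_saves_em non-empty means the extent map has content;
    # in that case oidbitmap must exist alongside it.
    if [ -s "$dir/BRM_saves_em" ] && [ ! -f "$dir/oidbitmap" ]; then
        echo "WARNING: $dir has an extent map but no oidbitmap"
        return 1
    fi
    echo "oidbitmap check passed for $dir"
}
# Default DBRM location for this root-based install:
check_dbrm /usr/local/mariadb/columnstore/data1/systemFiles/dbrm || true
```

If the check fires, the notes above apply: the oidbitmap cannot be regenerated, so the DBRM directory has to be restored from a maintenance backup.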

So the options are:
1. If you do maintenance backups, these files would be part of them. Follow the DB restore procedure to put the backups in place and start from there.
2. If you don't, then the DB is not recoverable. You would need to start with a fresh install.

Comment by Jewel Majumder [ 2018-07-12 ]

Hi David,

As per your suggestion, we have initialized all our Amazon instances with the "MariaDB-ColumnStore-1.1.5 - ami-a0c09edf" AMI. We have also added separate ext2 volumes for each PM module. We have a multi-server ColumnStore system (1 UM, 3 PM). When we configured the system for the first time, all seemed fine. Then, to check what would happen, we induced a failure on PM1 (instance stopped), which is our ParentOAM.
We found that the system moved the ParentOAM to another PM and PM1 became disabled, but its dbroot was not moved. We also noticed the database became read-only: only SELECT operations are allowed, while CREATE TABLE, UPDATE, INSERT, and DELETE stopped working. Why?

Please also help us find answers to the following queries.
1. Can you please let us know what the system behaviour should be when the ParentOAM fails?
2. Can you please check the attached "post-configure-steps-followed.txt" to confirm we followed the proper steps to configure the system?
3. Can you also advise which EBS volume type (gp2, io1, sc1, st1, standard) is best suited for large amounts of data? We have some tables with more than 50 million records.

We have attached the columnstoreSupport report.

The current system status is as below.

Component Status Last Status Change
------------ -------------------------- ------------------------
System ACTIVE Wed Jul 11 10:08:23 2018

Module um1 ACTIVE Wed Jul 11 09:45:53 2018
Module pm1 AUTO_DISABLED/DEGRADED Wed Jul 11 09:52:24 2018
Module pm2 DEGRADED Wed Jul 11 09:58:53 2018
Module pm3 ACTIVE Wed Jul 11 09:45:43 2018

Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Replication Feature is enabled

MariaDB ColumnStore Process statuses

Process Module Status Last Status Change Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor um1 ACTIVE Wed Jul 11 09:45:09 2018 15729
ServerMonitor um1 ACTIVE Wed Jul 11 09:45:28 2018 16138
DBRMWorkerNode um1 MAN_OFFLINE Wed Jul 11 09:53:18 2018
ExeMgr um1 ACTIVE Wed Jul 11 09:54:11 2018 20972
DDLProc um1 MAN_OFFLINE Wed Jul 11 09:54:30 2018
DMLProc um1 MAN_OFFLINE Wed Jul 11 09:54:42 2018
mysqld um1 ACTIVE Wed Jul 11 09:54:18 2018 21247

ProcessMonitor pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
ProcessManager pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
DBRMControllerNode pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
ServerMonitor pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
DBRMWorkerNode pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
DecomSvr pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
PrimProc pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
WriteEngineServer pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018

ProcessMonitor pm2 ACTIVE Wed Jul 11 09:45:11 2018 15035
ProcessManager pm2 ACTIVE Wed Jul 11 09:52:58 2018 15173
DBRMControllerNode pm2 AUTO_OFFLINE Wed Jul 11 09:58:53 2018
ServerMonitor pm2 ACTIVE Wed Jul 11 09:52:44 2018 16531
DBRMWorkerNode pm2 ACTIVE Wed Jul 11 09:53:27 2018 17060
DecomSvr pm2 ACTIVE Wed Jul 11 09:52:48 2018 16616
PrimProc pm2 ACTIVE Wed Jul 11 09:54:06 2018 17493
WriteEngineServer pm2 ACTIVE Wed Jul 11 09:54:20 2018 17726

ProcessMonitor pm3 ACTIVE Wed Jul 11 09:45:12 2018 29271
ProcessManager pm3 HOT_STANDBY Wed Jul 11 09:54:47 2018 30244
DBRMControllerNode pm3 COLD_STANDBY Wed Jul 11 09:52:50 2018
ServerMonitor pm3 ACTIVE Wed Jul 11 09:45:37 2018 29528
DBRMWorkerNode pm3 MAN_OFFLINE Wed Jul 11 09:53:57 2018
DecomSvr pm3 ACTIVE Wed Jul 11 09:45:41 2018 29570
PrimProc pm3 ACTIVE Wed Jul 11 09:54:07 2018 30144
WriteEngineServer pm3 ACTIVE Wed Jul 11 09:54:21 2018 30211

Active Alarm Counts: Critical = 2, Major = 3, Minor = 4, Warning = 0, Info = 0 columnstoreSupportReport.mycolumnstore.tar.gz post-configure-steps-followed.txt

Comment by Developer [ 2018-07-17 ]

Hi David,

Have you gone through the problems described above? Any updates?

Thanks.

Comment by Todd Stoffel (Inactive) [ 2021-04-05 ]

OAM has been deprecated and all of these old bash scripts were removed as part of a cleanup sweep that was done recently.

Generated at Thu Feb 08 02:28:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.