[MCOL-1466] handling multi server columnstore failover Created: 2018-06-12 Updated: 2021-04-05 Resolved: 2021-04-05 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | N/A |
| Affects Version/s: | 1.1.4 |
| Fix Version/s: | N/A |
| Type: | Task | Priority: | Major |
| Reporter: | Jewel Majumder | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Linux |
||
| Description |
|
We have a multi-server ColumnStore system (1 UM, 3 PMs). When we first configured the system, everything seemed fine. After some time, PM1, which is our parentOAM, failed (the instance stopped). As part of failover handling, the system transferred dbroot 1 to PM2 and set PM2 as the parentOAM, so PM2 then had two dbroots attached (1 and 2). Up to this point the system was working fine, and queries were still running on UM1.

When we noticed that PM1 was stopped, we started the instance from the AWS console, and after that everything got messy. The system automatically attached dbroot 2 to PM1. We tried multiple times to shut the system down and start it again, but startup fails on UM1, as shown in the attached "UM1_system_start_report.txt":

Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.515921 |0|0|0| E 18 CAL0000: Error: getDBRMdata failed

It just keeps looping on this error. When I checked port 8604 I got these entries, but they were all removed when I shut down the system.

netstat -a | grep 8604

I tried various solutions, such as disabling pm1 again and assigning dbroot 2 to pm2 again, but nothing worked. Finally I tried to reconfigure the system using the command below, but that did not work either.

I have attached the report files generated by /usr/local/mariadb/columnstore/bin/columnstoreSupport.

Please provide a solution for this error. I also need the proper steps to recover the system when a PM fails and its dbroot is transferred to another PM. Please cover both cases: 1) the failed PM is the parentOAM; 2) the failed PM is not the parentOAM. |
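The port check in the description can also be done from another node. The following is a minimal sketch in Python, not part of ColumnStore: the function name `port_is_listening` is hypothetical, and the only detail taken from the ticket is port 8604. Unlike `netstat -a`, which lists every socket on the local machine, this only tells you whether something accepts TCP connections on that port.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests reachability of a
    listening socket and says nothing about which process owns the port.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

For example, `port_is_listening("pm2", 8604)` could be called from UM1, assuming the host name `pm2` resolves there. A `False` result does not distinguish between a stopped process, a firewall rule, and an unreachable host.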
| Comments |
| Comment by David Hill (Inactive) [ 2018-06-12 ] |
|
Here are a couple of things to try. I'm not sure what state the pm1 module is in at this time, so let's try a few things if pm2 is still active, and from pm2
If that works and you want pm1 back as the active one with dbroot 1, run from pm2
In a case where all else fails in getting the dbroots assigned back to the correct modules and the system active, you can always rerun postConfigure. If pm2 is active,
If it reports that a module is disabled, enable it. Then just continue with the install; that should get the configuration back to what you want and bring the system up. |
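Before reassigning dbroots it can help to write down which dbroot currently sits on which PM and compare that with the intended layout. The sketch below does only that bookkeeping in Python; it does not call any ColumnStore tool. The function name `dbroot_moves` is hypothetical, and the layout of one dbroot per PM (including dbroot 3 on pm3) is an assumption, since the ticket only states that PM2 held dbroots 1 and 2 after the failover.

```python
def dbroot_moves(expected, actual):
    """Return {dbroot: (current_pm, intended_pm)} for every misplaced dbroot.

    Both arguments map a PM name to the list of dbroot ids attached to it.
    current_pm is None when the dbroot is not attached anywhere in `actual`.
    """
    # Invert the actual layout: dbroot id -> PM that currently holds it.
    owner_now = {root: pm for pm, roots in actual.items() for root in roots}
    moves = {}
    for pm, roots in expected.items():
        for root in roots:
            if owner_now.get(root) != pm:
                moves[root] = (owner_now.get(root), pm)
    return moves


# Assumed intended layout: one dbroot per PM (dbroot 3 on pm3 is an assumption).
EXPECTED = {"pm1": [1], "pm2": [2], "pm3": [3]}
# State described in the ticket after PM1 failed: PM2 holds dbroots 1 and 2.
AFTER_FAILOVER = {"pm1": [], "pm2": [1, 2], "pm3": [3]}
```

With these inputs, `dbroot_moves(EXPECTED, AFTER_FAILOVER)` returns `{1: ('pm2', 'pm1')}`, meaning only dbroot 1 has to go back from pm2 to pm1.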
| Comment by Developer [ 2018-06-13 ] |
|
Hi David,

This is the current system state.

mcsadmin getsystemi
System mariadbcolumnstore
System and Module statuses
Component Status Last Status Change
Module um1 FAILED Tue Jun 12 22:30:36 2018
Active Parent OAM Performance Module is 'pm1'
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
ProcessMonitor pm1 ACTIVE Tue Jun 12 22:25:20 2018 1299
ProcessMonitor pm2 INITIAL
ProcessMonitor pm3 INITIAL
Active Alarm Counts: Critical = 3, Major = 5, Minor = 0, Warning = 0, Info = 0

We are facing an issue on UM1, which is why the system is not working. We already posted the UM1 error in the previous post and attached the log.

Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.515921 |0|0|0| E 18 CAL0000: Error: getDBRMdata failed

Please help us fix the UM1 issue. |
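Status dumps like the one above are easier to compare between runs when reduced to the rows that are not healthy. The following Python sketch extracts module and process rows from such text with regular expressions. The row shapes are inferred only from the output pasted in this ticket, the name `not_active` is hypothetical, and treating every status other than ACTIVE as a problem is a simplifying assumption.

```python
import re

# "Module um1 FAILED ..." -> (module, status)
MODULE_ROW = re.compile(r"\bModule\s+((?:um|pm)\d+)\s+([A-Z_]+)\b")
# "ProcessMonitor pm2 INITIAL ..." -> (process, module, status)
PROCESS_ROW = re.compile(
    r"\b(?!Module\b)([A-Za-z]+)\s+((?:um|pm)\d+)\s+([A-Z_]+)\b"
)

# Excerpt of the status text from this ticket, flattened onto one line.
SAMPLE = (
    "Component Status Last Status Change "
    "Module um1 FAILED Tue Jun 12 22:30:36 2018 "
    "Active Parent OAM Performance Module is 'pm1' "
    "Process Module Status Last Status Change Process ID "
    "ProcessMonitor pm1 ACTIVE Tue Jun 12 22:25:20 2018 1299 "
    "ProcessMonitor pm2 INITIAL ProcessMonitor pm3 INITIAL"
)


def not_active(text):
    """Return (modules, processes) whose status is anything but ACTIVE.

    modules is a dict {module: status}; processes is a list of
    (process, module, status) tuples in the order they appear.
    """
    modules = {m: s for m, s in MODULE_ROW.findall(text) if s != "ACTIVE"}
    processes = [row for row in PROCESS_ROW.findall(text) if row[2] != "ACTIVE"]
    return modules, processes
```

Calling `not_active(SAMPLE)` reports um1 as FAILED and the ProcessMonitor on pm2 and pm3 as INITIAL, which matches what the comment describes. The patterns work on both the flattened and the line-per-row form of the text.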
| Comment by David Hill (Inactive) [ 2018-06-13 ] |
|
OK, so it looks like pm1 is back as the parent module. I hope and assume pm2 has dbroot 2 assigned to it and that this is all straightened out. There are 2 things you need to try from pm1:
Again, if this doesn't work, then rerun the postConfigure install from pm1
|
| Comment by Developer [ 2018-06-13 ] |
|
Hi David, I already tried postConfigure, but that did not work. After postConfigure the system does not start; it gives the same error (getDBRMdata called Please review the logs which I have attached. Thanks. |
| Comment by David Hill (Inactive) [ 2018-06-13 ] |
|
Actually, I don't see any attached files on this MCOL. |
| Comment by Developer [ 2018-06-14 ] |
|
UM1_system_start_report.txt Sorry, that was our mistake. We had added it to another MCOL. |
| Comment by Developer [ 2018-06-18 ] |
|
Hi David, Are you working on this issue? Any updates? |
| Comment by David Hill (Inactive) [ 2018-06-19 ] |
|
Sorry, I was busy getting out a new CS release, 1.1.5, over the past 4 or 5 days. I'm taking a look at the logs now. Some of these are issues causing problems; others are just notes.
1. You are using NFS to mount external storage. This isn't a recommended way to use external storage, but I know sometimes you have to use what is available. NFS can be unreliable in the case where updates
– Disk BRM Data files –
total 28
This is what a set of DBRM files should look like
ll
So the system will not come up due to the missing DBRM files. So the options are |
| Comment by Jewel Majumder [ 2018-07-12 ] |
|
Hi David,

As per your suggestion, we have initiated all our Amazon instances with the "MariaDB-ColumnStore-1.1.5 - ami-a0c09edf" AMI. We have also added separate ext2 volumes for each PM module. We have a multi-server ColumnStore system (1 UM, 3 PMs). When we first configured the system, everything seemed fine. Then we tested what happens when PM1, which is our parentOAM, fails (instance stopped). Please also help me find answers to some queries, as below. We have attached the columnstoreSupport report with this. The current system status is as below.

Component Status Last Status Change
Module um1 ACTIVE Wed Jul 11 09:45:53 2018
Active Parent OAM Performance Module is 'pm2'
MariaDB ColumnStore Process statuses
Process Module Status Last Status Change Process ID
ProcessMonitor pm1 AUTO_OFFLINE Wed Jul 11 09:52:34 2018
ProcessMonitor pm2 ACTIVE Wed Jul 11 09:45:11 2018 15035
ProcessMonitor pm3 ACTIVE Wed Jul 11 09:45:12 2018 29271
Active Alarm Counts: Critical = 2, Major = 3, Minor = 4, Warning = 0, Info = 0

columnstoreSupportReport.mycolumnstore.tar.gz |
| Comment by Developer [ 2018-07-17 ] |
|
Hi David, Have you gone through the problems described above? Any updates? Thanks. |
| Comment by Todd Stoffel (Inactive) [ 2021-04-05 ] |
|
OAM has been deprecated and all of these old bash scripts were removed as part of a cleanup sweep that was done recently. |