[MCOL-1465] handling multi server columnstore failover Created: 2018-06-12 Updated: 2020-11-12 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 1.1.4 |
| Fix Version/s: | Icebox |
| Type: | Task | Priority: | Major |
| Reporter: | Jewel Majumder | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: | Linux |
| Attachments: |
|
| Description |
|
We have a multi-server ColumnStore system (1 UM, 3 PMs). When we first configured the system, everything worked fine. After some time, PM1, which was our parentOAM, failed (the instance stopped). As part of failover handling, the system transferred dbroot 1 to PM2 and made PM2 the parentOAM, so PM2 then had two dbroots attached (1 and 2). Up to this point the system was still working fine, and queries were running on UM1.

When we noticed that PM1 was stopped, we started the instance from the AWS console, and after that everything got messy: the system automatically attached dbroot 2 to PM1. We tried multiple times to shut down the system and start it again, but startup fails on UM1, as shown in the attached "UM1_system_start_report.txt":

Jun 12 06:26:01 ip-172-31-7-171 ProcessMonitor[7787]: 01.515921 |0|0|0| E 18 CAL0000: Error: getDBRMdata failed

It keeps looping on this error again and again. When I checked port 8604 with the command below, I got entries, but all of them were removed when I shut down the system.

netstat -a | grep 8604

I tried several different workarounds, such as disabling PM1 again and assigning dbroot 2 back to PM2, but nothing worked. Finally, I tried to reconfigure the system using the command below, but that did not work either.

I have attached the report files generated by /usr/local/mariadb/columnstore/bin/columnstoreSupport.

Please provide a solution for this error. I also need the proper steps to recover the system when a PM fails and its dbroot is transferred to another PM. Please cover both cases:
1> The failed PM is the parentOAM.
2> The failed PM is not the parentOAM. |
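
To make the sequence of dbroot moves described above easier to follow, here is a minimal Python sketch that only models which dbroot is attached to which PM at each step. It is not ColumnStore code and calls no ColumnStore tool: the `move_dbroot` helper is hypothetical, and the initial layout (one dbroot per PM, with dbroot 3 on PM3) is an assumption, since the report only states where dbroots 1 and 2 were.

```python
def move_dbroot(assignments, dbroot, target_pm):
    """Return a new PM -> dbroot-list mapping with `dbroot` attached to `target_pm`.

    Hypothetical helper for illustration only; the input mapping is not modified.
    """
    # Detach the dbroot from whichever PM currently holds it.
    updated = {pm: [d for d in roots if d != dbroot]
               for pm, roots in assignments.items()}
    # Attach it to the target PM, keeping the list sorted for readability.
    updated[target_pm] = sorted(updated.get(target_pm, []) + [dbroot])
    return updated


# Assumed initial layout: one dbroot per PM (dbroot 3 on PM3 is an assumption).
initial = {"pm1": [1], "pm2": [2], "pm3": [3]}

# Step 1: PM1 (parentOAM) stops; failover moves dbroot 1 to PM2.
after_failover = move_dbroot(initial, 1, "pm2")

# Step 2: PM1 is restarted from the AWS console; dbroot 2 gets attached to PM1.
after_restart = move_dbroot(after_failover, 2, "pm1")

print("after failover:", after_failover)
print("after restart: ", after_restart)
```

Under these assumptions, the final state differs from the original one: dbroots 1 and 2 end up swapped between PM1 and PM2, which is the situation in which the startup error above appears.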