[MCOL-1370] Network error incorrectly handled: Amazon DBROOT detach failed, but dbroot was still reassigned Created: 2018-04-26 Updated: 2023-10-26 Resolved: 2018-06-13 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | ? |
| Affects Version/s: | 1.1.4 |
| Fix Version/s: | 1.1.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | David Hill (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
amazon ec2 with ebs storage |
||
| Sprint: | 2018-11, 2018-12 |
| Description |
|
Issue reported by a customer. The analysis shows that the system was left in a bad state because a detach failed, but the dbroot still got reassigned.

Here is what I see: there was a network issue where pm5 wasn't responding to pings from pm1, so pm1 went into a failover state. But since pm5 was still up and some of the ColumnStore processes were still running and had access to the DBROOT, the volume failed to detach. I don't know if you have anything on your side to look at regarding network issues between pm1/pm5 on Apr 25 13:41:26.

[pm1 logs showing that pm5 wasn't responding to pings, which initiated a module-down failover]
[Failure to detach DBROOT 5 from pm5]
[pm5 logs: pm5 was basically idle as far as the ColumnStore logs show, no active cpimports]
[These show that 3 seconds after the module stopped responding to pings, it received messages from pm1] |
| Comments |
| Comment by David Hill (Inactive) [ 2018-05-04 ] |
|
The problem that caused this issue was in the network, not on the pm5 module itself. So the module fault-tolerance logic needs to be improved to detect the difference, and also to handle the case where a storage device (EBS in this case) fails to unmount from the detected bad node: it should not be left mounted on another module, like pm1 in this case.

Here is some additional information reported by the customer:
> Yes – it looks like just after ColumnStore noticed connectivity loss, the kernel noticed a module failure for the elastic networking adapter on pm5:
> Apr 25 13:42:53 mcs1-pm5.us-west-2.prod.sasia.local kernel: ena 0000:00:03.0 eth0: Keep alive watchdog timeout. |
| Comment by David Thompson (Inactive) [ 2018-05-22 ] |
|
I think a specific action we can take is to fix the bug that results in the failover node pm1 thinking it should have dbroot5 when it doesn't. The correct behavior, as I see it, is that if the volume can't be mounted, we raise a critical error/alarm stating that and leave dbroot5 with pm5 from a metadata point of view. If it is just a network adapter/network outage, then it will recover when that is resolved. I don't know whether, in general, pm1 can distinguish a network problem from an instance problem. |
| Comment by David Hill (Inactive) [ 2018-05-22 ] |
|
I agree with DT's last input.
1. We can definitely fix the logic where the dbroot gets assigned to pm1 even though the unmount or detach failed, leaving the system in a bad state.
2. procmgr just pings the instance; when it gets 3 ping failures in a row, it takes action. From that alone it cannot determine whether the node was actually down or there was a network issue reaching that node. Maybe the failed detach could be taken into account, telling procmgr that since the detach failed, the node is still active and a process still has the disk mounted; in that case, don't process the failover as if the node were down. But that would be just guessing, at best. |
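The three-strikes ping check described in point 2 can be sketched as follows. This is a minimal illustration of the detection logic only, not ColumnStore's actual ProcMgr code; the function name and threshold constant are assumptions based on this comment.

```python
# Sketch of ping-based node monitoring: ProcMgr declares a module failed
# only after three consecutive missed pings. As the comment notes, this
# check alone cannot tell a dead node from an unreachable one.

PING_FAILURE_THRESHOLD = 3  # three failed pings in a row trigger failover


def monitor_module(ping_results):
    """Given an iterable of per-interval ping outcomes (True = reply
    received), return the interval index at which failover would be
    triggered, or None if the threshold is never crossed."""
    consecutive_failures = 0
    for i, ok in enumerate(ping_results):
        if ok:
            consecutive_failures = 0  # any reply resets the counter
        else:
            consecutive_failures += 1
            if consecutive_failures >= PING_FAILURE_THRESHOLD:
                return i  # failover would start here
    return None


# A transient one-ping blip does not trigger failover:
print(monitor_module([True, False, True, True]))          # None
# Three misses in a row do, even if the node is merely unreachable
# (a network fault) rather than actually down:
print(monitor_module([True, False, False, False, True]))  # 3
```

The limitation discussed above falls out directly: the monitor sees only ping outcomes, so a NIC failure on the node and a crashed node look identical at index 3.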
| Comment by David Hill (Inactive) [ 2018-05-30 ] |
|
First change for this issue:
1. Change the auto-failover in ProcMgr: try to detach the EBS volume from the reported-down module. If the detach fails, leave that DBROOT assigned to that module and mark the module AUTO_OFFLINE. ProcMgr will then wait until the module comes back online and bring it back into the system.
This fixes one of the issues: the DBROOT being reassigned to another PM but then failing to get detached and mounted successfully, leaving the system in a bad state. With this change, the system can recover automatically. |
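The decision flow described above can be sketched as follows. This is a hypothetical illustration of the new behavior, not the actual ProcMgr change; the `Module` type, function names, and the hard-coded failover parent `pm1` are assumptions for the example.

```python
# Sketch of the changed failover rule: only reassign a DBROOT after its
# EBS volume has actually been detached from the suspect module. On a
# failed detach, the module keeps its DBROOT and is marked AUTO_OFFLINE
# so ProcMgr can bring it back when it reappears. detach_volume is a
# stand-in callable for the real EBS detach operation.
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    dbroot: int


def handle_module_failure(module, detach_volume):
    """Return (new_module_state, dbroot_owner) for a reported-down module."""
    if detach_volume(module.dbroot):
        # Detach succeeded: safe to reassign the DBROOT to a healthy PM.
        return "OFFLINE", "pm1"
    # Detach failed: the node is likely still alive (e.g. a network
    # fault), so leave the DBROOT with it and wait for it to return.
    return "AUTO_OFFLINE", module.name


# Detach succeeds -> DBROOT moves to the failover parent:
print(handle_module_failure(Module("pm5", 5), lambda dbroot: True))
# Detach fails -> module goes AUTO_OFFLINE and keeps its DBROOT:
print(handle_module_failure(Module("pm5", 5), lambda dbroot: False))
```

The key design point, per this comment, is that the detach result doubles as liveness evidence: a volume that refuses to detach implies something on the "failed" node is still holding it, so reassignment would split-brain the DBROOT.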
| Comment by David Hill (Inactive) [ 2018-05-31 ] |
|
Used for testing: causes a ping failure while leaving the module up and running. |
| Comment by David Hill (Inactive) [ 2018-06-01 ] |
|
https://github.com/mariadb-corporation/mariadb-columnstore-engine/pull/488 |
| Comment by David Hill (Inactive) [ 2018-06-04 ] |
|
This issue was specific to an Amazon system setup with EBS external storage, so that is how it needs to be tested.

Test cases (system with 2 or more PMs):
[This simulates a down module]
[This simulates a network issue between pm1 and another PM]
|
| Comment by Daniel Lee (Inactive) [ 2018-06-08 ] |
|
Build tested: 1.1.5-1 ami mcs-1.1.5 (ami-2571365d)

With a 1um2pm stack, I stopped pm2, and dbroot 2 did get remounted on pm1. But when I started pm2, the stack remained in this state:

mcsadmin> getprocessstatus
MariaDB ColumnStore Process statuses
Process          Module  Status        Last Status Change        Process ID
ProcessMonitor   pm1     ACTIVE        Fri Jun  8 15:24:35 2018  4314
ProcessMonitor   pm2     AUTO_OFFLINE  Fri Jun  8 15:33:28 2018

System Storage Configuration
Performance Module (DBRoot) Storage Type = external
Amazon EC2 Volume Name/Device Name for 'um1': vol-06ecc9b98ae685a4d, /dev/xvdf
Amazon EC2 Volume Name/Device Name/Amazon Device Name for DBRoot1: vol-03bb47240de1462a7, /dev/sdg, /dev/xvdg
mcsadmin>

[mariadb-user@ip-172-31-26-89 columnstore]$ sudo cat crit.log err.log

According to the creation date of the AMI, it was created after the last comment, so I assume it has the latest change. |
| Comment by David Hill (Inactive) [ 2018-06-11 ] |
|
Started up the system, and ProcMon didn't start on um1 and pm2.

The problem was that the rc.local service was disabled on um1/pm2. Made it active on Daniel's system. It looks like a change to the AMI itself needs to be made to make sure the service is enabled.

Process          Module  Status   Last Status Change        Process ID
ProcessMonitor   pm1     ACTIVE   Mon Jun 11 13:43:58 2018  925
ProcessMonitor   pm2     INITIAL

systemctl status rc-local.service
Jun 11 13:56:52 ip-172-31-20-232.us-west-2.compute.internal systemd[1]: Starting /etc/rc.d/rc.loca...

After updating rc-local on um1 and pm2 and rebooting, ProcMon started on both nodes:

Component    Status  Last Status Change
Module um1   ACTIVE  Mon Jun 11 14:00:15 2018
Active Parent OAM Performance Module is 'pm1'

MariaDB ColumnStore Process statuses
Process          Module  Status  Last Status Change        Process ID
ProcessMonitor   pm1     ACTIVE  Mon Jun 11 13:59:32 2018  926
ProcessMonitor   pm2     ACTIVE  Mon Jun 11 14:00:00 2018  922

Active Alarm Counts: Critical = 2, Major = 0, Minor = 0, Warning = 0, Info = 0

Retesting pm2 failover – after stopping pm2:

Component    Status  Last Status Change
Module um1   ACTIVE  Mon Jun 11 14:05:20 2018
Active Parent OAM Performance Module is 'pm1'

MariaDB ColumnStore Process statuses
Process          Module  Status        Last Status Change        Process ID
ProcessMonitor   pm1     ACTIVE        Mon Jun 11 13:59:32 2018  926
ProcessMonitor   pm2     AUTO_OFFLINE  Mon Jun 11 14:06:39 2018

Active Alarm Counts: Critical = 2, Major = 1, Minor = 0, Warning = 0, Info = 0

After starting pm2:

System and Module statuses
Component    Status  Last Status Change
Module um1   ACTIVE  Mon Jun 11 14:05:20 2018
Active Parent OAM Performance Module is 'pm1'

MariaDB ColumnStore Process statuses
Process          Module  Status  Last Status Change        Process ID
ProcessMonitor   pm1     ACTIVE  Mon Jun 11 13:59:32 2018  926
ProcessMonitor   pm2     ACTIVE  Mon Jun 11 14:07:58 2018  928

Active Alarm Counts: Critical = 2, Major = 0, Minor = 0, Warning = 0, Info = 0 |
| Comment by David Hill (Inactive) [ 2018-06-11 ] |
|
The rc-local service wasn't active on um1/pm2, so that is why ProcMon didn't restart after a failure in Daniel's test. Looking at a fix in the AMI. |
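A plausible shape for the AMI-side fix is the following provisioning fragment, assuming a RHEL/CentOS 7 image where ColumnStore start commands live in /etc/rc.d/rc.local. This is a hedged sketch, not the actual AMI change: on systemd, the rc-local unit is static and only runs at boot when /etc/rc.d/rc.local is executable, which matches the "service was disabled" symptom above.

```shell
# Hypothetical AMI provisioning step (RHEL/CentOS 7 assumed): make
# rc.local executable on every node so the static rc-local unit runs
# at boot and restarts ProcMon after a failure.
chmod +x /etc/rc.d/rc.local

# Run it now without waiting for a reboot, then confirm its state.
systemctl start rc-local.service
systemctl status rc-local.service
```

Baking the `chmod` into the image (rather than fixing nodes by hand, as was done on Daniel's system) ensures every node launched from the AMI restarts ProcMon automatically.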
| Comment by David Hill (Inactive) [ 2018-06-11 ] |
|
Created a new test 1.1.5 AMI with the rc-local service activated on all nodes. Retested pm2 failover and it all worked. Daniel can retest; it also worked with a micro instance. |
| Comment by David Hill (Inactive) [ 2018-06-11 ] |
|
A new AMI for 1.1.5 is created and was working for me. Daniel is testing now; if it works for Daniel, I'll assign the JIRA back to QA. |
| Comment by Daniel Lee (Inactive) [ 2018-06-13 ] |
|
Build tested: 1.1.5-1 ami (ami-1a541762)
Test scenario #1 (failover due to pm2 being stopped)
Test scenario #2 (NIC on pm2 is down) |
| Comment by Daniel Lee (Inactive) [ 2018-06-13 ] |
|
The failover issue was not related to 1.1.5-1; it was an AMI issue. It is being tracked by |