[MCOL-804] UM1 loses connection to PMs Created: 2017-07-06 Updated: 2017-11-27 Resolved: 2017-11-27 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | DDLProc, DMLProc, PrimProc, ProcMgr |
| Affects Version/s: | 1.0.8 |
| Fix Version/s: | Icebox |
| Type: | Bug | Priority: | Critical |
| Reporter: | Abhinav santi | Assignee: | Unassigned |
| Resolution: | Not a Bug | Votes: | 1 |
| Labels: | None | ||
| Environment: |
Centos7, 4 Performance Modules, 1 User Module |
||
| Attachments: |
|
| Description |
|
We had several issues with InfiniDB 4.6 where DML, DDL, PrimProc, and ExeMgr would not run or would lose connections, so the system was upgraded to MCS 1.0.8, but we still face the same issue: UM1 frequently loses connection with the PMs, or one of the PMs drops out of the cluster. Here are some logs from UM1:

Jul 5 17:42:52 ip- joblist[93803]: 52.693608 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.

Thanks in advance. |
| Comments |
| Comment by David Hill (Inactive) [ 2017-07-06 ] |
|
Can you run the support tool from um1 and provide its output? |
| Comment by David Hill (Inactive) [ 2017-07-07 ] |
|
The support report only had the um1 files, with no logs; maybe you only ran it with the local option set. We really need the logs from all nodes, especially pm1. Also, is logging set up? I'm not sure why the um1 logs didn't exist in the support report. I did see that the second attachment had the debug logs from um1, but we still need the logs from the other nodes, if any exist; if there aren't any, there's not much to go by. If logging isn't set up, you can run this script on each node, do a restartsystem, and see if logging starts up. Then, when the issue occurs again, provide the logs.
|
| Comment by Abhinav santi [ 2017-07-07 ] |
|
Thanks for the response, David. Sorry, I ran the support command locally last time. We have logging set up, and the latest report contains the logs for all the modules. Is there another way to share the attachment? I can't really upload it here. |
| Comment by David Hill (Inactive) [ 2017-07-07 ] |
|
You can put it on our FTP upload site; files may be uploaded to ftp://ftp.mariadb.com/uploads. Just let us know when it's uploaded, thanks. |
| Comment by Abhinav santi [ 2017-07-07 ] |
|
Uploaded the file to ftp://ftp.mariadb.com/uploads. Thanks |
| Comment by Abhinav santi [ 2017-07-10 ] |
|
To add to the investigation: UM1 has been using up all of its physical memory (30 GiB) and moving into swap space for large queries. Once swap usage reaches 80%, the system gets restarted and is unable to come up successfully afterwards. You can see this in the uploaded logs too. This is happening far too frequently in one of our newly deployed environments. Thanks! |
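For anyone reproducing this kind of investigation, the memory pressure described above can be watched with standard Linux tools; nothing below is ColumnStore-specific, and the threshold is only illustrative:

```shell
# Overall memory and swap picture on the UM node.
free -m

# Swap usage as a percentage, read from /proc/meminfo; prints a note
# instead of dividing by zero when no swap is configured.
awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2} END {if (t > 0) printf "swap used: %.1f%%\n", 100*(t-f)/t; else print "no swap configured"}' /proc/meminfo
```

Watching this output during a large query shows whether the 80% threshold mentioned above is being approached.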
| Comment by David Hill (Inactive) [ 2017-07-10 ] |
|
The system restart when swap usage reaches 80%+ is something the ColumnStore processes are doing. It exists because running at 100% swap usage can lock up the kernel/server, so this functionality is a safeguard. If you so choose, you can disable or change this behavior by editing this setting in the pm1 Columnstore.xml and doing a shutdown and startsystem. The default action is restartSystem; to disable it, change the value to 'none':

<SwapAction>restartSystem</SwapAction> <!-- OAM command (or 'none') to run when swap space exceeds Major Threshold -->

BUT if you are running queries that use ALL of the memory and the swap space, then either you need a more powerful server with a lot more memory/disk, or you need to change your queries so a single query doesn't need to touch that much data. Your query is trying to put so much data into temporary memory for processing that it overwhelms the system; that is why you are hitting swap in the first place, which is not recommended. We are more interested in the PM memory usage, so we will look into the support files you provided, but I think the UM memory/swap usage can be easily explained. |
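The change described above can be sketched as a shell one-liner; the Columnstore.xml path is an assumption based on a default root install, so verify it on your system, and stop the cluster before editing (mcsadmin shutdownsystem) and start it again afterwards (mcsadmin startsystem):

```shell
# Sketch only -- run on pm1 with the cluster stopped. CONFIG defaults to
# the assumed default root-install path; override it if yours differs.
CONFIG=${CONFIG:-/usr/local/mariadb/columnstore/etc/Columnstore.xml}

# Change SwapAction from the default 'restartSystem' to 'none' to
# disable the automatic restart when swap crosses the Major threshold.
sed -i 's|<SwapAction>restartSystem</SwapAction>|<SwapAction>none</SwapAction>|' "$CONFIG"
```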
| Comment by Abhinav santi [ 2017-07-10 ] |
|
I agree with restarting the system when swap reaches 100%, but the issue is that the system is unable to come up successfully after a restart. You can see this in the provided logs. Thank you so much for the quick response. |
| Comment by David Hill (Inactive) [ 2017-07-10 ] |
|
For some reason I'm getting errors when I try to decompress the log files:

tar zxvf var.pm1.tar.gz
gzip: stdin: not in gzip format |
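This kind of error is usually a corrupt or re-encoded upload. An archive can be sanity-checked locally before uploading with standard GNU tools (the filename below follows the one in the comment and is illustrative):

```shell
# Verify an archive really is valid gzip before uploading it.
ARCHIVE=${ARCHIVE:-var.pm1.tar.gz}

file "$ARCHIVE"          # should report "gzip compressed data"

# gzip -t decompresses to nowhere and reports any corruption.
if gzip -t "$ARCHIVE" 2>/dev/null; then
    echo "archive OK"
else
    echo "archive corrupt or not gzip" >&2
fi
```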
| Comment by Abhinav santi [ 2017-07-10 ] |
|
I'm not sure what happened there; it works on my system. I've uploaded a new file to the same location. Please have a look. |
| Comment by David Hill (Inactive) [ 2017-07-11 ] |
|
Ok, got the latest logs... they only show debug and info for some reason, but I got those. In the last restart due to the swap space issue on July 7, the debug logs indicate the system restarted successfully, but I don't see any DB activity on the um after the restart. Nothing in the logs points to any failure after the last restartsystem, so I'm wondering what exactly was not working?

Log from pm2 showing the restart due to swap space:
pm2_logs/log/mariadb/columnstore/debug.log:Jul 7 03:46 Swap Space usage over Major threashold:45 ip-172-30-3-96 ServerMonitor[22490]: 45.551476 |0|0|0| C 09 CAL0000: Swap Space usage over Major threashold, perform OAM command restartSystem

Logs from pm1 showing the start of the restartsystem and its successful completion:
Jul 7 03:46:03 ip-172-30-3-117 ProcessManager[2434]: 03.693790 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...
Jul 7 03:47:42 ip-172-30-3-117 ProcessManager[2434]: 42.754901 |0|0|0| D 17 CAL0000: Set System State = ACTIVE

Log from um1 showing its last entry at the restart, with nothing further until the system was restarted at 15:18:
Jul 7 03:47:42 ip-172-30-3-117 ProcessManager[2434]: 42.754901 |0|0|0| D 17 CAL0000: Set System State = ACTIVE |
| Comment by Abhinav santi [ 2017-07-11 ] |
|
Reason for the other missing logs: the err and crit.log files were owned by root, not the mariadb user. Is this a known issue? I have uploaded the new log file to the FTP server. Sorry for the confusion. Thanks |
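A hedged sketch of the ownership fix implied above; the log directory and the service account name are assumptions (substitute the user your ColumnStore processes actually run as, and the path your install logs to):

```shell
# Hypothetical fix for root-owned log files. LOGDIR and LOGUSER are
# assumptions -- check your install, e.g. with: ps -o user= -C PrimProc
LOGDIR=${LOGDIR:-/var/log/mariadb/columnstore}
LOGUSER=${LOGUSER:-mysql}

# Reassign the error/critical logs so the logger can write them again.
chown "$LOGUSER":"$LOGUSER" "$LOGDIR"/err.log "$LOGDIR"/crit.log 2>/dev/null \
    || echo "chown failed (run as root, and verify paths)" >&2
ls -l "$LOGDIR"/err.log "$LOGDIR"/crit.log 2>/dev/null
```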
| Comment by David Hill (Inactive) [ 2017-07-11 ] |
|
Actually, those logs aren't really needed to diagnose the state of the system after the restartsystem was done. Based on the debug and info logs shown previously, the system looked like it recovered correctly: it went back to an ACTIVE state. You had said that after the automatic restartSystem due to the swap space issue the system wasn't working, so I was wondering in what way it was not working; nothing in the logs showed anything failing. We aren't sure what is going on with the PMs and why they are using so much memory; maybe the other logs will show something. One question on the UM running out of memory: are you doing any cross-connect queries? Those can use a lot of UM memory that isn't controlled by the UmTotalMem setting. |
| Comment by Abhinav santi [ 2017-07-11 ] |
|
Right.. the debug logs and mcsadmin getSystemStatus/getSystemInfo show the system in the 'Active' state, but we are not able to connect to mysql until we do another manual restart from mcsadmin:

CAL0000: IDB-2039: Data file does not exist, please contact your system administrator for more information.

We are not using any cross-connect queries. |
| Comment by David Hill (Inactive) [ 2017-07-12 ] |
|
Ok, thanks for that info... That is an issue we have seen in the past that we thought had been resolved: after a system restart, or in some cases when a PrimProc restarts in a multi-node setup, ExeMgr gets out of sync with the PM/dbroot assignments, which causes ExeMgr to send requests to the wrong PM. The "file does not exist" error means the wrong PM received a request from ExeMgr to read a file from disk, but that file actually lives in a dbroot assigned to another PM. So this is a BUG that we need to investigate and fix. |
| Comment by Abhinav santi [ 2017-07-17 ] |
|
Thanks, David. I encountered the same issue this morning and found a different error message in the logs; hope this helps the investigation and a future fix. This is from the debug log on PM2:

writeengineserver[22058]: 01.216826 |0|0|0| C 32 CAL0000: WriteEngineServer : Exception caught on accept(): boost::thread_resource_error: Resource temporarily unavailable |
| Comment by David Thompson (Inactive) [ 2017-08-09 ] |
|
abhinav.santi Checking in on the original issue. There was a fix in 1.0.9 that may or may not have helped. Did you upgrade to 1.0.9 or 1.0.10 yet? Are you still seeing the issue? |
| Comment by Abhinav santi [ 2017-08-09 ] |
|
Hi David, below are logs from today's outage (pm1 critical logs):

Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.799909 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Connection reset by peer – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364 |
| Comment by David Thompson (Inactive) [ 2017-08-09 ] |
|
Do you also see memory alarms in the alarm log around the time of the restarts? I know this was identified earlier. |
| Comment by Abhinav santi [ 2017-08-09 ] |
|
Yes, I see a bunch of MODULE_DOWN, PROCESS_DOWN_MANUAL, and CONN_FAILURE alarms. |
| Comment by David Thompson (Inactive) [ 2017-08-09 ] |
|
Specifically, I mean memory alarms: if so, did you try disabling the auto restart? |
| Comment by Abhinav santi [ 2017-08-09 ] |
|
I got a machine with more memory and haven't seen any memory issues since. For some reason, though, PrimProc restarts and fails to bring up mysql on UM1. |
| Comment by David Thompson (Inactive) [ 2017-08-10 ] |
|
Ok, that is good news on the first point. You'll need to explain the second one a bit more, and maybe provide a new support report from this environment that covers a PrimProc restart. |
| Comment by Abhinav santi [ 2017-09-14 ] |
|
Hi David, I found the following doc on the upgrade procedure: https://mariadb.com/kb/en/library/mariadb-columnstore-software-upgrade-108-to-109/ How different is the upgrade procedure if we used the MCS AMI to install MCS in the first place? |
| Comment by David Thompson (Inactive) [ 2017-09-14 ] |
|
There should be no difference. You'll need to download the CentOS 7 binary install to your pm1, then follow the process for a non-root install. The AMI is just for fresh installs. |
| Comment by Abhinav santi [ 2017-09-14 ] |
|
Awesome. Thanks for the confirmation. |
| Comment by David Thompson (Inactive) [ 2017-11-27 ] |
|
Appears to have been caused by undersized physical memory for the data set / queries. |