[MCOL-804] UM1 loses connection to PMs Created: 2017-07-06  Updated: 2017-11-27  Resolved: 2017-11-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: DDLProc, DMLProc, PrimProc, ProcMgr
Affects Version/s: 1.0.8
Fix Version/s: Icebox

Type: Bug Priority: Critical
Reporter: Abhinav santi Assignee: Unassigned
Resolution: Not a Bug Votes: 1
Labels: None
Environment:

Centos7, 4 Performance Modules, 1 User Module


Attachments: File columnstoreSupportReport1.tar.gz     File debug.log.tar.gz    

 Description   

We had several issues with InfiniDB 4.6 where DML, DDL, PrimProc, and ExeMgr would not run or would lose their connections, so the system was upgraded to MCS 1.0.8, but we still face the same issue: UM1 frequently loses its connection to the PMs, or one of the PMs drops out of the cluster. Here are some logs from UM1.

Jul 5 17:42:52 ip- joblist[93803]: 52.693608 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Jul 5 17:42:52 ip- joblist[93803]: 52.693658 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr
Jul 5 17:43:13 ip- joblist[93803]: 13.719869 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Jul 6 03:52:57 ip- joblist[47893]: 57.817911 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Jul 6 03:52:57 ip- joblist[47893]: 57.817962 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr
Jul 6 03:52:58 ip- joblist[66076]: 58.901732 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to pm1
Jul 6 03:52:59 ip- joblist[66076]: 59.281593 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to pm2
Jul 6 03:52:59 ip- joblist[66076]: 59.325477 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to pm3
Jul 6 03:52:59 ip- joblist[66076]: 59.555777 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to pm4
Jul 6 03:53:21 ip- joblist[47893]: 21.153464 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Jul 6 03:53:21 ip- joblist[47893]: 21.153500 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Jul 6 03:53:21 ip- mysqld[47893]: 21.153521 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr
Jul 6 03:53:45 ip- joblist[47893]: 45.174945 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Jul 6 03:54:19 ip- joblist[66076]: 19.532358 |2147483648|0|0| C 05 CAL0000: IDB-2023: PrimProc is not running (or connection to PrimProc dropped).
Jul 6 03:54:19 ip- joblist[66076]: 19.532434 |2147483648|0|0| C 05 CAL0000: IDB-2023: PrimProc is not running (or connection to PrimProc dropped).
Jul 6 03:54:19 ip- joblist[66076]: 19.533035 |2147483648|0|0| C 05 CAL0000: IDB-2023: PrimProc is not running (or connection to PrimProc dropped).
Jul 6 03:54:19 ip- joblist[66076]: 19.533362 |2147483648|0|0| C 05 CAL0000: IDB-2023: PrimProc is not running (or connection to PrimProc dropped).
Jul 6 03:54:19 ip- joblist[66076]: 19.533428 |2147483648|0|0| C 05 CAL0000: IDB-2023: PrimProc is not running (or connection to PrimProc dropped).

Thanks in advance.



 Comments   
Comment by David Hill (Inactive) [ 2017-07-06 ]

Can you run the support tool from um1 and provide its output?

https://mariadb.com/kb/en/mariadb/system-troubleshooting-mariadb-columnstore/#mariadb-columnstore-support-tool

Comment by David Hill (Inactive) [ 2017-07-07 ]

The support report only had the um1 files, with no logs. Maybe you ran it with the local option set. We really need the logs from all nodes, especially pm1.

drwxr-xr-x  3 dhill dhill   4096 Jul 6 16:32 ./
drwxr-xr-x 14 dhill dhill  20480 Jul 7 09:26 ../
drwxr-xr-x  3 dhill dhill   4096 Jul 6 16:42 tar/
-rw-r--r--  1 dhill dhill  64769 Jul 6 16:26 um1_configReport.txt
-rw-r--r--  1 dhill dhill  76467 Jul 6 16:30 um1_dbmsReport.txt
-rw-r--r--  1 dhill dhill  18067 Jul 6 16:28 um1_hardwareReport.txt
-rw-r--r--  1 dhill dhill    766 Jul 6 16:23 um1_logReport.txt
-rw-r--r--  1 dhill dhill 434678 Jul 6 16:23 um1_resourceReport.txt
-rw-r--r--  1 dhill dhill   1693 Jul 6 16:22 um1_softwareReport.txt

Also, is logging set up? I'm not sure why the um1 logs didn't exist in the support report.

I did see that the second attachment had the debug logs from um1, but we still need the logs from the other nodes, if they exist. If they don't, there isn't much to go by.

If logging isn't set up, you can run this script on each node, do a restartsystem, and see if logging starts up. Then when the issue occurs again, provide the logs.

/usr/local/mariadb/columnstore/bin/syslogSetup.sh install

Comment by Abhinav santi [ 2017-07-07 ]

Thanks for the response David.

Sorry, I ran the support command locally last time.

We have logging set up and got the logs for all the modules in the latest report. Is there any other way to share the attachment? I can't really upload it here.

Comment by David Hill (Inactive) [ 2017-07-07 ]

You can put it on our FTP upload site.

Files may be uploaded to ftp://ftp.mariadb.com/uploads
Anonymous uploads are allowed; no login is necessary.

$ ftp
ftp> open ftp.mariadb.com
Connected to ftp01.mariadb.com.
220 Welcome to ftp.skysql.com. Anonymous logins allowed. (username: anonymous; password: your email address)
Name (ftp.mariadb.com: anonymous
331 Please specify the password.
Password:
230-Welcome to ftp.skysql.com.
ftp> cd uploads
ftp> put columnstoreSupportReport.tar.gz

Just let us know when it's uploaded, thanks.

Comment by Abhinav santi [ 2017-07-07 ]

Uploaded the file to ftp://ftp.mariadb.com/uploads.

Thanks

Comment by Abhinav santi [ 2017-07-10 ]

To add more to the investigation

UM1 has been using up all the physical memory (30 GiB) and moving into swap space for large queries. Once it reaches 80% of swap, the system gets restarted and is unable to come up successfully afterwards. You can find the same in the uploaded logs too.

This is happening far too frequently in one of our newly deployed environments.

Thanks!

Comment by David Hill (Inactive) [ 2017-07-10 ]

The system restart when swap usage reaches 80%+ is something the ColumnStore processes do deliberately. It is there because using 100% of swap space can lock up the kernel/server, so this functionality is a safeguard.

If you so choose, you can disable or change this behavior by editing this setting in the pm1 columnstore.xml and then doing a shutdownsystem and startsystem. The default action is restartSystem; to disable it, change the value to 'none'.

<SwapAction>restartSystem</SwapAction> <!-- OAM command (or 'none') to run when swap space exceeds Major Threshold -->

BUT if you are running queries that use ALL of the memory plus the swap space, then either you need a more powerful server with a lot more memory/disk, or you need to change your queries so that they don't access that much data in one query. Your query is trying to put so much data into temporary memory for processing that it overwhelms the system. That is why you are hitting swap space in the first place, which is not recommended.
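The threshold logic described in this comment can be sketched roughly as follows. This is a hypothetical illustration, not ColumnStore's actual ServerMonitor code; the 80% major threshold and the `restartSystem`/`none` action values mirror the defaults discussed above.

```python
# Hypothetical sketch of the swap-usage safeguard described above.
# Not ColumnStore's actual ServerMonitor code; the threshold and
# action names only mirror the defaults discussed in this comment.

MAJOR_THRESHOLD = 80  # percent of swap in use before acting

def swap_action(swap_total_kb, swap_free_kb, action="restartSystem"):
    """Return the OAM command to run, or None if under threshold."""
    if swap_total_kb == 0:
        return None  # no swap configured, nothing to monitor
    used_pct = 100.0 * (swap_total_kb - swap_free_kb) / swap_total_kb
    if used_pct >= MAJOR_THRESHOLD and action != "none":
        return action  # e.g. issue "restartSystem" via OAM
    return None

# 30 GiB of swap with only 5 GiB free -> ~83% used, over threshold
print(swap_action(30 * 1024 * 1024, 5 * 1024 * 1024))
```

Setting `<SwapAction>none</SwapAction>` corresponds to the `action="none"` branch: the threshold is still crossed, but no OAM command is issued.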

We are more interested in the PM usage, so we will look into the support files you provided. But I think the UM memory/swap usage can be easily explained.

Comment by Abhinav santi [ 2017-07-10 ]

I agree with restarting the system when swap reaches 100%, but the issue is that the system is unable to come up successfully after a restart. You can find the same in the provided logs.

Thank you so much for the quick response.

Comment by David Hill (Inactive) [ 2017-07-10 ]

For some reason I'm getting errors when I try to decompress the log files.
Also, those aren't the standard names the support tool uses when it creates them, so I'm not sure what is going on here. Let me know if you have any additional info about the files.

tar zxvf var.pm1.tar.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
dhill@srvhill01:~/Downloads/MCOL-804-columnstoreSupportReport$ tar zxvf var.pm2.tar.gz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
dhill@srvhill01:~/Downloads/MCOL-804-columnstoreSupportReport$

Comment by Abhinav santi [ 2017-07-10 ]

I'm not sure what happened there; it works on my system. I've uploaded a new file to the same location. Please have a look.

Comment by David Hill (Inactive) [ 2017-07-11 ]

OK, got the latest logs. They are only showing debug and info for some reason, but I got those.

Example from one of the nodes:

drwxrwxr-x 5 dhill dhill      4096 Jul 11 10:13 ./
drwxrwxr-x 3 dhill dhill      4096 Jul 11 09:44 ../
-rw-r--r-- 1 dhill dhill         0 Jul  7 10:19 activeAlarms
-rw-r--r-- 1 dhill dhill     84853 Jul  7 10:19 alarm.log
drwxrwxr-x 2 dhill dhill      4096 Mar 23 17:29 archive/
drwxr-xr-x 2 dhill dhill      4096 Mar 23 17:29 corefiles/
-rwxr-xr-x 1 dhill dhill 764379249 Jul  7 10:21 debug.log*
-rwxr-xr-x 1 dhill dhill 152765465 Jul  7 10:21 info.log*
-rw-r--r-- 1 dhill dhill        78 Jul  7 10:18 load_brm.log1
-rw-r--r-- 1 dhill dhill       191 Jul  7 10:18 reset_locks.log1
-rw-r--r-- 1 dhill dhill        81 Jul  7 10:18 save_brm.log1
drwxrwxr-x 2 dhill dhill      4096 Apr 12 16:00 trace/
-rw-rw-r-- 1 dhill dhill     22845 Jul  7 10:27 uiCommands.log

In the last restart due to the swap space issue on July 7, from the debug logs it looks like the system successfully restarted, but I don't see any DB activity on the um after the restart. There is nothing in the logs related to any failures after the last restartSystem, so I'm just wondering what was not working?

Log from pm2 showing the restart due to swap space:

pm2_logs/log/mariadb/columnstore/debug.log:Jul 7 03:46 Swap Space usage over Major threashold:45 ip-172-30-3-96 ServerMonitor[22490]: 45.551476 |0|0|0| C 09 CAL0000: Swap Space usage over Major threashold, perform OAM command restartSystem

Logs from pm1 showing the start of the restartsystem and its successful completion:

Jul 7 03:46:03 ip-172-30-3-117 ProcessManager[2434]: 03.693790 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...

Jul 7 03:47:42 ip-172-30-3-117 ProcessManager[2434]: 42.754901 |0|0|0| D 17 CAL0000: Set System State = ACTIVE
Jul 7 03:47:42 ip-172-30-3-117 ProcessMonitor[1957]: 42.755089 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = ACTIVE
Jul 7 03:47:42 ip-172-30-3-117 ProcessManager[2434]: 42.757719 |0|0|0| D 17 CAL0000: startSystemThread Exit
Jul 7 15:18:07 ip-172-30-3-117 ProcessManager[2434]: 07.253780 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...

Logs from um1 showing its last entries at the restart, with nothing active until the system was restarted again at 15:18:

Jul 7 03:47:42 ip-172-30-3-117 ProcessManager[2434]: 42.754901 |0|0|0| D 17 CAL0000: Set System State = ACTIVE
Jul 7 03:47:42 ip-172-30-3-117 ProcessMonitor[1957]: 42.755089 |0|0|0| D 18 CAL0000: statusControl: REQUEST RECEIVED: Set System State = ACTIVE
Jul 7 03:47:42 ip-172-30-3-117 ProcessManager[2434]: 42.757719 |0|0|0| D 17 CAL0000: startSystemThread Exit
Jul 7 15:18:07 ip-172-30-3-117 ProcessManager[2434]: 07.253780 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...

Comment by Abhinav santi [ 2017-07-11 ]

Reason for the other missing logs: the err.log and crit.log files were owned by root, not mariadb-user. Is this a known issue?
I used the ColumnStore AMI to spin up the cluster.

I have uploaded the new log file to the FTP server. Sorry for the confusion.

Thanks

Comment by David Hill (Inactive) [ 2017-07-11 ]

Actually, those logs aren't really needed to diagnose the state of the system after the restartsystem was done. Based on the debug and info logs shown previously, the system looks like it recovered correctly: it went back to an ACTIVE state. You had said that after the auto restartSystem due to the swap space issue, the system wasn't working, so I was wondering in what way it was not working. Nothing in the logs showed anything failing.

We aren't sure what is going on with the PMs and why they are using so much memory; maybe the other logs will show something.

One question on the UM running out of memory: are you doing any cross-connect queries? These might be using a lot of UM memory that isn't controlled by the UmTotalMem setting.

Comment by Abhinav santi [ 2017-07-11 ]

Right. The debug logs and mcsadmin getSystemStatus/getSystemInfo show that the system is in the 'Active' state, but we are not able to connect to mysql until we do another manual restart from mcsadmin.
I found these entries in the crit and err logs from the latest outage.

CAL0000: IDB-2039: Data file does not exist, please contact your system administrator for more information.
4639517 Jul 9 11:47:09 ip-IDBFile[82765]: 09.373622 |0|0|0| D 35 CAL0002:Failed to open file: /000.dir/000.dir/012.dir/179.dir/000.dir/FILE003.cdf, exception: unable to open Unbuffered file

We are not using any cross connect queries.

Comment by David Hill (Inactive) [ 2017-07-12 ]

OK, thanks for that info. That is an issue we have seen in the past and thought had been resolved: after a system restart, or in some cases when a PrimProc restarts in a multi-node setup, ExeMgr gets out of sync with the PM/dbroot assignments, which causes ExeMgr to send requests to the wrong PM. So the file-not-found error means the wrong PM received the request from ExeMgr to read a file from disk, while that file actually lives on a dbroot assigned to another PM. This is a BUG that we need to investigate and fix further.
The work-around is to do a fresh system restart to clear it up.
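The out-of-sync failure mode described above can be illustrated with a toy model. All names and structures here are hypothetical for illustration only; the real ExeMgr/OAM bookkeeping is far more involved.

```python
# Toy illustration of the ExeMgr / dbroot-assignment desync described
# above. Names and structures are hypothetical, not ColumnStore code.

def route_read(dbroot, dbroot_to_pm):
    """Pick the PM that should serve a read for the given dbroot."""
    return dbroot_to_pm[dbroot]

# Assignments as OAM currently has them (e.g. after a restart/failover)
actual = {1: "pm1", 2: "pm2", 3: "pm3", 4: "pm4"}
# Stale copy cached by a component that missed the reassignment
stale = {1: "pm1", 2: "pm3", 3: "pm2", 4: "pm4"}

target = route_read(2, stale)  # request is sent to pm3
owner = actual[2]              # ...but dbroot 2 actually lives on pm2
if target != owner:
    # pm3 has no such file on its dbroots -> an IDB-2039-style
    # "Data file does not exist" error surfaces at the UM
    print(f"request routed to {target}, file is on {owner}")
```

A fresh restart (the work-around above) corresponds to rebuilding the stale map from the authoritative one, after which routing and ownership agree again.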

Comment by Abhinav santi [ 2017-07-17 ]

Thanks David.

I encountered the same thing this morning and found a different error message in the logs. Hope this helps with the investigation and a future fix.

This is from the debug log on PM2:

writeengineserver[22058]: 01.216826 |0|0|0| C 32 CAL0000: WriteEngineServer : Exception caught on accept(): boost::thread_resource_error: Resource temporarily unavailable
ProcessMonitor[1742]: 02.127629 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: WriteEngineServer, old PID = 22058
ProcessMonitor[1742]: 50.796453 |0|0|0| C 18 CAL0000: checkDataMount error, startmodule failed
ProcessManager[21885]: 09.018541 |0|0|0| C 17 CAL0000: module is down: pm1

Comment by David Thompson (Inactive) [ 2017-08-09 ]

abhinav.santi Checking in on the original issue. There was a fix in 1.0.9 that may or may not have helped this issue. Did you upgrade to 1.0.9 or 1.0.10 yet? Are you still seeing the issue?

Comment by Abhinav santi [ 2017-08-09 ]

Hi David,
We haven't upgraded to MCS 1.0.9/1.0.10 yet, but with the current 1.0.8 deployment I see a lot of random system restarts (apart from the high swap usage). All the PMs come up successfully after the restart, but mysql on UM1 fails to start with the error 'cannot connect to local server through socket'.

Below are logs from today's outage (pm1 crit logs):

Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.799909 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Connection reset by peer – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.800804 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Connection reset by peer – write from InetStreamSocket: sd: 12 inet: UM1 port: 44372
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.801368 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 12 inet: UM1 port: 44372
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.801577 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 12 inet: UM1 port: 44372
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.801721 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 12 inet: UM1 port: 44372
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.802192 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 12 inet: UM1 port: 44372
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.803785 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.803895 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.804835 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 12 inet: UM1 port: 44372
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.806566 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Broken pipe – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.807452 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Bad file descriptor – write from InetStreamSocket: sd: 12 inet: UM1 port: 44372
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.807798 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Bad file descriptor – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.808220 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Bad file descriptor – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.808952 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Bad file descriptor – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.809881 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Bad file descriptor – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.813268 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Bad file descriptor – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364
Aug 9 19:08:17 ip-PM1 PrimProc[990]: 17.813674 |0|0|0| C 28 CAL0000: InetStreamSocket::write error: Bad file descriptor – write from InetStreamSocket: sd: 11 inet: UM1 port: 44364

Comment by David Thompson (Inactive) [ 2017-08-09 ]

Do you also see memory alarms in the alarm log around the time of the restarts? I know this was identified earlier.

Comment by Abhinav santi [ 2017-08-09 ]

Yes. I see a bunch of MODULE_DOWN, PROCESS_DOWN_MANUAL, CONN_FAILURE alarms.

Comment by David Thompson (Inactive) [ 2017-08-09 ]

Specifically, I mean:
CRITICAL ALARM
MEMORY_USAGE_HIGH

If so, did you try disabling the auto restart?

Comment by Abhinav santi [ 2017-08-09 ]

I got a machine with more memory and haven't seen any memory issues since then.

For some reason PrimProc restarts and mysql fails to come back up on UM1.

Comment by David Thompson (Inactive) [ 2017-08-10 ]

OK, that is good news on the first point. You'll need to explain the second one a bit more, and maybe provide a new support report from this environment that covers a PrimProc restart.

Comment by Abhinav santi [ 2017-09-14 ]

Hi David, I found the following doc on the upgrade procedure. How different is the upgrade procedure if we used the MCS AMI to install MCS in the first place?

https://mariadb.com/kb/en/library/mariadb-columnstore-software-upgrade-108-to-109/

Comment by David Thompson (Inactive) [ 2017-09-14 ]

There should be no difference. You'll need to download the CentOS 7 binary install to your pm1 and then follow the process for a non-root install. The AMI is just for fresh installs.

Comment by Abhinav santi [ 2017-09-14 ]

Awesome. Thanks for the confirmation.

Comment by David Thompson (Inactive) [ 2017-11-27 ]

Appears to have been caused by undersized physical memory for the data set / queries.

Generated at Thu Feb 08 02:23:57 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.