[MCOL-1788] ExeMgr, DDLProc, DMLProc and PrimProc processes often go offline without a workload Created: 2018-10-10  Updated: 2022-11-05  Resolved: 2022-11-05

Status: Closed
Project: MariaDB ColumnStore
Component/s: DDLProc, DMLProc, ExeMgr, PrimProc
Affects Version/s: 1.1.6
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Nicola Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Environment:

VMware ESXi 6.7
2 UMs and 3 PMs running Oracle Linux 7


Attachments: File columnstoreSupportReport.DWH_CSTORE.tar.gz    

 Description   

Hello to all,
For months now the system has been unstable: after a few days it crashes for no apparent reason.
Even when the system has no workload (ETL jobs, etc.), it still crashes.
The system is practically unusable today.
I have applied all the best practices documented on the site.
I've uploaded the support report.

Please help me make this system stable.
Thanks,
Regards.
Nicola Battista



 Comments   
Comment by Nicola [ 2018-10-10 ]

MariaDB ColumnStore Process statuses

Process            Module Status          Last Status Change       Process ID
------------------ ------ --------------- ------------------------ ----------
ProcessMonitor     um1    ACTIVE          Wed Oct 10 05:01:14 2018 19879
ServerMonitor      um1    ACTIVE          Wed Oct 10 05:06:03 2018 97151
DBRMWorkerNode     um1    ACTIVE          Wed Oct 10 05:06:04 2018 97175
ExeMgr             um1    AUTO_OFFLINE    Wed Oct 10 05:05:53 2018
DDLProc            um1    AUTO_OFFLINE    Wed Oct 10 05:05:53 2018
DMLProc            um1    AUTO_OFFLINE    Wed Oct 10 05:05:53 2018
mysqld             um1    ACTIVE          Wed Oct 10 05:06:03 2018 97109

ProcessMonitor     um2    ACTIVE          Sun Oct 7 19:01:33 2018  21510
ServerMonitor      um2    ACTIVE          Sun Oct 7 19:02:20 2018  43160
DBRMWorkerNode     um2    ACTIVE          Wed Oct 10 05:01:21 2018 9068
ExeMgr             um2    MAN_OFFLINE     Wed Oct 10 05:01:39 2018
DDLProc            um2    COLD_STANDBY    Wed Oct 10 05:01:48 2018
DMLProc            um2    COLD_STANDBY    Wed Oct 10 05:01:49 2018
mysqld             um2    ACTIVE          Wed Oct 10 05:02:31 2018 9460

ProcessMonitor     pm1    ACTIVE          Tue Oct 2 10:33:27 2018  82477
ProcessManager     pm1    ACTIVE          Tue Oct 2 10:33:33 2018  83085
DBRMControllerNode pm1    ACTIVE          Wed Oct 10 05:01:18 2018 127594
ServerMonitor      pm1    ACTIVE          Tue Oct 2 10:34:11 2018  84542
DBRMWorkerNode     pm1    ACTIVE          Wed Oct 10 05:01:25 2018 127707
DecomSvr           pm1    ACTIVE          Tue Oct 2 10:34:15 2018  84749
PrimProc           pm1    AUTO_OFFLINE    Wed Oct 10 05:06:27 2018
WriteEngineServer  pm1    ACTIVE          Wed Oct 10 05:01:45 2018 128055

ProcessMonitor     pm2    ACTIVE          Tue Oct 2 10:33:58 2018  5100
ProcessManager     pm2    HOT_STANDBY     Wed Oct 10 05:06:01 2018 118654
DBRMControllerNode pm2    COLD_STANDBY    Tue Oct 2 10:34:19 2018
ServerMonitor      pm2    ACTIVE          Tue Oct 2 10:34:22 2018  5504
DBRMWorkerNode     pm2    ACTIVE          Wed Oct 10 05:01:30 2018 117720
DecomSvr           pm2    ACTIVE          Tue Oct 2 10:34:27 2018  5543
PrimProc           pm2    AUTO_OFFLINE    Wed Oct 10 05:06:41 2018
WriteEngineServer  pm2    ACTIVE          Wed Oct 10 05:01:46 2018 117789

ProcessMonitor     pm3    ACTIVE          Tue Oct 2 10:33:59 2018  4116
ProcessManager     pm3    COLD_STANDBY    Tue Oct 2 10:34:24 2018
DBRMControllerNode pm3    COLD_STANDBY    Tue Oct 2 10:34:24 2018
ServerMonitor      pm3    ACTIVE          Tue Oct 2 10:34:27 2018  4486
DBRMWorkerNode     pm3    ACTIVE          Wed Oct 10 05:01:35 2018 21371
DecomSvr           pm3    ACTIVE          Tue Oct 2 10:34:32 2018  4523
PrimProc           pm3    AUTO_OFFLINE    Wed Oct 10 05:06:34 2018
WriteEngineServer  pm3    ACTIVE          Wed Oct 10 05:01:47 2018 21429
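
A status listing like the one above can be filtered mechanically to show which processes are down. The following is a minimal Python sketch and not part of the original report. The helper name `offline_processes`, the set of statuses treated as healthy, and the assumed row layout (process, module, status, a five-token timestamp, then an optional PID) are assumptions based only on the rows shown here.

```python
import re

# Statuses treated as "expected" in this sketch. Assumption: standby states
# belong to a normal failover configuration and are not failures.
HEALTHY = {"ACTIVE", "HOT_STANDBY", "COLD_STANDBY"}

def offline_processes(listing):
    """Return (process, module, status) for every row whose status is not healthy."""
    rows = []
    for line in listing.splitlines():
        fields = line.split()
        # Data rows have at least: process, module, status + 5 timestamp tokens.
        # Header and separator rows fail the module-name check and are skipped.
        if len(fields) < 8 or not re.fullmatch(r"(um|pm)\d+", fields[1]):
            continue
        process, module, status = fields[0], fields[1], fields[2]
        if status not in HEALTHY:
            rows.append((process, module, status))
    return rows

# A few rows copied from the listing in this comment.
sample = """\
ProcessMonitor um1 ACTIVE Wed Oct 10 05:01:14 2018 19879
ExeMgr um1 AUTO_OFFLINE Wed Oct 10 05:05:53 2018
ExeMgr um2 MAN_OFFLINE Wed Oct 10 05:01:39 2018
DDLProc um2 COLD_STANDBY Wed Oct 10 05:01:48 2018
PrimProc pm1 AUTO_OFFLINE Wed Oct 10 05:06:27 2018
"""

for process, module, status in offline_processes(sample):
    print(f"{module}: {process} is {status}")
```

Because the rows are split on arbitrary whitespace, the sketch should work whether the columns are separated by single spaces or padded for alignment.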

Comment by Nicola [ 2018-10-15 ]

Can someone help me solve this problem?

Thanks,
Nicola Battista

Comment by Andrew Hutchings (Inactive) [ 2018-10-16 ]

There appear to be connection issues between your nodes which are causing various failures. Do you have a firewall in place or have you made some network changes recently?

Comment by Nicola [ 2018-10-16 ]

Hi Andrew Hutchings,
The firewall is disabled on all nodes.
Have we made any network changes recently? No: the network where the VMs are located is the production network (the DB network), so we cannot make changes at the network level without giving notice.

Thanks,
Regards.
Nicola Battista

Comment by David Hill (Inactive) [ 2018-10-17 ]

Memory is your problem. 20 GB of memory is too little.
32 GB is the recommended minimum, and to really utilize the full power of ColumnStore you should have 64 GB or more.

This is from the support report:

Oct 10 05:01:40 cstore-pm01 PrimProc[127913]: 40.419421 |0|0|0| C 28 CAL0045: FATAL ERROR: PrimProc has allocated too much memory! PrimProc is restarting.

MemTotal: 20542236 kB (about 20 GB)

From top:

KiB Mem : 20542236 total, 1165616 free, 15208756 used, 4167864 buff/cache

PrimProc memory setting in ../etc/Columnstore.xml:

<NumBlocksPct>70</NumBlocksPct>

If you are stuck with 20 GB, you could try lowering the NumBlocksPct setting to, say, 50 or less, but this will affect query performance.
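
To make the numbers in this comment concrete, here is a rough Python sketch of the arithmetic. It assumes that NumBlocksPct is the percentage of total system memory that PrimProc may use for its block cache. That reading of the setting, and the helper name `block_cache_kb`, are assumptions for illustration and are not stated in the support report.

```python
def block_cache_kb(mem_total_kb, num_blocks_pct):
    """Approximate block-cache budget in kB.

    Assumption: NumBlocksPct is a percentage of total system memory.
    """
    return mem_total_kb * num_blocks_pct // 100

MEM_TOTAL_KB = 20542236  # MemTotal quoted from the support report

# Compare the configured value (70) with the suggested lower value (50).
for pct in (70, 50):
    cache_kb = block_cache_kb(MEM_TOTAL_KB, pct)
    remaining_kb = MEM_TOTAL_KB - cache_kb
    print(f"NumBlocksPct={pct}: cache ~{cache_kb / 1024**2:.1f} GiB, "
          f"remaining ~{remaining_kb / 1024**2:.1f} GiB")
```

Under this assumption, the setting trades cache size directly against the memory left for every other process on the node, which is why lowering it can reduce memory pressure at the cost of query performance.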

Comment by Nicola [ 2018-10-17 ]

Hi David,
Thanks for your support.
I will try to increase the memory of all PMs from 20 GB to 32 GB. For the user modules, is 12 GB per node enough?
Thanks again,
Regards.
Nicola Battista

Comment by David Hill (Inactive) [ 2018-10-17 ]

32 GB is also the recommended minimum for UMs. If you can increase it there as well, that would be great.

Comment by Nicola [ 2018-11-09 ]

Hi David Hill,
I increased all VMs to 32 GB, but the system is still unstable and the processes go down for no apparent reason.

Comment by David Hill (Inactive) [ 2018-11-09 ]

From the ColumnStore logs, can you check whether it's restarting because of swap space issues?
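
Alongside the logs, one rough way to watch for swap pressure on a Linux node is to read the swap counters from /proc/meminfo. The Python sketch below is illustrative only: the function name `swap_used_pct` and the sample values are hypothetical, and it does not reproduce whatever threshold ColumnStore itself applies.

```python
def swap_used_pct(meminfo_text):
    """Return swap usage as a percentage, or None if no swap is configured."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])  # value in kB
    total = values.get("SwapTotal", 0)
    if total == 0:
        return None
    return 100.0 * (total - values.get("SwapFree", 0)) / total

# Hypothetical sample in /proc/meminfo format. On a real node, pass
# open("/proc/meminfo").read() instead.
sample = """\
SwapCached:            0 kB
SwapTotal:       4194300 kB
SwapFree:        1048575 kB
"""

print(f"swap used: {swap_used_pct(sample):.1f}%")
```

Running this periodically on each UM and PM and comparing the timestamps with the process restarts would help confirm or rule out swap as the trigger.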

Comment by Nicola [ 2018-11-09 ]

I checked the log directory, but nothing was written there.
Even running restartSystem failed with an error saying that it could not contact the processes on the other nodes.
I had to restart all the servers to fix it (but I'm sure it will crash again in a few days).

Comment by Abhinav santi [ 2018-12-05 ]

I had the same issue with 64 GiB of memory on all 4 PMs and 1 UM. Swap usage exceeds the limit and the system tries to restart but fails to come up successfully (as UM1's swap is not cleared at restart).
Is this a known bug?

Comment by Roman [ 2019-03-15 ]

I would suggest upgrading to the 1.2.3 release because it contains important changes in memory management.

Comment by Todd Stoffel (Inactive) [ 2022-11-05 ]

This item is being closed because it was well past the expiration date with no activity. If you suspect this was done in error, please create a new ticket.
