[MCOL-1355] Calpont Process Restarting: ExeMgr Created: 2018-04-20  Updated: 2023-07-02  Resolved: 2023-07-02

Status: Closed
Project: MariaDB ColumnStore
Component/s: N/A
Affects Version/s: 1.1.3
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Muhammad Abbas Assignee: Leonid Fedorov
Resolution: Won't Do Votes: 0
Labels: None
Environment:

Linux, 64 GB RAM, 2 TB SSD


Attachments: Text File pm1_configReport.txt     Text File pm1_dbmsReport.txt     Text File pm1_hardwareReport.txt     Text File pm1_logReport.txt     File pm1_mysqllogReport.tar.gz     Text File pm1_resourceReport.txt     Text File pm1_softwareReport.txt    

 Description   

I am trying to run a query , its joining from 5 tables and ton of columns. I am getting the follow error. When i run this query for 6 days it runs fine, for running it over 30 days it gives error.

Apr 20 16:39:49 VerticaVM1 ProcessMonitor[118869]: 49.319771 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ExeMgr, old PID = 26211
Apr 20 16:39:49 VerticaVM1 joblist[43730]: 49.343012 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Apr 20 16:39:49 VerticaVM1 joblist[43730]: 49.343207 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr
Apr 20 17:00:33 VerticaVM1 ExeMgr[102237]: 33.254076 |184|0|0| E 16 All rows hashed to the same bucket
Apr 20 17:11:07 VerticaVM1 ProcessMonitor[118869]: 07.309241 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ExeMgr, old PID = 102237
Apr 20 17:11:07 VerticaVM1 joblist[43730]: 07.830508 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Apr 20 17:11:07 VerticaVM1 joblist[43730]: 07.830628 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr
Apr 20 17:33:51 VerticaVM1 ProcessMonitor[118869]: 51.310968 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ExeMgr, old PID = 104586
Apr 20 17:33:51 VerticaVM1 joblist[43730]: 51.350624 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
Apr 20 17:33:51 VerticaVM1 joblist[43730]: 51.350749 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2018-04-21 ]

Can you please generate a support report so we can see what went wrong? https://mariadb.com/kb/en/library/system-troubleshooting-mariadb-columnstore/#mariadb-columnstore-support-report-tool

Comment by Muhammad Abbas [ 2018-04-21 ]

I have attached all the files, one file is exceeding the limit. Please suggest how to upload.

pm1_logReport.tar.gz

Comment by Andrew Hutchings (Inactive) [ 2018-04-21 ]

Please give it a unique name and then upload to ftp://ftp.mariadb.com/uploads.

Comment by Muhammad Abbas [ 2018-04-22 ]

File has been uploaded with the name of "MCOL-1355pm1_logReport.tar". Please confirm.

Comment by Muhammad Abbas [ 2018-04-23 ]

Team, any update here?

Comment by Andrew Hutchings (Inactive) [ 2018-05-01 ]

Long story short, you are running out of RAM on the system and it is causing ExeMgr to be restarted. The important part of the log can be seen here:

Apr 20 17:20:53 VerticaVM1 joblist[104586]: 53.566132 |0|0|0| I 05 CAL0000: IDB-2052: Out of UM memory, switching to disk-based join.
Apr 20 17:21:05 VerticaVM1 ServerMonitor[43096]: 05.185284 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  87
Apr 20 17:22:05 VerticaVM1 ServerMonitor[43096]: 05.191859 |0|0|0| W 09 CAL0000: Swap  alarm # 10 cleared
Apr 20 17:22:05 VerticaVM1 ServerMonitor[43096]: 05.194871 |0|0|0| W 09 CAL0000: Swap  alarm # 11 cleared
Apr 20 17:22:05 VerticaVM1 ServerMonitor[43096]: 05.197915 |0|0|0| W 09 CAL0000: Swap  alarm # 12 cleared
Apr 20 17:29:05 VerticaVM1 ServerMonitor[43096]: 05.228883 |0|0|0| I 09 CAL0000: Local Memory above Critical Memory threshold with a percentage of  92  ; Swap  61
Apr 20 17:29:05 VerticaVM1 ServerMonitor[43096]: 05.230897 |0|0|0| I 09 CAL0000: Local-Memory  usage at percentage of  92 , Alarm set:  7
Apr 20 17:29:05 VerticaVM1 ServerMonitor[43096]: 05.287361 |0|0|0| I 09 CAL0000: Memory Usage for Process:  mysqld  : Memory Used  1814  : % Used  1
Apr 20 17:29:05 VerticaVM1 ServerMonitor[43096]: 05.287455 |0|0|0| I 09 CAL0000: Memory Usage for Process:  systemd-journald  : Memory Used  1858  : % Used  1
Apr 20 17:29:05 VerticaVM1 ServerMonitor[43096]: 05.287506 |0|0|0| I 09 CAL0000: Memory Usage for Process:  WriteEngineServ  : Memory Used  5352  : % Used  1
Apr 20 17:29:05 VerticaVM1 ServerMonitor[43096]: 05.287548 |0|0|0| I 09 CAL0000: Memory Usage for Process:  ExeMgr  : Memory Used  2496768  : % Used  39
Apr 20 17:29:05 VerticaVM1 ServerMonitor[43096]: 05.287589 |0|0|0| I 09 CAL0000: Memory Usage for Process:  PrimProc  : Memory Used  3020428  : % Used  47
Apr 20 17:32:53 VerticaVM1 ServerMonitor[43096]: 53.349621 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of  72
Apr 20 17:32:53 VerticaVM1 ServerMonitor[43096]: 53.358009 |0|0|0| I 09 CAL0000: Swap  usage at percentage of  72 , Alarm set:  12
Apr 20 17:32:54 VerticaVM1 ServerMonitor[43096]: 54.361258 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of  77
Apr 20 17:32:55 VerticaVM1 ServerMonitor[43096]: 55.364938 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of  77
Apr 20 17:32:57 VerticaVM1 ServerMonitor[43096]: 57.790897 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  82
Apr 20 17:32:57 VerticaVM1 ServerMonitor[43096]: 57.793766 |0|0|0| I 09 CAL0000: Swap  usage at percentage of  82 , Alarm set:  11
Apr 20 17:32:58 VerticaVM1 ServerMonitor[43096]: 58.797464 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  83
Apr 20 17:32:59 VerticaVM1 ServerMonitor[43096]: 59.801117 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  83
Apr 20 17:33:00 VerticaVM1 ServerMonitor[43096]: 00.804876 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  83
Apr 20 17:33:01 VerticaVM1 ServerMonitor[43096]: 01.808579 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  83
Apr 20 17:33:02 VerticaVM1 ServerMonitor[43096]: 02.812335 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  83
Apr 20 17:33:03 VerticaVM1 ServerMonitor[43096]: 03.816117 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  83
Apr 20 17:33:04 VerticaVM1 ServerMonitor[43096]: 04.819968 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  83
Apr 20 17:33:05 VerticaVM1 ServerMonitor[43096]: 05.824756 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:06 VerticaVM1 ServerMonitor[43096]: 06.828678 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:07 VerticaVM1 ServerMonitor[43096]: 07.832789 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:08 VerticaVM1 ServerMonitor[43096]: 08.837039 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:09 VerticaVM1 ServerMonitor[43096]: 09.840708 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:10 VerticaVM1 ServerMonitor[43096]: 10.844833 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:11 VerticaVM1 ServerMonitor[43096]: 11.848957 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:12 VerticaVM1 ServerMonitor[43096]: 12.852836 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:13 VerticaVM1 ServerMonitor[43096]: 13.856805 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:14 VerticaVM1 ServerMonitor[43096]: 14.901080 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:15 VerticaVM1 ServerMonitor[43096]: 15.905992 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:16 VerticaVM1 ServerMonitor[43096]: 16.909849 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  84
Apr 20 17:33:17 VerticaVM1 ServerMonitor[43096]: 17.914519 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:18 VerticaVM1 ServerMonitor[43096]: 18.918401 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:19 VerticaVM1 ServerMonitor[43096]: 19.922308 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:20 VerticaVM1 ServerMonitor[43096]: 20.928536 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:21 VerticaVM1 ServerMonitor[43096]: 21.932375 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:22 VerticaVM1 ServerMonitor[43096]: 22.936241 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:23 VerticaVM1 ServerMonitor[43096]: 23.940177 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:24 VerticaVM1 ServerMonitor[43096]: 24.944061 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:25 VerticaVM1 ServerMonitor[43096]: 25.947760 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:26 VerticaVM1 ServerMonitor[43096]: 26.951411 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:27 VerticaVM1 ServerMonitor[43096]: 27.954957 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:28 VerticaVM1 ServerMonitor[43096]: 28.970947 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:29 VerticaVM1 ServerMonitor[43096]: 29.979514 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:30 VerticaVM1 ServerMonitor[43096]: 30.983498 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  85
Apr 20 17:33:31 VerticaVM1 ServerMonitor[43096]: 31.987209 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  86
Apr 20 17:33:32 VerticaVM1 ServerMonitor[43096]: 32.990633 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of  88
Apr 20 17:33:34 VerticaVM1 ServerMonitor[43096]: 34.291509 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  91
Apr 20 17:33:34 VerticaVM1 ServerMonitor[43096]: 34.306556 |0|0|0| I 09 CAL0000: Swap  usage at percentage of  91 , Alarm set:  10
Apr 20 17:33:35 VerticaVM1 ServerMonitor[43096]: 35.317123 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  95
Apr 20 17:33:36 VerticaVM1 ServerMonitor[43096]: 36.320993 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  95
Apr 20 17:33:37 VerticaVM1 ServerMonitor[43096]: 37.324866 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  95
Apr 20 17:33:38 VerticaVM1 ServerMonitor[43096]: 38.328766 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  95
Apr 20 17:33:39 VerticaVM1 ServerMonitor[43096]: 39.332588 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  95
Apr 20 17:33:40 VerticaVM1 ServerMonitor[43096]: 40.337078 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  95
Apr 20 17:33:41 VerticaVM1 ServerMonitor[43096]: 41.341285 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  95
Apr 20 17:33:42 VerticaVM1 ServerMonitor[43096]: 42.345123 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  96
Apr 20 17:33:43 VerticaVM1 ServerMonitor[43096]: 43.348847 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  98
Apr 20 17:33:44 VerticaVM1 ServerMonitor[43096]: 44.352589 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  100
Apr 20 17:33:45 VerticaVM1 ServerMonitor[43096]: 45.356632 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  99
Apr 20 17:33:46 VerticaVM1 ServerMonitor[43096]: 46.375337 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  99
Apr 20 17:33:47 VerticaVM1 ServerMonitor[43096]: 47.379568 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  100
Apr 20 17:33:48 VerticaVM1 ServerMonitor[43096]: 48.648055 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  99
Apr 20 17:33:49 VerticaVM1 ServerMonitor[43096]: 49.657541 |0|0|0| W 09 CAL0000: Local-Memory  alarm # 7 cleared
Apr 20 17:33:49 VerticaVM1 ServerMonitor[43096]: 49.658454 |0|0|0| I 09 CAL0000: Swap above Critical Memory threshold with a percentage of  100
Apr 20 17:33:51 VerticaVM1 ProcessMonitor[118869]: 51.310968 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: ExeMgr, old PID = 104586
Apr 20 17:33:51 VerticaVM1 ProcessMonitor[118869]: 51.311315 |0|0|0| D 18 CAL0000: STOPPING Process: ExeMgr

ExeMgr is using more memory than it should despite the disk based join. In theory this query should have aborted already.

I'll assign this to be looked into, in the mean time I would highly recommend considering a setup with a separate UM and PM so that the UM has more RAM to grow in and the PM can use more RAM for block cache.

Comment by Muhammad Abbas [ 2018-05-01 ]

It might not be possible to change any thing related to setup, please suggest a way that ExeMgr uses the RAM within given limits, so that it does not crashes due to the above error.

Comment by Muhammad Abbas [ 2018-05-17 ]

Any update?

Comment by David Hill (Inactive) [ 2019-01-23 ]

As reported by another customer, support report added.

Comment by Patrick LeBlanc (Inactive) [ 2019-01-24 ]

It sounds like a mem leak to me. We've fixed a couple leaks, and we've fixed an abort problem which can cause it to never free whole joblists, including large buffers. Worth retesting.

Comment by Todd Stoffel (Inactive) [ 2023-07-02 ]

The "create date" on this ticket is pre-convergence with MariaDB server. If the issue still exists in a modern version of the engine/plugin please submit a new ticket.

Generated at Thu Feb 08 02:28:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.