[MCOL-1161] ColumnStore automatically restarted Created: 2018-01-15  Updated: 2020-08-25  Resolved: 2018-05-25

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.0.11
Fix Version/s: Icebox

Type: Bug Priority: Major
Reporter: Geoff Montee (Inactive) Assignee: Andrew Hutchings (Inactive)
Resolution: Incomplete Votes: 0
Labels: None

Issue Links:
Relates
relates to MCOL-1162 Document information about auto-restarts Closed

 Description   

ColumnStore automatically restarted after the following messages were logged to warning.log:

Jan 4 15:23:57 db1 messagequeue[179600]: 57.847109 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed
Jan 4 15:23:58 db1 joblist[179600]: 58.092497 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.

I assume that the ExeMgr error is related to the preceding network error, since the documentation says that ExeMgr listens on a TCP/IP port:

Execution Manager: ExeMgr

ExeMgr listens on a TCP/IP port for query parse trees from mysqld. ExeMgr is responsible for converting the query parse tree into a job list, which is a construct in MariaDB ColumnStore that represents the sequence of instructions necessary to answer the query. ExeMgr walks the query parse tree and iteratively generates job steps, optimizing and re-optimizing the job list as it goes. The major categories of job steps are application of a column filter, processing a table join, and projection of returned columns. Each operation in a query plan is executed in parallel by the job list itself and has the capability of running entirely on the User Module, entirely on the Performance Module or in some combination. Each node uses the Extent Map to determine which Performance Modules to send work orders to (see ColumnStore Storage Architecture for details on the Extent).

https://mariadb.com/kb/en/library/columnstore-user-module/

Would this ExeMgr error have led to the auto-restart of ColumnStore? If so, then I think the error message should be improved. e.g. it could say something like:

Lost TCP/IP connection to ExeMgr process. ColumnStore might automatically restart to restore functionality!

The message is short, but it clearly says that all of the following are related:

  • The network error.
  • The ExeMgr error.
  • The auto-restart.


 Comments   
Comment by David Hill (Inactive) [ 2018-01-15 ]

There is only 1 thing that would cause ColumnStore to be automatically restarted and that is if one of the servers with the system hit a high percent of swap space usage, 70% or more.

So the ExeMgr logs is probably a side effect of the system restarting. You would want to check the server logs to see if SWAP SPACE high usage was reported.

Comment by Geoff Montee (Inactive) [ 2018-01-15 ]

Hi hill,

The logs do not contain any messages related to excessive swapping.

Comment by Nan Yu [ 2018-01-17 ]

Hi, David Hill:
This might not be a whole database restart?

In the debug.log:
Jan 4 15:24:03 usfit-scdb1 ProcessMonitor[23597]: 03.055432 |0|0|0| D 18 CAL0000: Inform Process Mgr that process was restarted: ExeMgr
Jan 4 15:24:03 usfit-scdb1 ProcessMonitor[23597]: 03.056098 |0|0|0| I 18 CAL0000: Calpont Process ExeMgr restarted successfully!!

In the info.log:
Jan 4 15:24:03 usfit-scdb1 ProcessMonitor[23597]: 03.056098 |0|0|0| I 18 CAL0000: Calpont Process ExeMgr restarted successfully!!

Thanks!
Nan

Comment by Zdravelina Sokolovska (Inactive) [ 2018-01-24 ]

Have seen these messages in case of lack of memory
which caused MCS to automatically to restart
Case 1. Lack of memory in ExeMgr

From UM
 um1 ExeMgr[2101]: 16.888057 |0|0|0| C 16 CAL0044: FATAL ERROR: ExeMgr has allocated too much memory! Percent allocation-96, allowed-95. ExeMgr is restarting.
~~
 um1 messagequeue[2022]: 35.152713 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed
 um1 joblist[2022]: 35.363075 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
 um1 joblist[2022]: 35.363403 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 146 /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp: Could not get a connection to a ExeMgr

Case 2 . Lack of memory in PrimProc followed by restating MCS System

from PM
Dec 15 05:11:05 ip-172-31-41-239 ServerMonitor[18471]: 05.412880 |0|0|0| C 09 CAL0000: Swap Space usage over Major threashold, perform OAM command  restartSystem
 
From UM
ip-172-31-47-128 ProcessMonitor[1905]: 58.591694 |0|0|0| D 18 CAL0000: Successfully ran DBRM clearShm
ip-172-31-47-128 messagequeue[80760]: 02.210509 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic(): I/O error2.1: err = -1 e = 104: Connection reset by peer
ip-172-31-47-128 messagequeue[80760]: 02.210512 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic: Remote is closed
ip-172-31-47-128 messagequeue[80760]: 02.210509 |0|0|0| W 31 CAL0000: Client read close socket for InetStreamSocket::readToMagic(): I/O error2.1: err = -1 e = 104: Connection reset by peer
ip-172-31-47-128 messagequeue[80760]: 02.210804 |0|0|0| W 31 CAL0000: MessageQueueClient::write: error writing 4 bytes to IOSocket: sd: 75 inet: 172.31.47.128 port: 8601. Socket error was InetStreamSocket::write error: Broken pipe -- write from InetStreamSocket: sd: 75 inet: 172.31.47.128 port: 8601
ip-172-31-47-128 joblist[80760]: 02.212936 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.
ip-172-31-47-128 joblist[80760]: 02.212959 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/execplan/clientrotator.cpp @ 318 Could not get a ExeMgr connection.

Comment by Nan Yu [ 2018-01-29 ]

Hi, Winstone:
Thanks!
Looks like my case is very similar to your Case 1 example. The only thing different is that I could not find the " ExeMgr has allocated too much memory" error message.

Nan

Comment by Geoff Montee (Inactive) [ 2018-04-23 ]

What feedback is needed on this? nyu said that this problem seemed to be similar to case #1 mentioned by winstone, but that the log didn't look exactly the same. Is it safe to make the conclusion that ExeMgr ran out of memory, but it didn't log that message for some reason?

Comment by David Hill (Inactive) [ 2018-05-14 ]

In both cases, ExeMgr restarting due to using too much memory or the system restarting due to swap space being hit on the UM, ExeMgr is using too much memory.

1. need to know the setting of TotalUmMemory. This setting might need to be lowered.
2. if they can reproduce, then need to know what queries they are running at the time.
a. need un log files at the time of exemgr or system restart. It will show what processes are using what memory.
Could be mysql or another apps are using a lot of memory besides exemgr which is causing the system restart.

Generated at Thu Feb 08 02:26:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.