[MCOL-1556] SwapAction = restartSystem Issue Created: 2018-07-11  Updated: 2019-02-18  Resolved: 2019-02-18

Status: Closed
Project: MariaDB ColumnStore
Component/s: ProcMgr
Affects Version/s: 1.1.4
Fix Version/s: 1.1.4

Type: Bug Priority: Major
Reporter: ssauravy Assignee: Roman
Resolution: Not a Bug Votes: 1
Labels: None
Environment: CentOS 6.9

 Description   

[Conf]
CPU : 14 Core (Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz)
Memory : 28Gbytes
Disk : External Storage (/usr/local/mariadb/columnstore/data1)
CentOS release 6.9 (Final)
SoftwareVersion = 1.1.4 , combined single node
SoftwareRelease = 1

TotalUmMemory : 25
TotalPmUmMemory :10
NumBlocksPct :50
<SwapAction>restartSystem</SwapAction>

max_length_for_sort_data : 2048
innodb_buffer_pool_size = 128M

[Question]
After setting SwapAction to restartSystem and running a SELECT query, the following entries are recorded in info.log.
All SSH sessions are dropped, no new connections can be made, and the OS hangs.
Please refer to the log below.

Jul 11 13:22:19 EDWPOCDB1 joblist[5359]: 19.142655 |0|0|0| I 05 CAL0000: IDB-2052: Out of UM memory, switching to disk-based join.
Jul 11 13:22:19 EDWPOCDB1 joblist[5359]: 19.163140 |0|0|0| I 05 CAL0000: IDB-2052: Out of UM memory, switching to disk-based join.
Jul 11 13:23:10 EDWPOCDB1 ServerMonitor[5207]: 10.425116 |0|0|0| I 09 CAL0000: Local Memory above Critical Memory threshold with a percentage of 100 ; Swap 32
Jul 11 13:23:10 EDWPOCDB1 ServerMonitor[5207]: 10.454046 |0|0|0| I 09 CAL0000: Local-Memory usage at percentage of 100 , Alarm set: 7
Jul 11 13:23:10 EDWPOCDB1 ServerMonitor[5207]: 10.780321 |0|0|0| I 09 CAL0000: Memory Usage for Process: DMLProc : Memory Used 1184 : % Used 1
Jul 11 13:23:10 EDWPOCDB1 ServerMonitor[5207]: 10.782910 |0|0|0| I 09 CAL0000: Memory Usage for Process: mysqld : Memory Used 3416 : % Used 1
Jul 11 13:23:10 EDWPOCDB1 ServerMonitor[5207]: 10.783005 |0|0|0| I 09 CAL0000: Memory Usage for Process: WriteEngineServ : Memory Used 123383 : % Used 5
Jul 11 13:23:10 EDWPOCDB1 ServerMonitor[5207]: 10.783074 |0|0|0| I 09 CAL0000: Memory Usage for Process: ExeMgr : Memory Used 1286296 : % Used 45
Jul 11 13:23:10 EDWPOCDB1 ServerMonitor[5207]: 10.783147 |0|0|0| I 09 CAL0000: Memory Usage for Process: PrimProc : Memory Used 1395679 : % Used 49
Jul 11 13:31:07 EDWPOCDB1 ServerMonitor[5207]: 07.357300 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of 72
Jul 11 13:31:07 EDWPOCDB1 ServerMonitor[5207]: 07.367925 |0|0|0| I 09 CAL0000: Swap usage at percentage of 72 , Alarm set: 12
Jul 11 13:31:08 EDWPOCDB1 ServerMonitor[5207]: 08.371554 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of 74
Jul 11 13:31:09 EDWPOCDB1 ServerMonitor[5207]: 09.375483 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of 74
Jul 11 13:31:10 EDWPOCDB1 ServerMonitor[5207]: 10.380719 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of 74
...
Jul 11 13:32:33 EDWPOCDB1 ServerMonitor[5207]: 33.180240 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of 78
Jul 11 13:32:34 EDWPOCDB1 ServerMonitor[5207]: 34.186631 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of 80
Jul 11 13:32:34 EDWPOCDB1 ServerMonitor[5207]: 34.193804 |0|0|0| I 09 CAL0000: Swap usage at percentage of 80 , Alarm set: 11
Jul 11 13:32:34 EDWPOCDB1 ProcessManager[3011]: 34.296494 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...
Jul 11 13:32:34 EDWPOCDB1 ProcessMonitor[2921]: 34.363129 |0|0|0| I 18 CAL0000: MSG RECEIVED: Stop All process request...
Jul 11 13:32:34 EDWPOCDB1 ProcessMonitor[2921]: 34.365556 |0|0|0| I 18 CAL0000: STOPALL: ACK back to ProcMgr, STATUS_UPDATE only performed
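
The swap progression in the excerpt above can be extracted programmatically. The following is a minimal sketch, assuming the log lines have exactly the shape shown above; the function name `swap_readings` and the regular expression are illustrative and are not part of ColumnStore.

```python
import re

# Matches ServerMonitor lines such as
#   "... ServerMonitor[5207]: ... Swap above Major Memory threshold with a percentage of 80"
# The pattern is an assumption derived only from the excerpt in this report.
SWAP_RE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<time>\d{2}:\d{2}:\d{2})\s+\S+\s+ServerMonitor\[\d+\]:"
    r".*Swap above (?P<level>\w+) Memory threshold with a percentage of (?P<pct>\d+)"
)

def swap_readings(lines):
    """Return (time, level, percent) tuples for each swap-threshold log line."""
    readings = []
    for line in lines:
        m = SWAP_RE.search(line)
        if m:
            readings.append((m.group("time"), m.group("level"), int(m.group("pct"))))
    return readings

# Three lines copied from the log excerpt in this report.
sample = [
    "Jul 11 13:31:07 EDWPOCDB1 ServerMonitor[5207]: 07.357300 |0|0|0| I 09 CAL0000: Swap above Minor Memory threshold with a percentage of 72",
    "Jul 11 13:32:34 EDWPOCDB1 ServerMonitor[5207]: 34.186631 |0|0|0| I 09 CAL0000: Swap above Major Memory threshold with a percentage of 80",
    "Jul 11 13:32:34 EDWPOCDB1 ProcessManager[3011]: 34.296494 |0|0|0| I 17 CAL0000: MSG RECEIVED: Restart System request...",
]

for reading in swap_readings(sample):
    print(reading)
```

Lines from other processes, such as the ProcessManager restart request, are ignored, so the result shows only how swap usage climbed before the restart was triggered.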

Note: If SwapAction is set to none, only ExeMgr is restarted and ColumnStore returns to normal.



 Comments   
Comment by Roman [ 2018-07-16 ]

Greetings Kim,

Thank you for reporting the issue.
Do I understand correctly that the OS hangs when this action is set? If so, are you running on virtualized hardware? What value of vm.swappiness does your Linux system use? Have you tried lowering it? Have you checked syslog for OOM killer entries? How big is the dataset, and what kind of join are you running?

Comment by ssauravy [ 2018-07-18 ]

Good morning. Sorry for the delay.
1. The system is currently running in a VM environment.
2. vm.swappiness is set to 1.
3. There are no OOM killer traces or hardware crash/fault traces in /var/log/dmesg or /var/log/messages.
4. The dataset is about 1.1 TB.
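
The checks in items 2 and 3 can be repeated with a short script. This is a minimal sketch, assuming a Linux host where `/proc/sys/vm/swappiness` is readable; the helper names `find_oom_lines` and `read_swappiness`, the list of search strings, and the sample log text are hypothetical and only illustrate the idea.

```python
import re

# Substrings the Linux kernel commonly logs when the OOM killer runs.
# The list is illustrative, not exhaustive.
OOM_RE = re.compile(r"invoked oom-killer|Out of memory|Killed process", re.IGNORECASE)

def find_oom_lines(log_text):
    """Return the log lines that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_RE.search(line)]

def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

# Hypothetical sample text: the first line is made up for illustration,
# the second is modelled on the ColumnStore log in this report.
sample_log = (
    "kernel: Out of memory: Kill process 1234 (example) score 900 or sacrifice child\n"
    "ServerMonitor[5207]: Swap above Minor Memory threshold with a percentage of 72\n"
)

print(find_oom_lines(sample_log))
print(read_swappiness())
```

In practice the text passed to `find_oom_lines` would be the contents of the system log files named in item 3. An empty result is consistent with the report that no OOM killer traces were found.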

Comment by Roman [ 2018-07-23 ]

Dear Kim,
It looks like ExeMgr takes more memory than it is allowed according to your config snippet. The system is totally out of memory according to this message: 'Swap above Major Memory threshold with a percentage of 80'. I presume the OS couldn't get enough memory and either restarted or crashed.
Could you post the output of the support report tool? Here is the guide on how to collect the info. It is best to collect a report right after the OS has rebooted.
Meanwhile, you could lower the values below and try to run the query again.
Set ModuleSwapMinorThreshold3 and ModuleSwapMajorThreshold3 to more appropriate values, such as ModuleSwapMinorThreshold3 = 30 and ModuleSwapMajorThreshold3 = 40.
You could also decrease TotalUmMemory to 15%.
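
To make the suggestion concrete, here is a rough arithmetic sketch. It assumes that TotalUmMemory and NumBlocksPct are both percentages of physical RAM, and that a swap threshold fires once usage reaches it; neither assumption is confirmed in this thread. The helper names `configured_budget_gb` and `swap_level` are hypothetical.

```python
# Values taken from the report. Treating the percentages as shares of
# physical RAM is an assumption, not confirmed ColumnStore behaviour.
TOTAL_RAM_GB = 28
NUM_BLOCKS_PCT = 50  # NumBlocksPct from the report

def configured_budget_gb(total_um_memory_pct, num_blocks_pct=NUM_BLOCKS_PCT, ram_gb=TOTAL_RAM_GB):
    """Approximate RAM (GB) claimed by TotalUmMemory plus NumBlocksPct."""
    return ram_gb * (total_um_memory_pct + num_blocks_pct) / 100

def swap_level(swap_pct, minor, major):
    """Classify a swap percentage against minor/major thresholds."""
    if swap_pct >= major:
        return "major"
    if swap_pct >= minor:
        return "minor"
    return "ok"

print(configured_budget_gb(25))             # reported TotalUmMemory
print(configured_budget_gb(15))             # suggested TotalUmMemory
print(swap_level(32, minor=30, major=40))   # swap reading logged at 13:23:10
```

Under these assumptions, lowering TotalUmMemory reduces the combined memory claim, and the lower thresholds would raise a swap alarm much earlier than the 72 to 80 percent range seen in the log.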

Comment by Roman [ 2018-07-26 ]

Dear Kim, do you have any news on this issue?

Comment by Roman [ 2019-02-18 ]

Closed due to no response.
