[MXS-3369] maxscale OOM Created: 2021-01-08  Updated: 2021-09-12  Resolved: 2021-09-01

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 2.4.12
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Allen Lee (Inactive) Assignee: markus makela
Resolution: Incomplete Votes: 0
Labels: need_feedback

Attachments: CS0200670.RAM_usage_bumps.txt, maxscale.cnf
Sprint: MXS-SPRINT-139

 Description   

The customer reported that their MaxScale node ran out of memory (OOM) due to increasing memory usage.
Here is what the customer tested; the config and logs are attached.

To debug the memory usage issue, I've gone through the following steps.
 
[root@rnqmax401 ~]# date
Thu Jan 7 08:41:52 PST 2021
[root@rnqmax401 ~]#
Using top, I've captured the PID that is taking up all the memory.
top - 07:57:24 up 2 days, 9:33, 1 user, load average: 0.19, 0.14, 0.11
Tasks: 156 total, 1 running, 155 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.9 us, 0.8 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 98.2/16247560 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
KiB Swap: 54.2/4194300 [|||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
 
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
54343 maxscale 20 0 15.7g 14.0g 376 S 0.7 90.7 14:05.80 maxscale
 
Check which process is running with PID 54343. It is the MaxScale systemd service.
[root@rnqmax401 ~]# ps -ef | grep 54343
root 43013 68600 0 08:01 pts/2 00:00:00 grep --color=auto 54343
maxscale 54343 1 0 Jan05 ? 00:14:07 /usr/bin/maxscale
[root@rnqmax401 ~]#
 
Check which port the maxadmin interface is listening on for that MaxScale instance. It is 6111.
[root@rnqmax401 ~]# grep port /etc/maxscale.cnf
admin_port = 8991
port = 3111
port = 3111
port = 3111
port = 6111
port = 3111
#port=4442
port = 3111
port = 3111
port = 3111
port = 9994
# These listeners represent the ports the
[root@rnqmax401 ~]#
 
Before I started this debugging, I redirected the application connections through a different MaxScale server. As you can see below, there were no active connections while I collected these stats. However, the memory allocated to MaxScale was not released back to the OS. This was captured on Thu Jan 7 08:41:52 PST 2021.
[root@rnqmax401 ~]# maxadmin -pmariadb -P6111 list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server | Address | Port | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
server1 | 10.142.108.141 | 3111 | 0 | Master, Synced, Running
server2 | 10.142.108.142 | 3111 | 0 | Slave, Synced, Running
server3 | 10.142.108.143 | 3111 | 0 | Slave, Synced, Running
server1AD | 10.142.108.141 | 3111 | 0 | Master, Synced, Running
server2AD | 10.142.108.142 | 3111 | 0 | Slave, Synced, Running
server3AD | 10.142.108.143 | 3111 | 0 | Slave, Synced, Running
-------------------+-----------------+-------+-------------+--------------------
[root@rnqmax401 ~]#
 
MaxScale usage at Tue Jan 5 16:17:49 PST 2021; this was captured before the debug test.
-------------------+-----------------+-------+-------------+--------------------
Server | Address | Port | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
server1 | 10.142.108.141 | 3111 | 966 | Master, Synced, Running
server2 | 10.142.108.142 | 3111 | 966 | Slave, Synced, Running
server3 | 10.142.108.143 | 3111 | 966 | Slave, Synced, Running
server1AD | 10.142.108.141 | 3111 | 0 | Master, Synced, Running
server2AD | 10.142.108.142 | 3111 | 0 | Slave, Synced, Running
server3AD | 10.142.108.143 | 3111 | 0 | Slave, Synced, Running
-------------------+-----------------+-------+-------------+--------------------
 
I restarted MaxScale on 2021-01-05 15:02:06 and its memory usage stayed low until Tue Jan 5 16:00:35 PST 2021. Between 16:00 and 16:04, the RAM usage went up from 890 MB to 15327 MB.
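For reference, a minimal sketch of how periodic RSS samples such as those in the attached CS0200670.RAM_usage_bumps.txt could be collected; the exact commands and the one-minute interval are assumptions, not taken from the attachment:

# Sample MaxScale resident memory (RSS in kB) once a minute; stop with Ctrl-C
while true; do
    echo "$(date '+%F %T') $(ps -o rss= -p "$(pidof maxscale)")"
    sleep 60
done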

  • The maxscale log is too large to attach, so please check the support case.


 Comments   
Comment by markus makela [ 2021-01-08 ]

What happened between 16:00 and 16:04 that caused the memory usage to spike? Was there a data dump of some sort being done through MaxScale? If so, configuring writeq_high_water and writeq_low_water is likely to solve the problem.
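As a minimal sketch, assuming the watermarks go in the global [maxscale] section of /etc/maxscale.cnf, the change could look like the following; the byte values are illustrative only and would need tuning for the actual workload:

[maxscale]
# Stop reading more data from a connection once this many bytes are queued for writing
writeq_high_water = 16777216
# Resume reading once the write queue has drained below this many bytes
writeq_low_water = 8192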

If possible, please test whether this happens with MaxScale 2.5 as well.

Comment by markus makela [ 2021-09-01 ]

I'll close this as Incomplete since we don't know if this is fixed by an upgrade to 2.5 or by enabling the writeq watermarks.
