[MXS-773] 100% CPU on idle MaxScale with MaxInfo Created: 2016-06-16  Updated: 2017-04-28  Resolved: 2016-09-06

Status: Closed
Project: MariaDB MaxScale
Component/s: maxinfo
Affects Version/s: 1.4.1, 1.4.3
Fix Version/s: 2.0.1

Type: Bug Priority: Major
Reporter: Mathew Hornbeek Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

Ubuntu 14.04.3, AWS EC2 c4.4xlarge (16core, 30GB)


Attachments: PNG File maxscale_idle_cpu.png     PNG File maxscale_leak.png    
Issue Links:
Problem/Incident
causes MXS-1224 Connections in CLOSE_WAIT state Closed
Sprint: 2016-17

 Description   

I have 3 MaxScale instances with the same configuration. One is always active (any one of the 3); the other 2 are standby. Every 10 seconds or so, information is checked through the MaxInfo JSON listener. I've noticed that only idle instances exhibit what appears to be a memory leak caused by MaxInfo sockets not closing. The active instance, which is also running MaxInfo and having its JSON listener queried, doesn't have this problem. If a new instance is made active and the former active instance becomes idle, the former instance eventually starts to exhibit the problem.

We have a monit daemon on each MaxScale instance that checks the availability of MaxInfo's /status URI, which is why the spike in the attached image drops: that's monit restarting the MaxScale service. We also have a curl request that runs periodically to gather status information. If monit is off and MaxScale reaches this high-memory state, the curl requests just hang forever until MaxScale is restarted.
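A polling setup like the one described above can be hardened against exactly this hang by giving curl explicit timeouts. A minimal sketch, assuming the localhost:8003/status listener mentioned elsewhere in this ticket (the function name and timeout values are mine, not from the reporter's setup):

```shell
# Check the MaxInfo /status endpoint with hard timeouts so a socket that
# is never closed cannot leave curl processes hanging and piling up.
check_maxinfo() {
    # $1 = URL of the MaxInfo status listener
    if curl --silent --connect-timeout 2 --max-time 5 "$1" > /dev/null 2>&1; then
        echo "maxinfo OK"
    else
        echo "maxinfo FAILED (timeout or error)"
    fi
}

check_maxinfo "http://localhost:8003/status"
```

--max-time bounds the whole transfer, so a response body that arrives but whose socket never closes fails fast instead of hanging the poller.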

I have reason to believe this is a problem specific to our heavy use of MaxInfo, simply because it started to happen when we started using MaxInfo, but it could also be a coincidence. I have not yet tried to reproduce this with a newer version of MaxScale, but I plan to. I've found this behavior on both versions 1.4.1 and 1.4.3.



 Comments   
Comment by Timofey Turenko [ 2016-06-23 ]

I can't reproduce it.

My test sends 'status' requests to the JSON listener in a loop. After 1000 iterations I do not see any increase in memory consumption. There is no load on MaxScale.
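A reproduction loop along these lines could be sketched as follows; the comment only says 'status' requests in a loop, so the URL (the localhost:8003/status listener reported elsewhere in this ticket), the iteration count, and the function name are assumptions:

```shell
# Repro attempt: send repeated status requests to the MaxInfo JSON
# listener, then check whether MaxScale's memory use has grown.
poll_status() {
    # $1 = URL of the status listener, $2 = number of requests
    i=0
    while [ "$i" -lt "$2" ]; do
        curl --silent --max-time 5 "$1" > /dev/null 2>&1
        i=$((i + 1))
    done
    echo "completed $2 requests"
}

poll_status "http://localhost:8003/status" 1000
# Memory can be compared before/after with e.g.: ps -C maxscale -o rss=
```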

Comment by Mathew Hornbeek [ 2016-07-15 ]

Thanks for looking at this, Timofey. After examining the issue more closely, I have more details:

I've upgraded the instances to 1.4.3 and see the same issue. It doesn't appear that there's really a memory leak in MaxScale itself; the memory growth is a symptom of the problem interacting with another service running on the machine. MaxScale's CPU usage seems to gradually increase over the course of about 6 hours while idle. Eventually, 7 of the 8 cores I have configured for these instances become completely taxed. I've attached a screenshot of htop showing the result after an instance has been idle all night. You'll see that there are many pending curl requests; these are what consume the memory as they run. Once the CPU usage hits high levels (again, while not accepting MySQL queries), requests to MaxInfo simply hang, which causes the curl tasks to build up until all of the memory is consumed.

It may be worth noting that about once a minute, MaxScale's CPU usage drops to 0 for about 1 second and then climbs right back up to 100%.

Comment by Mathew Hornbeek [ 2016-08-08 ]

I have more information about this issue.

I've noticed that on a normally behaving instance of MaxScale with log_info enabled, querying the MaxInfo status URI (in my case, localhost:8003/status) makes MaxScale log "Started session [0] for MaxInfo service" followed by "Stopped MaxInfo client session [137]". On a badly behaving instance, all that is logged is "Started session [0] for MaxInfo service". With this in mind I checked the open sockets with ss, and found that MaxScale is holding TCP sockets open in CLOSE-WAIT and never closing them. Running "curl localhost:8003/status" returns the entire expected response body, but (without setting a timeout for the request, of course) the curl request hangs because the socket never closes.

I have yet to find exactly what triggers this, except that it only happens on an idle instance.
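The CLOSE-WAIT check described in the comment above can be expressed as an ss one-liner. A sketch, assuming port 8003 is the status listener as in this comment (the helper name is mine; the count degrades to 0 if ss is unavailable or nothing is stuck):

```shell
# Count sockets stuck in CLOSE-WAIT on the MaxInfo listener port.
# A steadily growing number here matches the leak described above.
count_close_wait() {
    # $1 = local port of the MaxInfo listener
    ss -tn state close-wait "( sport = :$1 )" 2>/dev/null \
        | tail -n +2 | wc -l
}

count_close_wait 8003
```

Watching this count alongside the curl poller would show whether each unanswered request leaves another socket behind.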

Comment by markus makela [ 2016-09-06 ]

This is most likely caused by a spinlock not being released under certain conditions. This has been fixed in 2.0.1.

Comment by Marco Menzel [ 2017-04-10 ]

Hello, will there be a fix for 1.4.x?

Generated at Thu Feb 08 04:01:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.