[MXS-2213] Growing memory consumption with 2.2.17 Created: 2018-12-05  Updated: 2018-12-22  Resolved: 2018-12-12

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 2.2.17
Fix Version/s: 2.2.18

Type: Bug Priority: Critical
Reporter: Mikko Mensonen Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: Memory_leak
Environment:

Debian 8.6


Attachments: File maxscale.cnf     PNG File memory-pinpoint=1541166924,1544018124.png     PNG File memory-pinpoint=1543913724,1544021724.png     PNG File memory-pinpoint=1544103927,1544211927.png    
Issue Links:
Problem/Incident
causes MXS-2224 MaxScale 2.3.2 Memory leaks causes OOM Closed
Sprint: MXS-SPRINT-72

 Description   

We have just upgraded from 2.0.2 to 2.2.17 and immediately noticed growing memory consumption. On 2.0.2 it was constant at around 400MB (the instance had been running nonstop for about two years):

But as soon as we switched over to 2.2.17, consumption started rising and keeps rising until the OOM killer kicks in:

There has been no change in traffic, type of traffic, number of clients, connections, or the amount and types of queries, etc. We're running a readwritesplit service with the slave_selection_criteria=LEAST_CURRENT_OPERATIONS router option and galeramon; the configuration settings are unchanged since 2.0.2.
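For reference, a minimal sketch of that kind of readwritesplit + galeramon setup might look like the following fragment (server names, credentials, and the monitor interval are placeholders, not values taken from the attached maxscale.cnf):

```ini
# Hypothetical fragment, not the reporter's actual configuration.
[RW-Split-Router]
type=service
router=readwritesplit
servers=server1,server2,server3
user=maxuser
password=maxpwd
slave_selection_criteria=LEAST_CURRENT_OPERATIONS

[Galera-Monitor]
type=monitor
module=galeramon
servers=server1,server2,server3
user=maxuser
password=maxpwd
monitor_interval=2000
```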



 Comments   
Comment by markus makela [ 2018-12-07 ]

Possibly a duplicate of MXS-2183.

Comment by markus makela [ 2018-12-07 ]

Here are packages built from the tip of the 2.2 branch: http://max-tst-01.mariadb.com/ci-repository/2.2-dec7/mariadb-maxscale/

You can try to see if this fixes the issue.

Comment by Mikko Mensonen [ 2018-12-07 ]

Thanks, installed maxscale-2.2.18-1.debian.jessie.x86_64.deb on a separate (identical) instance and diverted traffic there.

I'm still seeing a slow but steady increase in memory consumption, but at this point it's hard to say if it's normal or not. 2.0.2 had this too after startup, but it then settled quickly at a reasonable amount, varying constantly between 200-400MB. I'll wait a few hours to see if this fluctuates at all or if it just keeps rising constantly.

Comment by Mikko Mensonen [ 2018-12-07 ]

I'd say this does not change anything. I still see steadily (even linearly) increasing memory consumption on 2.2.18. It never goes down even a bit, and within just 3 hours it has exceeded even the highest peaks that 2.0.2 ever produced, growing at an identical rate to 2.2.17.

Perhaps useful information: I left the 2.2.17 instance running, and even though it still has almost as many open connections as the 2.2.18 one, just with zero queries and no current operations, its memory consumption has stopped increasing completely and evened out. The consumption isn't dropping on its own, though.

Comment by markus makela [ 2018-12-07 ]

Thanks, this helps us a lot by telling there's still something we're not seeing. We'll continue our search and as soon as we find something, we'll report back.

We'll have to try to reproduce your environment completely, so, if possible, please upload your configuration with all sensitive information removed. Even though you described your setup with quite a high level of detail, we'd like to avoid any possibility of misinterpretation by using an identical configuration.

Another thing that would be very helpful is either a sample or a description of the workload going through MaxScale. Knowing what sort of queries, transactions and statements are being done speeds up the search by a significant amount.

Comment by Mikko Mensonen [ 2018-12-07 ]

maxscale.cnf attached with sensitive information masked

A further observation on the now idle 2.2.17 instance: as soon as all clients finally exited (persistent sessions timed out) and the number of open connections on MaxScale dropped to zero, the memory consumption did drop. In fact, it seems to have halved. In the graph below, the moment at around 11:00 where the consumption evens out is where I moved traffic away; the sudden drop is all clients timing out more or less simultaneously.

The workload going through this instance is a pretty mixed bunch of everything.

Mostly we're talking about a lot of very short, very quick, high-speed queries with practically no transactions and autocommit always enabled. The majority of queries run within 10-20ms, averaging (across three Galera nodes in total) about 800 queries/second, with peaks at about 1200/second. About 20% of these get routed to the write master; the remainder is split fairly evenly across the two read slaves (though MaxScale seems to favour one server slightly more). Of the 20% going to the write master, about a quarter are plain SELECTs with no write operations.

Connections are persistent and pooled from client side, averaging ~200 connections per server but occasionally peaking up to 1000 connections per server. Connections stay open indefinitely, but clients close them after 720 seconds of idling. MySQL itself has the 8 hour default idle timeout, but we never actually reach this. On average the "new connections per second" rate is practically zero, as connections often stay open for days or weeks.

Traffic throughput is about 160kb/s received and 100kb/s sent on the write master, and 300kb/s received and 200kb/s sent on the slaves. Occasional sustained peaks of up to 10mb/s on the slaves do occur, but at most once per week.

One of the Galera nodes is also a replication master. Galera itself, like the MySQL servers, is mostly idle and not really under any load. The MaxScale instance also has a very low load average (rarely more than 0.2), and MaxScale is pretty much stable at 30% CPU. Queries themselves are routed very fast; MaxScale stats show 99.99% of queued events and 99.94% of executed events completing under 100ms.

OS-wise, the MaxScale instances are virtual: Debian 8.6 with about 6GB RAM and 8 CPUs. Galera nodes are running Percona XtraDB Cluster 5.6.27 on hardware with Debian 8.2.

I can also look into providing an actual sample of the workload privately if you require one.

Comment by markus makela [ 2018-12-07 ]

Thanks for the extremely detailed description. I think this is enough and we can start trying to figure out why this is happening. One piece of information that would help rule something out is whether server-side prepared statements are in use.

Given that the memory consumption dropped once all connections were closed, I think that at least a part of this might be explained by something being kept in memory for too long. One test to verify this would be to see if memory usage always drops down to the same level after the connections close. If it does, we're not facing a true memory leak but something is not freed as early as it should be. My immediate guess would be the session command history but that is disabled by default and your configuration does not define it. If prepared statements are in use, the linear growth might indicate that some part of the prepared statement handle tracking is not clearing out the references of the mappings of client side handles to backend handles.

If server-side prepared statements are in use, this would be the most probable reason, as prepared statement routing was a new feature in 2.2. I believe it is less likely that an older part of the code base is responsible for this behavior, as the core functionality hasn't changed much between 2.1 and 2.2. One thing worth trying would be MaxScale 2.1, to see if it exhibits the same behavior. If it does, we know prepared statements are not the culprit; if it doesn't, it is again very likely that prepared statements are the cause.
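The suspected handle-tracking pattern can be sketched roughly as follows (a hypothetical illustration only; the class and method names are invented here and are not MaxScale's actual code):

```python
# Hypothetical sketch of the kind of bookkeeping described above: a map
# from client-side prepared-statement handles to per-backend handles.

class PsHandleMap:
    """Tracks which backend statement handle corresponds to a client handle."""

    def __init__(self):
        # client stmt id -> {backend name: backend stmt id}
        self.mapping = {}

    def on_prepare_ok(self, client_id, backend, backend_id):
        # COM_STMT_PREPARE succeeded on a backend: remember its handle.
        self.mapping.setdefault(client_id, {})[backend] = backend_id

    def on_stmt_close(self, client_id):
        # COM_STMT_CLOSE: if this removal is skipped, the map grows for
        # the whole lifetime of a pooled connection -- the leak pattern
        # hypothesized above.
        self.mapping.pop(client_id, None)
```

On a long-lived pooled connection that prepares and closes thousands of statements, forgetting the removal in `on_stmt_close` would make the map grow linearly, matching the steady growth described.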

We'll start looking at whether this is actually caused by unbounded memory usage for sessions as it seems to be a likely candidate at this point. We'll again let you know once there is progress and we can most likely provide packages for testing since it appears you can easily divert traffic to track the memory use of each version.

Comment by Mikko Mensonen [ 2018-12-07 ]

No server-side prepared statements in use, all statements are prepared client-side.

About connections closing: I can confirm that connections being closed in general does not have any effect - just today the connection count multiplied considerably (from an average of 150 to nearly 800) and then dropped back down after 3-4 hours, yet memory consumption kept growing linearly. We also observed connections fluctuating constantly on 2.2.17, yet memory consumption never dropped. Only when all connections were closed did the consumption halve. Again, though, the growth in consumption stopped immediately as soon as the query rate went down to zero.

Also worth mentioning: when started fresh, with zero connections, the instance idled at only a few hundred MB of memory, but now that the instance is again idling with zero connections, memory has not dropped anywhere near the initial level; it's still close to 2GB.

Also something perhaps trivial: while 2.2.17's consumption sometimes grew faster and sometimes slower, 2.2.18's is really just a straight line; the rate of memory growth is constant and doesn't vary at all.

We had already disabled session command history on 2.0.2, as we discovered in MXS-1050 that long-lived persistent connections caused a significant performance drop.

I can consider giving 2.1 a go in production next week, but only for a short period to see whether it leaks - our reason for upgrading to 2.2 was MXS-1397, which was not yet fixed in 2.1, so I don't want to stick with it for too long.

Comment by markus makela [ 2018-12-08 ]

Looking into the code with a different mindset, I did find two problems:

  1. The binary prepared statement handles weren't being removed from the set of active handles when they were closed
  2. Session command response map wasn't properly pruned if all servers responded quickly enough

Since you aren't using server-side prepared statements, the only explanation for a steady growth in memory consumption is the lack of pruning for session command responses. This makes sense since the pooled connections live for a very long time and closing all connections frees all of the memory that was used. I believe it is expected that not all of the memory is released to the OS as the default malloc implementation appears to only return the memory when very large chunks are freed.
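As an illustration of the second bug, the pruning behaviour can be sketched like this (all names are hypothetical, chosen for this example; this is not MaxScale code):

```python
# Illustrative sketch: a session-command response tracker. The bug
# described above corresponds to the prune=False behaviour, where
# entries stay in the map even after every backend has replied.

class SessionCommandTracker:
    def __init__(self, backends, prune=True):
        self.backends = set(backends)
        self.responses = {}  # command id -> set of backends that replied
        self.prune = prune

    def record_response(self, cmd_id, backend):
        replied = self.responses.setdefault(cmd_id, set())
        replied.add(backend)
        # The fix: drop the entry once all backends have responded.
        if self.prune and replied == self.backends:
            del self.responses[cmd_id]

leaky = SessionCommandTracker({"srv1", "srv2"}, prune=False)
fixed = SessionCommandTracker({"srv1", "srv2"}, prune=True)

# A long-lived pooled connection issuing many session commands:
for cmd in range(1000):
    for srv in ("srv1", "srv2"):
        leaky.record_response(cmd, srv)
        fixed.record_response(cmd, srv)

print(len(leaky.responses))  # 1000: grows for the life of the connection
print(len(fixed.responses))  # 0: pruned as soon as all backends replied
```

This matches the observed symptoms: linear growth proportional to the query rate, no release while connections stay open, and a large drop only when the connections (and their trackers) are destroyed.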

I've built packages from commit 39b55661016628f437477da58afdf74df72ea32d that should correctly clear out the memory: http://max-tst-01.mariadb.com/ci-repository/MXS-2.2-markusjm-dec8/mariadb-maxscale/

If you can, please try and report back whether this helps with the memory growth issues.

Comment by Mikko Mensonen [ 2018-12-08 ]

Okay this is much better. After two hours of running this version, memory consumption has been sitting at exactly 0.2% and not increasing at all. I'll keep an eye on this over the weekend but so far looking good.

Comment by Mikko Mensonen [ 2018-12-10 ]

I can confirm that this version does seem to solve the issue. After 48 hours of traffic, still looking at only 0.2 - 0.4% memory consumption. And most importantly, the consumption has also periodically gone down, not just up.

Generated at Thu Feb 08 04:12:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.