[MXS-1050] Long-lived persistent connections slow down maxscale Created: 2016-12-09  Updated: 2017-01-03  Resolved: 2017-01-03

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 2.0.1
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Mikko Mensonen Assignee: Esa Korhonen
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Debian 8


Attachments: PNG File load-pinpoint=1480622399,1481272991.png     PNG File memory-pinpoint=1480239021,1483090221.png    

 Description   

Hello,

As discussed previously, Markus asked me to open an issue, as this may be a performance bug.

We recently increased the connection idle timeouts in our client-side connection pools from low figures (10-15 minutes) to something more reasonable, e.g. 2-3 hours, in order to cope with traffic spikes, with the pool size maxing out at 3k-4k total persistent connections.

This caused MaxScale to slow down considerably the longer the connection pools stayed open, with client response times increasing from under 10 ms to 1000 ms or more. During this, the actual queries/sec and DB load did not increase at all; the longer the persistent connection pools stayed open, the slower MaxScale got. The queries themselves ran as fast as usual; MaxScale simply took longer to get around to processing each query. Clearing the persistent pool from the client side immediately fixed the problem.

This was with a 2-CPU machine, with poll_sleep = 100 and non_blocking_polls = 10, epoll stats were showing

No. of epoll cycles:                           532253223
No. of epoll cycles with wait:                         27676568
No. of epoll calls returning events:           31331453
No. of non-blocking calls returning events:    3802205
No. of read events:                            62667701
No. of write events:                           62785204
No. of error events:                           741
No. of hangup events:                          25823
No. of accept events:                          16595
No. of times no threads polling:               8
Current event queue length:                    3
Maximum event queue length:                    2563
No. of DCBs with pending events:               0
No. of wakeups with pending queue:             13685
No. of poll completions with descriptors:
    No. of descriptors  No. of poll completions
     1          20015023
     2          4643493
     3          2863016
     4          1377815
     5          773572
     6          465835
     7          297985
     8          201369
     9          144546
    >= 10           548744
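For reference, the polling parameters mentioned above belong in the global [maxscale] section of maxscale.cnf. A minimal sketch (the thread count is an assumption for a 2-CPU machine, not taken from the report):

```ini
# maxscale.cnf — global section (sketch, not the reporter's full config)
[maxscale]
threads=2                # one worker thread per CPU is a common starting point
poll_sleep=100           # ms to sleep when a blocking poll returns no events
non_blocking_polls=10    # non-blocking polls before falling back to a blocking wait
```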

This particular instance was still running 2.0.1, but has since been upgraded to 2.0.2. As a temporary workaround, I have simply increased the number of CPUs on the virtual machine and am now observing what happens.



 Comments   
Comment by Mikko Mensonen [ 2016-12-09 ]

Perhaps to illustrate the issue from another point of view, here is a graph of the load average on the maxscale server during this week:

The moment the load starts to increase rapidly is when the connection pools were changed from a 12-minute idle timeout to 2 hours. The even higher peak is when the timeout had been increased to 3 hours and there was more traffic than usual (around 4k open connections vs. the normal 1-2k). Each drop in load corresponds to me testing and resetting the client connection pools back to short timeouts, so to me it looks like there is a definite correlation.

Comment by Esa Korhonen [ 2016-12-20 ]

Hello, Mikko.
I'm trying to recreate this issue and I have some questions:
1) Which router(s) and filters were used in MaxScale?
2) What kind of SQL statements were sent? Was it mostly read-only, or was data written often?

Comment by Mikko Mensonen [ 2016-12-20 ]

Hey Esa,

  1. A readwritesplit router is in use, with no filters at all. The R/W splitter routes to a three-node (Percona) Galera cluster, and there is also a galeramon monitor active.
  2. The SQL statements are about 22% writes and 78% reads. They are all very short transactions; 99% of all queries, both reads and writes, run in less than 20 ms. It's a sustained rate of about 350 queries/second on each server.

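A minimal configuration matching this description might look like the following. The section names, server names, and credentials are placeholders for illustration, not taken from the report:

```ini
# Sketch: readwritesplit service over a three-node Galera cluster
[Galera Monitor]
type=monitor
module=galeramon
servers=node1,node2,node3
user=monitor_user
passwd=monitor_pw

[RW Split Router]
type=service
router=readwritesplit
servers=node1,node2,node3
user=maxscale_user
passwd=maxscale_pw
```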
Comment by Esa Korhonen [ 2016-12-23 ]

Are the writes just updates, inserts and deletes, or are there more exotic commands being given regularly?

Comment by Esa Korhonen [ 2016-12-28 ]

Also, did the memory consumption of MaxScale increase over time? This issue may be due to the session variables which the readwritesplit router stores in case it needs to switch backend servers.
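To illustrate the mechanism being described, here is a rough sketch of why an unbounded session-command history grows with connection lifetime. This is a hypothetical model, not MaxScale's actual code: the idea is that the router records every session-state command so it can replay them if it has to move the session to a new backend.

```python
# Hypothetical sketch of an unbounded session-command history.
# Not MaxScale's implementation — only an illustration of the mechanism.

class Session:
    def __init__(self):
        # Session-state commands (e.g. "SET @x = 1") are stored so they
        # can be replayed on a replacement backend.
        self.sescmd_history = []

    def execute(self, statement):
        if statement.upper().startswith("SET"):
            # With no history limit, a long-lived connection that keeps
            # issuing session commands grows this list without bound.
            self.sescmd_history.append(statement)

    def replay_on_new_backend(self):
        # On a backend switch, the entire history must be re-sent.
        return list(self.sescmd_history)


session = Session()
for i in range(10_000):  # hours of traffic on one persistent connection
    session.execute(f"SET @counter = {i}")

print(len(session.sescmd_history))  # the history keeps growing with session lifetime
```

With short-lived pooled connections the history is discarded on every reconnect, which is consistent with the memory dropping each time the pools were reset.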

Comment by Mikko Mensonen [ 2016-12-30 ]

There are no exotic commands being run regularly. Maintenance and administrative commands don't get run through maxscale anyway and the client applications only have basic permissions for select/insert/update.

About the memory consumption: actually yes, I can see memory consumption shoot rapidly upwards immediately after making the changes for the connection pools, see graph below:

Comment by Esa Korhonen [ 2017-01-03 ]

Indeed, this looks like it's linked to the session variable storage. Unfortunately, there is no perfect solution to this at the moment.

A partial solution is to disable the session variable storage entirely, which should stop the growing memory (and most likely also CPU) consumption. The downside is additional load on the master server in case slave servers fail often. To disable the storage, add "router_options=disable_sescmd_history=true" to the router's configuration section.
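In the service section of maxscale.cnf, the suggested change would look roughly like this (the section name and server list are hypothetical):

```ini
[RW Split Router]
type=service
router=readwritesplit
servers=node1,node2,node3
user=maxscale_user
passwd=maxscale_pw
# Stop storing session commands for replay on a backend switch:
router_options=disable_sescmd_history=true
```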

For more information on the router settings, please see https://github.com/mariadb-corporation/MaxScale/blob/2.0/Documentation/Routers/ReadWriteSplit.md

Comment by Mikko Mensonen [ 2017-01-03 ]

Okay, giving that a go. I think we can live with the risk of additional load on the master; an unscheduled slave failure is something we have never experienced so far.

After changing the router options (and a service restart; reloading the config file didn't seem to help), we're immediately back at low CPU (<20%) and low memory (<1%) usage. Going to keep an eye on this for a while, but it looks promising.

Comment by Johan Wikman [ 2017-01-03 ]

Closing, as this does not seem to be a bug in MaxScale itself.

In the longer term we need to handle the maintenance of session state differently, so that long-lived persistent connections do not cause excessive memory consumption as a side effect; for instance, by fetching the state from the master when a new slave is taken into use.

If the problem does not disappear, please reopen this issue or create a new bug report.

Generated at Thu Feb 08 04:03:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.