[MXS-1050] Long-lived persistent connections slow down maxscale Created: 2016-12-09 Updated: 2017-01-03 Resolved: 2017-01-03 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | N/A |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Mikko Mensonen | Assignee: | Esa Korhonen |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Debian 8 |
||
| Attachments: |
|
| Description |
|
Hello, as described here, Markus asked me to open an issue as this may be a performance bug. We recently increased the connection idle timeouts on our client-side connection pools from low values (10-15 minutes) to something more reasonable, e.g. 2-3 hours, in order to cope with traffic spikes; the pool size maxes out at 3k-4k persistent connections in total. As a result, MaxScale started to slow down considerably the longer the connection pools stayed open, with client response times increasing from under 10 ms to 1000 ms or longer. During this, the number of queries per second and the actual DB load did not increase at all; it was simply that the longer the persistent connection pools stayed open, the slower MaxScale got. The queries themselves ran as fast as usual, MaxScale simply took longer to get around to processing them. Clearing the persistent pool from the client side immediately fixed the problem. This was on a 2-CPU machine with poll_sleep = 100 and non_blocking_polls = 10, epoll stats were showing
This particular instance was still running 2.0.1, but has since been upgraded to 2.0.2. As a temporary workaround I have simply increased the number of CPUs for the virtual machine and am now observing what happens. |
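For reference, the poller settings mentioned above would live in the `[maxscale]` section of the configuration file. A minimal sketch, assuming defaults elsewhere (`threads=2` is an assumption inferred from the 2-CPU machine, not stated in the report):

```ini
# /etc/maxscale.cnf (excerpt) - poller settings as reported above
[maxscale]
threads=2                 # assumed to match the 2-CPU virtual machine
poll_sleep=100            # ms a poll thread sleeps when there is no work
non_blocking_polls=10     # non-blocking polls before a blocking wait
```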
| Comments |
| Comment by Mikko Mensonen [ 2016-12-09 ] |
|
Perhaps to illustrate the issue from another point of view, here is a graph of the load average on the MaxScale server during this week: The moment it starts to increase rapidly is when the connection pools were changed from a 12-minute idle timeout to 2 hours. The even higher peak is when the timeout had been increased to 3 hours and there was more traffic than usual (around 4k open connections vs. the normal 1-2k). Each drop in load corresponds to me testing and resetting the client connection pools back to short timeouts, so to me it looks like there is a definite correlation. |
| Comment by Esa Korhonen [ 2016-12-20 ] |
|
Hello, Mikko. |
| Comment by Mikko Mensonen [ 2016-12-20 ] |
|
Hey Esa,
|
| Comment by Esa Korhonen [ 2016-12-23 ] |
|
Are the writes just updates, inserts and deletes, or are there more exotic commands being given regularly? |
| Comment by Esa Korhonen [ 2016-12-28 ] |
|
Also, did the memory consumption of MaxScale increase with time? This issue may be due to session variables which the readwritesplit-router stores in case it needs to switch backend servers. |
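To illustrate the mechanism Esa is describing, here is a hypothetical sketch (not MaxScale source code) of why per-session command storage grows without bound on long-lived connections: the readwritesplit router records every session-state command so it can replay them if it has to connect to a different backend mid-session, and the history is only freed when the client connection closes.

```python
class Session:
    """Toy model of a router session that records session-state commands."""

    def __init__(self):
        self.sescmd_history = []  # grows for the lifetime of the connection

    def execute(self, sql):
        # Session-state commands (SET, USE, ...) must be remembered so they
        # can be replayed on a replacement backend; other statements are not.
        if sql.upper().startswith(("SET ", "USE ")):
            self.sescmd_history.append(sql)
        # ... routing of the statement to a backend would happen here ...

    def replay_on_new_backend(self):
        # Everything recorded so far is re-sent to the new server.
        return list(self.sescmd_history)

# A pooled connection held open for hours keeps accumulating history:
s = Session()
for i in range(10_000):
    s.execute(f"SET @var = {i}")
print(len(s.sescmd_history))  # 10000 commands retained by one session
```

With thousands of pooled connections each held open for hours, this per-session accumulation would match the steadily rising memory (and replay-related CPU) usage seen in the graphs.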
| Comment by Mikko Mensonen [ 2016-12-30 ] |
|
There are no exotic commands being run regularly. Maintenance and administrative commands don't get run through MaxScale anyway, and the client applications only have basic permissions for select/insert/update. About the memory consumption: actually yes, I can see memory consumption shoot up rapidly immediately after making the changes to the connection pools, see graph below: |
| Comment by Esa Korhonen [ 2017-01-03 ] |
|
Indeed, this looks like it's linked to the session variable storage. Unfortunately, there is no perfect solution to this at the moment. A partial solution is to disable the session variable storage entirely, which should stop the increasing memory (and most likely also the cpu) consumption. This does have the downside of causing additional load on the master server in case slave servers fail often. To disable the storage, add "router_options=disable_sescmd_history=true" to the router configuration section. For more information on the router settings, please see https://github.com/mariadb-corporation/MaxScale/blob/2.0/Documentation/Routers/ReadWriteSplit.md |
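A minimal sketch of such a service section, with the option applied. Only `router_options=disable_sescmd_history=true` comes from the comment above; the service name, server list, and credentials are placeholders:

```ini
# Example readwritesplit service with session command history disabled
[Read-Write-Service]
type=service
router=readwritesplit
router_options=disable_sescmd_history=true
servers=server1,server2     # placeholder server names
user=maxuser                # placeholder credentials
password=maxpwd
```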
| Comment by Mikko Mensonen [ 2017-01-03 ] |
|
Okay, giving that a go. I think we can live with the risk of additional load on the master; an unscheduled slave failure is something we have so far never experienced. After changing the router options (and a service restart; reloading the config file alone didn't seem to help), we're immediately back at low CPU (<20%) and low memory (<1%) usage. Going to keep an eye on this for a while, but it looks promising. |
| Comment by Johan Wikman [ 2017-01-03 ] |
|
Closing, as this does not seem to be a direct bug in MaxScale. In the longer term we need to handle the maintenance of the session state in a different way, so that long-lived persistent connections do not cause excessive memory consumption as a side effect, for instance by fetching the state from the master when a new slave is taken into use. If the problem did not disappear, please reopen this or create a new bug report. |