[MXS-478] Load distribution with Read/Write router diverges over time Created: 2015-11-17  Updated: 2016-03-01  Resolved: 2016-03-01

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 1.2.1
Fix Version/s: 1.3.0

Type: Bug Priority: Major
Reporter: Keith Swett Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

Ubuntu 14.04
Percona Server 5.6 Galera Cluster backend.


Issue Links:
Relates
relates to MXS-480 Readwritesplit defaults cause connect... Closed

 Description   

Over a 24-48 hour period, the distribution of load across our 3-node cluster diverges.

After a fresh restart of MaxScale everything lines up again.

When the odd distribution is present, examining the server stats with dbadmin ("show servers") shows a "Current no. of operations:" value for the affected machine(s) that is significantly higher than the rest, and the number doesn't change over time. Examining the process list on the backend server shows no long-running queries that would explain the stuck number of operations.



 Comments   
Comment by markus makela [ 2015-11-19 ]

This could possibly be related to the fact that readwritesplit doesn't truly round-robin across the slaves when it is configured with slave_selection_criteria=LEAST_CURRENT_OPERATIONS. Instead of sending one query to each slave in turn, it sends the query to the first available slave with the fewest current operations. Under low enough load, the distribution skews towards the first slave found, which is the last slave listed in the servers parameter of the service.
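To illustrate the selection behaviour described above, here is a minimal Python sketch (not MaxScale's actual C implementation; server names and the iteration order are assumptions). The key point is that a "first server with the fewest operations" rule breaks ties the same way every time, so under low load one slave wins repeatedly instead of the load round-robining:

```python
def pick_slave(slaves):
    """Return the first available slave with the fewest current operations.

    min() returns the FIRST element among equal minimums, so with all
    counters at zero the same server is chosen on every call.
    """
    candidates = [s for s in slaves if s["available"]]
    return min(candidates, key=lambda s: s["current_ops"])

# Hypothetical candidate list; the comment above suggests the skewed
# slave is the one that happens to come first in iteration order.
slaves = [
    {"name": "db3", "available": True, "current_ops": 0},
    {"name": "db2", "available": True, "current_ops": 0},
]

# With no in-flight operations, every pick lands on db3.
picks = [pick_slave(slaves)["name"] for _ in range(5)]
```

A true round-robin would rotate through tied candidates; this rule only diverges from round-robin when the operation counters are all equal, which is exactly the low-load case described in the comment.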

Comment by Keith Swett [ 2015-11-19 ]

Shouldn't the system eventually come out of it with higher load? Or will it continue to build over time? I'll make a note of which machine is most affected next time it happens and see if it matches the config (last slave server).

And if this is the case, wouldn't we see it return to this state after a restart without changing the incoming load?

Comment by markus makela [ 2015-11-19 ]

One thing to note is the max_slave_connections parameter. This should be set to 100% to have an even distribution among the slaves.

The current defaults for readwritesplit are somewhat confusing: it uses LEAST_CURRENT_OPERATIONS but only one slave connection per session, which can end up causing a significant skew towards one slave server.
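For reference, a service section along these lines should give an even distribution (a minimal sketch of a maxscale.cnf fragment; the section name, server names, and credentials are placeholders):

```ini
# Hypothetical readwritesplit service definition.
# max_slave_connections=100% lets each session use all slaves,
# which evens out the distribution across them.
[Read-Write-Service]
type=service
router=readwritesplit
servers=db1,db2,db3
user=maxuser
passwd=maxpwd
max_slave_connections=100%
slave_selection_criteria=LEAST_CURRENT_OPERATIONS
```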

Comment by Keith Swett [ 2015-11-19 ]

Confirmed, it is set to 100%, and for the majority of the time it balances everything beautifully. We're really happy with it when it's working!

Comment by Keith Swett [ 2015-11-22 ]

Just happened again, noticed something interesting:

Server 0x1aef390 (db2)
Server: 192.168.x
Status: Slave, Synced, Running
Protocol: MySQLBackend
Port: 3306
Server Version: 5.6.26-74.0-56-log
Node Id: 2
Master Id: -1
Repl Depth: 0
Number of connections: 50914257
Current no. of conns: 16274
Current no. of operations: 1
Server 0x1aee070 (db1)
Server: 192.168.x
Status: Master, Synced, Running
Protocol: MySQLBackend
Port: 3306
Server Version: 5.6.26-74.0-56-log
Node Id: 0
Master Id: -1
Repl Depth: 0
Number of connections: 50922206
Current no. of conns: 18262
Current no. of operations: 0
Server 0x1aedf60 (db3)
Server: 192.168.x
Status: Slave, Synced, Running
Protocol: MySQLBackend
Port: 3306
Server Version: 5.6.26-74.0-56-log
Node Id: 1
Master Id: -1
Repl Depth: 0
Number of connections: 50905554
Current no. of conns: 19076
Current no. of operations: 8

Current number of connections for each host seems really high.

db3 is basically stuck out of rotation (little to no load is directed at it).

Comment by Keith Swett [ 2015-11-23 ]

Just caught it again, this time right away. It would seem that the "Current no. of operations" is stuck at 10 and never goes below that value. I'm going to spin up an alternate MaxScale host and migrate traffic off of this one. We'll see if the operations zero out, or if they stick at 10.

Comment by Keith Swett [ 2015-11-24 ]

As suspected, the following is with NO active connections to the balancer:

Server 0x1aef750 (db2)
Server: 192.x
Status: Slave, Synced, Running
Protocol: MySQLBackend
Port: 3306
Server Version: 5.6.26-74.0-56-log
Node Id: 2
Master Id: -1
Repl Depth: 0
Number of connections: 8149874
Current no. of conns: 1754
Current no. of operations: 10
Server 0x1aee430 (db1)
Server: 192.x
Status: Master, Synced, Running
Protocol: MySQLBackend
Port: 3306
Server Version: 5.6.26-74.0-56-log
Node Id: 0
Master Id: -1
Repl Depth: 0
Number of connections: 8211952
Current no. of conns: 2667
Current no. of operations: 0
Server 0x1aee320 (db3)
Server: 192.x
Status: Slave, Synced, Running
Protocol: MySQLBackend
Port: 3306
Server Version: 5.6.26-74.0-56-log
Node Id: 1
Master Id: -1
Repl Depth: 0
Number of connections: 8212656
Current no. of conns: 4469
Current no. of operations: 0

I think the accounting is messed up.

Comment by markus makela [ 2015-12-01 ]

Not directly related to this, but I managed to get MaxScale to show one open session even though no network connections were open. It could have the same root cause as this issue.

So far I haven't found a way to reproduce it.

Comment by Johan Wikman [ 2015-12-01 ]

Downgraded to major, because we first need to investigate and fully understand what is going on. A clear candidate for 1.3.1.

Comment by Keith Swett [ 2015-12-01 ]

Any additional information that I can provide to assist?

Comment by markus makela [ 2015-12-10 ]

If you can find a somewhat reliable way of triggering this, we might be able to reproduce it on our side.

Things to look for:
Abnormal queries (LOAD DATA LOCAL INFILE etc.)
Network errors
Client-side errors (Client closes connection implicitly etc.)

Comment by Keith Swett [ 2015-12-19 ]

I've got a vagrant setup that reproduces the biggest issue we're seeing, take a look at:

https://github.com/wheniwork/maxscale-bughunting

and let me know if you have any issues.

Comment by markus makela [ 2015-12-20 ]

I tested the vagrant setup and was able to reproduce the servers' connection counters being higher than they should be, but I wasn't able to reproduce the current number of operations being too large. When I ran the same test with 1.3.0-beta, I wasn't able to reproduce the wrong connection count either.

Could you test this with the 1.3.0-beta version of MaxScale? You can get it here: http://maxscale-jenkins.mariadb.com/ci-repository/1.3.0-beta-debug/

Comment by markus makela [ 2016-02-04 ]

keith.swett Have you had a chance to try it with the 1.3.0 beta version of MaxScale?

Comment by Keith Swett [ 2016-02-05 ]

Yes, we did a test run in our staging environment. The connection count did remain correct, but we experienced a much higher rate of app disconnects (it was too unstable for us to use).

Comment by markus makela [ 2016-03-01 ]

Closing as fixed since this was not reproducible with 1.3.0.

Generated at Thu Feb 08 03:59:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.