[MXS-3855] incorrect routing of selects with disable_master_failback
Created: 2021-11-05 | Updated: 2021-11-17 | Resolved: 2021-11-17

| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | galeramon, readwritesplit |
| Affects Version/s: | 6.1.4 |
| Fix Version/s: | N/A |
| Type: | Bug |
| Priority: | Major |
| Reporter: | Thomas Benkert |
| Assignee: | markus makela |
| Resolution: | Not a Bug |
| Votes: | 0 |
| Labels: | None |
| Environment: | centos7, kernel 3.10.0-1127.el7.x86_64 |
| Description |

Description: With disable_master_failback=true set on a readwritesplit router, SELECT queries are not distributed correctly after the master goes down and comes back up again.

How to repeat:

galera cluster config:

maxscale config:

1. Do a `tail -f general.log` on all servers.

Expected behaviour: after the host comes back up, the SELECT queries should again be evenly distributed across all slaves in a round-robin fashion.

Solution: Restarting the MaxScale service on the current master fixes the problem; restarting the service on the slaves does not. |
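Neither configuration file was included in the report. Purely as an illustration of the kind of setup described (a galeramon monitor with disable_master_failback=true feeding a readwritesplit service), a minimal MaxScale configuration might look like the sketch below; the section names, server list, port, and credentials are placeholders, not the reporter's actual values:

```ini
# Hypothetical minimal maxscale.cnf sketch for the setup described above.
# Section names, servers, port, and credentials are placeholders.
# The [server1]..[server3] definitions are omitted for brevity.

[Galera-Monitor]
type=monitor
module=galeramon
servers=server1,server2,server3
user=maxuser
password=maxpwd
disable_master_failback=true

[RW-Split-Router]
type=service
router=readwritesplit
servers=server1,server2,server3
user=maxuser
password=maxpwd

[RW-Split-Listener]
type=listener
service=RW-Split-Router
protocol=MariaDBClient
port=4006
```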
| Comments |
| Comment by markus makela [ 2021-11-16 ] |

Are you doing queries over only one connection? If so, this would be expected behavior: readwritesplit never truly round-robins the reads. Each query is routed according to the selection criteria, and in the case of a tie the request goes to the server that has processed the smallest amount of traffic. This just happens to look like round-robin behavior when MaxScale is started with a static set of servers and only one client is issuing requests.

There are also a couple of internal factors that affect the load-balancing decisions. The score that readwritesplit calculates for each candidate server includes some hard-coded factors intended to avoid opening extra connections when valid but sub-optimal candidates are available. This is based on the assumption that using an already open TCP connection with a slightly worse score is preferable to opening a new TCP connection to the "perfect" candidate. At the time of writing this factor is:
This means that with two candidates, one with a score of 95 (by default the score is the number of ongoing queries) to which we are already connected, and another to which we are not, the latter must have a score of 55 or lower before sessions start favoring it over the already-connected one. Having measured and tested this with about 100 concurrent clients executing SELECT SLEEP(1) against two candidates, this translates to a roughly 60/40 split favoring the already-connected server. This suggests the factor could be lowered, as the other server ends up doing more work. However, even if this bias value were changed, it wouldn't change the behavior with only one client connection, since the open connection would always be preferred. |
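As an illustration only, the selection rule described in this comment can be sketched as follows. The actual bias value is not quoted in the issue; the `UNCONNECTED_PENALTY` below is a hypothetical additive penalty chosen so that the 95-vs-55 example above works out, and all names are invented for the sketch rather than taken from MaxScale's real code.

```python
# Hypothetical sketch of the candidate-selection rule described above.
# NOT MaxScale's actual implementation: the penalty is inferred from the
# 95-vs-55 example in the comment, and all names are invented.

UNCONNECTED_PENALTY = 40  # assumed: a connected server at 95 beats an unconnected one at 56..95


def effective_score(ongoing_queries, connected):
    """Score a candidate; unconnected servers pay a fixed penalty."""
    score = ongoing_queries  # by default the score is the number of ongoing queries
    if not connected:
        score += UNCONNECTED_PENALTY
    return score


def pick_candidate(candidates):
    """Return the candidate with the lowest effective score.

    candidates: list of (name, ongoing_queries, connected) tuples.
    """
    return min(candidates, key=lambda c: effective_score(c[1], c[2]))


# A connected server at score 95 still beats an unconnected one at 56:
print(pick_candidate([("connected", 95, True), ("other", 56, False)])[0])  # connected
# Only at 55 or below does the unconnected server start to compete:
print(pick_candidate([("other", 54, False), ("connected", 95, True)])[0])  # other
```

With a single client connection the connected candidate always wins this comparison, which matches the observation that the bias cannot explain away one-connection behavior.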
| Comment by markus makela [ 2021-11-16 ] |

tbenkert One more question: did you reconnect the client after you restarted the server? |
| Comment by markus makela [ 2021-11-17 ] |

I'll close this as Not a Bug since it looks like expected behavior. |