[MXS-3088] Support for replication lag monitoring/on-demand availability Created: 2020-07-22 Updated: 2022-09-08 Resolved: 2022-09-08 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | readconnroute |
| Affects Version/s: | None |
| Fix Version/s: | N/A |
| Type: | New Feature | Priority: | Major |
| Reporter: | Daniel Almeida (Inactive) | Assignee: | Todd Stoffel (Inactive) |
| Resolution: | Won't Do | Votes: | 1 |
| Labels: | None | ||
| Description |
|
Hello folks,

We would like MaxScale to prevent new query requests from being sent to a replica whenever one of the two thresholds below is met:

1. Replication lag greater than X seconds. Queries answered by replicas lagging too far behind their primary server can return stale/wrong data, so such replicas should stop receiving new queries.
2. Number of active queries greater than Y queries. Clients would like the ability to stop sending queries to a server once a given number of active queries has been reached.

Both thresholds should be independent of each other, and the settings should be dynamic, with no restart required.

Scenario: if threshold #1 or #2 is met, new queries are not sent to replicas matching that threshold. |
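For comparison, readwritesplit already exposes a replication-lag cutoff of this kind. A minimal configuration sketch, assuming the parameter name `max_slave_replication_lag` as used in MaxScale releases around the time of this ticket (server and service names here are illustrative, not from the report):

```ini
[Splitter-Service]
type=service
router=readwritesplit
servers=server1,server2
user=maxuser
password=maxpwd
# Replicas lagging more than 30 seconds behind the primary
# are excluded from read query routing.
max_slave_replication_lag=30
```

Since service parameters can be altered at runtime, a change like this should not require a restart; with maxctrl that would presumably look like `maxctrl alter service Splitter-Service max_slave_replication_lag 10` (command shape assumed from the general `maxctrl alter service` syntax).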
| Comments |
| Comment by markus makela [ 2020-07-23 ] |
|
We could make the readwritesplit max_slave_replication_lag parameter a generic service parameter. This would allow other routers to use the same parameter name. As for the behavior, it could be the same as readwritesplit: don't use a server if it's too far behind. This would only affect the server selection done at the start of the client session, which in turn means that if the slave starts lagging behind after the connection is created, there's not much that can be done.

As for ongoing queries, the benefits of this are not so clear. Readconnroute won't (and can't) open new connections when the current server it uses is no longer valid. This means that if the ongoing query counter reaches the configured value, there's nowhere the router can route queries to. If this mode had the same behavior as the replication lag one, the load balancing could end up being very uneven due to transient spikes in the active query count. The granularity of a connection and an active query are very different; a connection is always as large as or larger than a query. This means that selecting an action (creating a connection) that lasts longer than the effect it was based on (a query) is bound to create an uneven distribution of actions.

Instead, we could use an average of the active query count. The connection distribution would eventually start following the actual amount of work done by each server. This would come at the cost of slower response times to sudden spikes in the active query count, but that might not be a bad thing. |
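The averaging idea in the last paragraph can be sketched as follows. This is a hypothetical Python illustration of smoothing the active-query count with an exponential moving average so that transient spikes don't skew server selection; the names `ReplicaStats` and `choose_replica` are invented for this sketch and are not MaxScale APIs.

```python
class ReplicaStats:
    """Tracks a smoothed (EMA) view of one replica's active query count."""

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha      # smoothing factor: higher reacts faster to spikes
        self.avg_active = 0.0   # exponential moving average of active queries

    def record(self, active_queries: int) -> None:
        """Fold the latest sample into the running average."""
        self.avg_active += self.alpha * (active_queries - self.avg_active)


def choose_replica(stats: dict) -> str:
    """Pick the replica whose smoothed load is lowest."""
    return min(stats, key=lambda name: stats[name].avg_active)


# Usage: replica1 is normally less loaded than replica2. A single
# transient spike on replica1 barely moves its average, so the
# selection stays stable instead of flip-flopping.
stats = {"replica1": ReplicaStats(), "replica2": ReplicaStats()}
for _ in range(50):
    stats["replica1"].record(2)
    stats["replica2"].record(5)
stats["replica1"].record(40)  # transient spike
print(choose_replica(stats))  # still prefers replica1
```

The trade-off is exactly the one noted above: a small `alpha` gives a stable distribution that tracks sustained load, at the cost of reacting slowly to genuine load changes.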
| Comment by Daniel Almeida (Inactive) [ 2020-07-23 ] |
|
Thanks Markus. Both items above make sense to me as you explained them, and you're correct: ongoing connections that were established before a threshold was reached would not be impacted (they should not be terminated and should be allowed to continue operating). |