[MXS-3088] Support for replication lag monitoring/on-demand availability Created: 2020-07-22  Updated: 2022-09-08  Resolved: 2022-09-08

Status: Closed
Project: MariaDB MaxScale
Component/s: readconnroute
Affects Version/s: None
Fix Version/s: N/A

Type: New Feature Priority: Major
Reporter: Daniel Almeida (Inactive) Assignee: Todd Stoffel (Inactive)
Resolution: Won't Do Votes: 1
Labels: None


 Description   

Hello folks,
We would like MaxScale to be able to control whether replicas are available or placed in a "standby" mode whenever either of
2 specific thresholds is exceeded:

1. replication lag greater than X seconds

Replicas that lag too far behind their primary server can return stale data. To avoid sending outdated results back to the
client, we would like to prevent new queries from hitting a replica whose replication lag is greater than a given threshold.

2. number of active queries is greater than Y queries

Clients would like the ability to prevent queries from hitting a server once a given number of active queries has been reached.
This can be for a variety of reasons, e.g. application design, ongoing backups causing locks, etc.

We would like MaxScale to prevent new query requests from being sent to a replica whenever either of the 2 thresholds above
is exceeded. A new state within MaxScale would show the affected servers as "standby (throttled)" (or something else you
deem more appropriate), and a new column would show the lag, as in the example below:

┌───────────────┬────────────────┬──────┬─────────────┬─────────────────┬────────────────────────────┬─────────────────┐
│ Server        │ Address        │ Port │ Lag         │ Connections     │ State                      │ GTID            │
├───────────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────────────────────┼─────────────────┤
│ dbServer1     │ 192.168.88.101 │ 3306 │      0      │     20          │ Master, Running            │ 0-8180-15692671 │
├───────────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────────────────────┼─────────────────┤
│ dbServer2     │ 192.168.88.102 │ 3306 │      0      │     40          │ Slave, Running             │ 0-8180-15692671 │
├───────────────┼────────────────┼──────┼─────────────┼─────────────────┼────────────────────────────┼─────────────────┤
│ dbServer3     │ 192.168.88.103 │ 3306 │     500     │     40          │ Slave, Standby(throttled)  │ 0-8180-15690132 │
└───────────────┴────────────────┴──────┴─────────────┴─────────────────┴────────────────────────────┴─────────────────┘

Both thresholds should be independent of each other, and these settings should be changeable at runtime with no restart required.
We could have failsafe logic: if only 1 replica is available, these 2 thresholds would be ignored.
Once a threshold is cleared (e.g. lag falls below it), the replica is automatically made available again and its state is updated.
Queries already in progress are not affected.

Scenario:

1. replication lag greater than X seconds, or
2. number of active queries is greater than Y queries

If #1 or #2 is met, new queries are not sent to the replicas that exceed that threshold.
Once the value in #1 or #2 falls back below the configured threshold, new queries can be routed to the replica again.
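
The scenario above could be sketched roughly as follows. This is only an illustration of the requested behavior, not MaxScale code: the names `Replica`, `is_throttled`, `eligible_replicas`, `max_lag` and `max_active` are all hypothetical, and the failsafe is read literally as "when only one replica exists, ignore both thresholds".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str
    lag_seconds: int     # replication lag behind the primary
    active_queries: int  # queries currently executing on the replica


def is_throttled(r: Replica, max_lag: Optional[int], max_active: Optional[int]) -> bool:
    """True if either threshold is exceeded. A threshold of None is disabled."""
    over_lag = max_lag is not None and r.lag_seconds > max_lag
    over_load = max_active is not None and r.active_queries > max_active
    return over_lag or over_load


def eligible_replicas(replicas: List[Replica],
                      max_lag: Optional[int],
                      max_active: Optional[int]) -> List[Replica]:
    """Replicas that may receive NEW queries; running queries are untouched."""
    # Failsafe (assumed reading): a lone replica is never throttled.
    if len(replicas) <= 1:
        return list(replicas)
    return [r for r in replicas if not is_throttled(r, max_lag, max_active)]


replicas = [Replica("dbServer2", 0, 40), Replica("dbServer3", 500, 40)]
names = [r.name for r in eligible_replicas(replicas, max_lag=60, max_active=100)]
print(names)
```

Because the function is evaluated per new query against the current values, a replica becomes eligible again as soon as its lag or active query count drops back under the threshold, and the thresholds can be changed between calls without any restart.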



 Comments   
Comment by markus makela [ 2020-07-23 ]

We could make the readwritesplit max_slave_replication_lag parameter a generic service parameter. This would allow other routers to use the same parameter name. As for the behavior, it could be the same as readwritesplit: don't use a server if it's too far behind. This would only affect the server selection done at the start of the client session which in turn means that if the slave starts lagging behind after the connection is created, there's not much that can be done.
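
To illustrate the limitation described above, here is a minimal sketch of selection that happens only once, at session start. It is not readconnroute's actual implementation; the `Session` class and its `route` method are hypothetical, and `max_lag` merely stands in for a value such as `max_slave_replication_lag`.

```python
class Session:
    """A client session pinned to the server chosen when it was created."""

    def __init__(self, lags: dict, max_lag: int):
        # Only servers within the lag limit are candidates at session start.
        candidates = [name for name, lag in lags.items() if lag <= max_lag]
        if not candidates:
            raise RuntimeError("no server within the replication lag limit")
        # Pick the least-lagging candidate; ties are broken by name.
        self.server = min(candidates, key=lambda name: (lags[name], name))

    def route(self, query: str) -> str:
        # The lag is NOT re-checked here: the session keeps its server.
        return self.server


lags = {"dbServer2": 0, "dbServer3": 500}
session = Session(lags, max_lag=60)
lags["dbServer2"] = 900  # the replica starts lagging after the session began
print(session.route("SELECT 1"))
```

The session keeps routing to `dbServer2` even after its lag exceeds the limit, which is the "not much that can be done" case mentioned in the comment.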

As for ongoing queries, the benefits of this are not so clear. Readconnroute won't (and can't) open new connections when the current server it uses is no longer valid. This means that if the ongoing query counter reaches the configured value, there's nowhere the router can route queries to. If this mode had the same behavior as the replication lag one, the load balancing could end up being very uneven due to transient spikes in the active query count. The granularity of a connection and that of an active query are very different: a connection always lasts at least as long as a query. This means that selecting an action (creating a connection) that lasts longer than the effect it was based on (a query) is bound to create an uneven distribution of actions.

Instead, we could use an average of the active query count. The connection distribution would eventually start following the actual amount of work done by each server. This would come at the cost of slower response times to sudden spikes in the active query count but this might not be a bad thing.
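
One way to picture the averaging idea is an exponential moving average of the active query count. This is an assumed choice of average, since the comment does not say which kind is meant, and the `SmoothedLoad` class, the `alpha` value and the sample numbers are all made up for illustration.

```python
class SmoothedLoad:
    """Exponential moving average of a server's active query count."""

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # weight given to the newest sample (assumed value)
        self.value = None   # no samples seen yet

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = float(sample)
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value


# Server A is steadily busy; server B is mostly idle with one short spike.
samples = {"A": [10, 10, 10, 10], "B": [2, 2, 2, 30]}

averages = {}
for name, series in samples.items():
    load = SmoothedLoad()
    for s in series:
        load.update(s)
    averages[name] = load.value

latest = {name: series[-1] for name, series in samples.items()}
by_latest = min(latest, key=latest.get)       # reacts to the spike
by_average = min(averages, key=averages.get)  # looks through the spike
print(by_latest, by_average)
```

Choosing by the latest sample would steer a new, long-lived connection away from B because of a momentary spike, while the averaged value still ranks B as the less loaded server. A smaller `alpha` smooths more but reacts more slowly, which is the trade-off described above.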

Comment by Daniel Almeida (Inactive) [ 2020-07-23 ]

Thanks Markus, both points you explained above make sense to me, and you're correct: ongoing connections that were made before the thresholds were reached would not be impacted (they should not be terminated and should be allowed to continue their operation).
