[MXS-2489] ReadWriteSplit service redirects some queries to a lagging slave Created: 2019-05-15  Updated: 2020-03-27  Resolved: 2020-03-02

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon, Monitor, readwritesplit
Affects Version/s: 2.3.7
Fix Version/s: 2.5.0

Type: Bug Priority: Major
Reporter: Abdul Rahman Babil Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: Maxscale, mariadb

Issue Links:
Relates
relates to MXS-1720 Priori causal read Closed

 Description   

I'm using MaxScale 2.3 with three MariaDB servers: one is the master and the other two are slaves. I set max_slave_replication_lag to 5 seconds:

[Read-Write-Service]
type=service
router=readwritesplit
servers=server1,server3,server2
max_slave_replication_lag=5
master_failure_mode=fail_on_write

I then stopped the slaves for a period of time and kept the master running, so all queries were routed to the master. When the slaves came back online, they were hours behind the master. When the Monitor checks the slave status, SHOW SLAVE STATUS sometimes returns:
Slave_IO_Running Preparing
Seconds_Behind_Master NULL

The Monitor then decides that the slave is up to date, and some queries are routed to that slave even though it is hours behind the master!
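To make the failure mode concrete, here is a minimal sketch of how a monitor could treat such a row as "lag unknown" rather than "lag is fine". It is not MaxScale code: SlaveStatusRow and measured_lag are hypothetical names, and the rule that only Slave_IO_Running = Yes with a non-NULL Seconds_Behind_Master counts as a measurement is an assumption made for illustration.

```cpp
#include <optional>
#include <string>

// Hypothetical holder for the two SHOW SLAVE STATUS fields quoted above.
struct SlaveStatusRow
{
    std::string slave_io_running;               // e.g. "Yes", "No", "Connecting", "Preparing"
    std::optional<long> seconds_behind_master;  // std::nullopt when the server returns NULL
};

// Returns the lag in seconds only when it was actually measured.
// Assumption: a row is trusted only if the IO thread reports "Yes" and the
// lag field is not NULL; every other combination yields "unknown".
std::optional<long> measured_lag(const SlaveStatusRow& row)
{
    if (row.slave_io_running != "Yes" || !row.seconds_behind_master.has_value())
    {
        return std::nullopt;
    }
    return row.seconds_behind_master;
}
```

With this shape, the row shown above (Preparing / NULL) produces no lag value at all, so a caller has to decide explicitly what to do with an unmeasured slave.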

I looked through the source code, and I think this block of code is the reason:

static inline bool rpl_lag_is_ok(SRWBackend& backend, int max_rlag)
{
   return max_rlag == MXS_RLAG_UNDEFINED || backend->server()->rlag <= max_rlag;
}

So removing (max_rlag == MXS_RLAG_UNDEFINED) from the condition might help in this case, but it could cause harm when the slave really is up to date. Alternatively, the GTID positions of the master and the slave could be compared to determine whether they are really at the same transaction.
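The following sketch restates the check quoted above on plain integers so its behavior can be tried in isolation. It rests on an assumption: that an unmeasured lag is stored as a negative sentinel. RLAG_UNDEFINED = -1 is a hypothetical stand-in for MXS_RLAG_UNDEFINED, whose real value is not shown in this report. Under that assumption, the second half of the condition (rlag <= max_rlag) is the part that lets an unmeasured slave through, and a stricter variant could reject it.

```cpp
// Hypothetical stand-in for MXS_RLAG_UNDEFINED; assumed to be negative.
constexpr int RLAG_UNDEFINED = -1;

// Same shape as the quoted rpl_lag_is_ok(), with the server lag passed directly.
constexpr bool lag_ok_original(int server_rlag, int max_rlag)
{
    return max_rlag == RLAG_UNDEFINED || server_rlag <= max_rlag;
}

// Stricter variant: when a limit is configured, an unmeasured lag is not accepted.
constexpr bool lag_ok_strict(int server_rlag, int max_rlag)
{
    if (max_rlag == RLAG_UNDEFINED)
    {
        return true;    // no limit configured, every slave qualifies
    }
    return server_rlag != RLAG_UNDEFINED && server_rlag <= max_rlag;
}
```

With max_rlag = 5, lag_ok_original(RLAG_UNDEFINED, 5) is true because -1 <= 5, while lag_ok_strict(RLAG_UNDEFINED, 5) is false. The trade-off is that a slave whose lag has simply not been measured yet is excluded until the monitor obtains a real value.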

Master and slaves are running MariaDB 10.1.40



 Comments   
Comment by markus makela [ 2019-07-04 ]

Seems that the monitor shouldn't update the replication lag if the replication hasn't started.

Comment by markus makela [ 2020-03-02 ]

This will actually be fixed by MXS-1720 which allows routing to use GTIDs to pick only up-to-date servers.

Comment by markus makela [ 2020-03-02 ]

Closing as fixed since MXS-1720 provides a way to route queries to servers that are known to not lag behind.
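As a rough illustration of the GTID-based idea mentioned above, the sketch below compares two MariaDB GTIDs in the domain-server_id-sequence form. It is not the MXS-1720 implementation: Gtid, parse_gtid and slave_caught_up are hypothetical names, and the sketch assumes a single replication domain, so multi-domain GTID lists are rejected rather than handled.

```cpp
#include <cstdio>
#include <optional>
#include <string>

// One MariaDB GTID: domain-server_id-sequence, for example "0-1-100".
struct Gtid
{
    unsigned int domain;
    unsigned int server_id;
    unsigned long long sequence;
};

// Parses a single GTID. Returns std::nullopt for empty input, malformed
// input, or anything followed by extra characters (such as a GTID list).
std::optional<Gtid> parse_gtid(const std::string& text)
{
    Gtid gtid{};
    char trailing = 0;
    int fields = std::sscanf(text.c_str(), "%u-%u-%llu%c",
                             &gtid.domain, &gtid.server_id, &gtid.sequence, &trailing);
    if (fields != 3)
    {
        return std::nullopt;
    }
    return gtid;
}

// Assumption: a slave counts as caught up when it is in the same domain and
// its sequence number has reached the master's. Unparseable positions are
// treated as "not caught up" so that an unknown state is never trusted.
bool slave_caught_up(const std::string& master_pos, const std::string& slave_pos)
{
    auto master = parse_gtid(master_pos);
    auto slave = parse_gtid(slave_pos);
    if (!master || !slave || master->domain != slave->domain)
    {
        return false;
    }
    return slave->sequence >= master->sequence;
}
```

Unlike Seconds_Behind_Master, a position comparison of this kind does not depend on the replication threads reporting a lag value, which is why it sidesteps the NULL case described in this issue.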
