[MXS-2138] Fall back to stalled slaves if the master is down
Created: 2018-11-05  Updated: 2022-09-08  Resolved: 2022-09-08

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 2.2.15
Fix Version/s: N/A

Type: New Feature
Priority: Major
Reporter: Alexander Sheiko
Assignee: Todd Stoffel (Inactive)
Resolution: Won't Do
Votes: 0
Labels: None

Epic Link: Failover / Recovery Improvements

 Description   

We hit a case where the slaves with a large replication lag were excluded because of max_slave_replication_lag, the entire load went to the master, and it collapsed ...

-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status              
-------------------+-----------------+-------+-------------+--------------------
server1            | master          |  3306 |           0 | Down
server2            | slave1          |  3306 |           0 | Running
server3            | slave2          |  3306 |           0 | Running
-------------------+-----------------+-------+-------------+--------------------

We really need a setting that allows MaxScale to use slaves that are running but not fully up to date, for reads only (like router_options=running in readconnroute).



 Comments   
Comment by markus makela [ 2018-11-05 ]

Which version of MaxScale were you using?

Comment by Alexander Sheiko [ 2018-11-05 ]

2.2.15

Comment by markus makela [ 2019-06-18 ]

Can you try with the latest 2.3 release?

Comment by Alexander Sheiko [ 2019-06-18 ]

After the master server goes down, MaxScale 2.3.4 automatically assigns a new master, even though `auto_failover=false` is set.

warning: The current master server 'server1' is no longer valid because it has been down over 5 (failcount) monitor updates and it does not have any running slaves. Selecting new master server.
warning: 'server1' is not a valid master candidate because it is down.
notice : Setting 'server2' as master.
notice : Server changed state: server2[slave1:3306]: new_master. [Running] -> [Master, Running]

Is this normal?

-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status              
-------------------+-----------------+-------+-------------+--------------------
server1            | master          |  3306 |           0 | Down
server2            | slave1          |  3306 |           0 | Master, Running
server3            | slave2          |  3306 |           0 | Running
-------------------+-----------------+-------+-------------+--------------------

Comment by markus makela [ 2019-06-18 ]

Can you try with a more recent 2.3 version? If you can, the output of list servers before and after would help as well as the MaxScale logs.

Comment by Alexander Sheiko [ 2019-06-18 ]

For the test lab, I used the latest Docker image, mariadb/maxscale:latest ...

Comment by Alexander Sheiko [ 2019-06-18 ]

The behavior is the same

MaxScale> show version
2.3.8
MaxScale> list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status              
-------------------+-----------------+-------+-------------+--------------------
server1            | master          |  3306 |           0 | Down
server2            | slave1          |  3306 |           0 | Master, Running
server3            | slave2          |  3306 |           0 | Running
-------------------+-----------------+-------+-------------+--------------------

MaxScale> show monitors
Monitor:                0x56374f011650
Name:                   MariaDB-Monitor
State:                  Running
Times monitored:        164
Sampling interval:      2000 milliseconds
Connect Timeout:        3 seconds
Read Timeout:           1 seconds
Write Timeout:          2 seconds
Connect attempts:       1 
Monitored servers:      [master]:3306, [slave1]:3306, [slave2]:3306
Automatic failover:     Disabled
Failcount:              5
Failover timeout:       90
Switchover timeout:     90
Automatic rejoin:       Disabled
Enforce read-only:      Disabled
Detect stale master:    Enabled
Non-promotable servers (failover): 'server2', 'server3'
 
Server information:
-------------------
 
Server:                 server1
Server ID:              3000
Read only:              No
Gtid current position:  0-3000-5
Gtid binlog position:   0-3000-5
No slave connections
 
Server:                 server2
Server ID:              3001
Read only:              No
Gtid current position:  0-3000-5
Gtid binlog position:   0-3000-5
Slave connections:
  Host:          [master]:3306, IO/SQL running:  No/Yes, Master ID: 3000, Gtid_IO_Pos: 0-3000-5, R.Lag: -1
 
Server:                 server3
Server ID:              3002
Read only:              No
Gtid current position:  0-3000-5
Gtid binlog position:   0-3000-5
Slave connections:
  Host:          [master]:3306, IO/SQL running:  No/Yes, Master ID: 3000, Gtid_IO_Pos: 0-3000-5, R.Lag: -1

Comment by markus makela [ 2019-06-18 ]

Can you post the full configuration and logs? We think that the value of master-retry-count might be too low, so the slaves stop even trying to connect to the master, which in turn causes them to stop being treated as slaves.

Make sure the master-retry-count is set to a reasonably high value. The default of 86400 should be adequate for most purposes.

Comment by Alexander Sheiko [ 2019-06-18 ]

mysqld --log-bin=mariadb-bin --binlog-format=ROW --server-id=3000
mysqld --log-bin=mariadb-bin --binlog-format=ROW --server-id=3001 --log-slave-updates
mysqld --log-bin=mariadb-bin --binlog-format=ROW --server-id=3002 --log-slave-updates

The rest are default values.

Comment by Alexander Sheiko [ 2019-06-18 ]

I also ran the following on the slaves to simulate lag:
STOP SLAVE IO_THREAD;

Comment by markus makela [ 2019-06-18 ]

OK, that might cause it.
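
A minimal sketch of the hypothesis discussed above, not the actual monitor code. It assumes the rule implied by the logged warning ("down over 5 (failcount) monitor updates and it does not have any running slaves") and further assumes that a server only counts as a slave while its replication IO thread is running. The names `Replica` and `master_still_valid` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    name: str
    running: bool
    # Assumption: mirrors the "IO" half of "IO/SQL running" in the monitor output.
    io_thread_running: bool


def master_still_valid(down_ticks: int, failcount: int,
                       replicas: List[Replica]) -> bool:
    """Hypothetical rule: a down master stays valid while it has been down
    for at most `failcount` monitor updates, or while at least one running
    server still counts as its slave."""
    has_running_slave = any(r.running and r.io_thread_running for r in replicas)
    return down_ticks <= failcount or has_running_slave


# After STOP SLAVE IO_THREAD on both slaves, neither counts as a slave here.
stopped = [Replica("server2", True, False), Replica("server3", True, False)]
print(master_still_valid(down_ticks=6, failcount=5, replicas=stopped))
```

Under these assumptions, stopping the IO thread on every slave removes the last reason to keep the old master once failcount is exceeded, which would match the "Selecting new master server" message even with automatic failover disabled.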

Comment by markus makela [ 2019-09-05 ]

We can use a mechanism similar to the rank mechanism added in 2.4 to prioritize servers that aren't lagging and allow use of lagging servers if nothing else is available.
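
The idea can be sketched as a two-tier selection. This is only an illustration of the proposal, not MaxScale's implementation: the `Slave` type and `pick_read_targets` function are hypothetical, and `max_lag` stands in for max_slave_replication_lag.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slave:
    name: str
    running: bool
    lag_seconds: Optional[int]  # None when the lag is unknown


def pick_read_targets(slaves: List[Slave], max_lag: int) -> List[Slave]:
    """Prefer running slaves within max_lag; if there are none,
    fall back to every running slave regardless of lag."""
    running = [s for s in slaves if s.running]
    fresh = [s for s in running
             if s.lag_seconds is not None and s.lag_seconds <= max_lag]
    # Tier 1: slaves within the lag limit. Tier 2: any running slave.
    return fresh if fresh else running


# Both slaves exceed the limit (or have unknown lag), so both are still offered.
lagging = [Slave("server2", True, 120), Slave("server3", True, None)]
print([s.name for s in pick_read_targets(lagging, max_lag=10)])
```

With a strict filter, the example above would return no read targets at all. The tiered version keeps reads available at the cost of serving stale data, which is the trade-off requested in the description.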
