[MXS-579] Slave failure not handled gracefully in readwritesplit Created: 2016-02-11  Updated: 2016-06-06  Resolved: 2016-06-06

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 1.2.1, 1.3.0
Fix Version/s: 2.0.0

Type: Bug Priority: Major
Reporter: Krzysztof Książek Assignee: Timofey Turenko
Resolution: Not a Bug Votes: 0
Labels: None

Attachments: File maxscale.cnf    
Issue Links:
Duplicate
is duplicated by MXS-587 readwritesplit module - intensive loa... Closed
Relates
relates to MXS-756 Retry read after slave failure Closed

 Description   

My understanding is that when a user connects to MaxScale through the readwritesplit router, slave failures should be handled gracefully. Please correct me if I'm wrong.

If that's the case, I can easily reproduce crashes using:

while true ; do sysbench --test=/root/sysbench/sysbench/tests/db/oltp.lua --num-threads=2 --max-requests=0 --max-time=0 --mysql-host=172.30.4.15 --mysql-user=sbtest --mysql-password=sbtest --mysql-port=4008 --oltp-tables-count=32 --report-interval=10 --oltp-skip-trx=on --oltp-table-size=1000000 run ; done

on MaxScale 1.2.1 and 1.3.0 using attached maxscale.cnf.

It's enough to restart slave A and then, once it has recovered, slave B. After one of the restarts, errors like the ones below appear:

WARNING: Both max-requests and max-time are 0, running endless test
sysbench 0.5: multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 2
Report intermediate results every 10 second(s)
Random number generator seed is 0 and will be ignored

Threads started!

ALERT: mysql_drv_query() returned error 2003 (Lost connection to backend server.) for query 'SELECT c FROM sbtest1 WHERE id=501119'
ALERT: mysql_drv_query() returned error 2003 (Lost connection to backend server.) for query 'SELECT c FROM sbtest8 WHERE id=502367'

Looking at Com_select on both slaves, it seems that MaxScale picks one of them as the 'active' slave, and a failure of that slave impacts backend availability. Restarting the 'non-active' slave does not impact the application. Let me know if there's anything wrong with my setup; from what I remember, this worked correctly in MaxScale 1.0.



 Comments   
Comment by markus makela [ 2016-02-11 ]

This is most likely related to MXS-564. Could you try with these packages and see if it happens again: http://maxscale-jenkins.mariadb.com/ci-repository/release-1.3.0-release/mariadb-maxscale/

Comment by Krzysztof Książek [ 2016-02-11 ]

Markus,
Unfortunately, it did not help. Additionally, it looks like this build breaks password encryption (I'm not able to use an encrypted password as explained in https://mariadb.com/kb/en/mariadb-enterprise/mariadb-maxscale/mariadb-maxscale-installation-guide/). I had to revert to plain-text passwords to make it connect to the backend.

Comment by markus makela [ 2016-04-15 ]

I can reproduce this by disconnecting a slave while running sysbench, but the cause of the failure isn't a real bug in MaxScale but rather a "feature" of the readwritesplit module.

When a SELECT is sent to a slave and the connection to that slave is lost before the result returns, the client is disconnected. This is the only safe option, since data could have been modified between the sending of the query and the connection problem.

We could add a feature that retries such reads on another available slave, which would make the slave failure truly transparent to the client.
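
To illustrate the retry idea, here is a minimal sketch in Python. It is not MaxScale code: `BackendLost`, `route_read`, `demo_execute`, and the slave names are hypothetical stand-ins, and the sketch assumes the read is safe to repeat.

```python
class BackendLost(Exception):
    """Raised when the connection to a backend drops before the result arrives."""


def route_read(query, slaves, execute):
    """Try the read on each available slave in turn; fail only if all of them fail."""
    last_error = None
    for slave in slaves:
        try:
            return execute(slave, query)
        except BackendLost as err:
            # This slave went away mid-query; fall through to the next one.
            last_error = err
    raise BackendLost("no slave could serve the read") from last_error


def demo_execute(slave, query):
    # Hypothetical stand-in for sending a query to a backend server.
    if slave == "slave_a":
        raise BackendLost("Lost connection to backend server.")
    return f"{slave}: result of {query}"
```

With `slaves = ["slave_a", "slave_b"]`, the failure of `slave_a` is hidden from the caller and the result comes from `slave_b`; the client only sees an error when every slave has failed.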

drag0nius, if you could run sysbench with --oltp-skip-trx=off, we can confirm whether this is caused by a network error from a slave.

Comment by Krzysztof Książek [ 2016-04-15 ]

Hmm, frankly, in such a case, as long as the SELECT is not part of a transaction, 99% of the time the request can be (and will be) repeated by the application, so it should be perfectly safe to repeat the query within the proxy. The remaining 1%, where such a request cannot or should not be repeated, results from a misunderstanding of how (not) to use transactional SQL: you cannot expect any particular state of the database unless you make your changes and checks within a transaction. If you run auto-commit DML statements and SELECTs and make any assumptions about state, you are in deep trouble already.

Of course, if the SELECT (or any other type of query, for that matter) is part of a transaction, there's no safe option other than to roll it back and restart from scratch.

Comment by Krzysztof Książek [ 2016-04-15 ]

Of course, if session variables, user variables, etc. are set for the connection, then it still may not be safe to re-execute the query in the proxy if you can't reproduce the exact environment settings. That's another story, though.
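
The conditions discussed in these comments can be summarised as a deliberately conservative check. This is only a sketch: `SessionState` and `can_retry_in_proxy` are hypothetical names, the session flags are assumed to be tracked by the proxy, and the SELECT detection is a naive prefix match rather than real SQL parsing.

```python
from dataclasses import dataclass


@dataclass
class SessionState:
    in_transaction: bool = False
    has_session_variables: bool = False  # assumed flag: session variables were changed
    has_user_variables: bool = False     # assumed flag: user variables were set


def can_retry_in_proxy(query: str, state: SessionState) -> bool:
    """Return True only when re-executing the query on another slave looks safe."""
    # Naive check: anything that does not start with SELECT is never retried.
    if not query.lstrip().upper().startswith("SELECT"):
        return False
    # Inside a transaction the only safe option is to roll back and restart.
    if state.in_transaction:
        return False
    # Session state may not be reproducible on another backend, so refuse.
    if state.has_session_variables or state.has_user_variables:
        return False
    return True
```

The check refuses whenever session state exists, even though a proxy that can replay that state on the new backend could relax the last condition.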

Comment by markus makela [ 2016-06-06 ]

I'm closing this as Not a Bug since this is expected behavior. I've created a new feature request for this functionality in MXS-756 and I believe it is possible to achieve transparent slave failure handling in readwritesplit once this new functionality is implemented.
