[MXS-579] Slave failure not handled gracefully in readwritesplit Created: 2016-02-11 Updated: 2016-06-06 Resolved: 2016-06-06 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | readwritesplit |
| Affects Version/s: | 1.2.1, 1.3.0 |
| Fix Version/s: | 2.0.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Krzysztof Książek | Assignee: | Timofey Turenko |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Issue Links: |
|
| Description |
|
My understanding is that when a user connects to MaxScale with the readwritesplit router, slave failures should be handled gracefully. Please correct me if I'm wrong. If that's the case, I can easily reproduce crashes on MaxScale 1.2.1 and 1.3.0, using the attached maxscale.cnf, with:

while true ; do sysbench --test=/root/sysbench/sysbench/tests/db/oltp.lua --num-threads=2 --max-requests=0 --max-time=0 --mysql-host=172.30.4.15 --mysql-user=sbtest --mysql-password=sbtest --mysql-port=4008 --oltp-tables-count=32 --report-interval=10 --oltp-skip-trx=on --oltp-table-size=1000000 run ; done

It's enough to restart slave A and then, once it recovers, slave B. After one of the restarts, errors like the following appear:

WARNING: Both max-requests and max-time are 0, running endless test
Running the test with following options:
Threads started!
ALERT: mysql_drv_query() returned error 2003 (Lost connection to backend server.) for query 'SELECT c FROM sbtest1 WHERE id=501119'

Looking at Com_select on both slaves, it seems like MaxScale picks one of them as an 'active' slave, and its failure impacts backend availability. Restarting the 'non-active' slave does not impact the application. Let me know if there's anything wrong with my setup - from what I remember this worked correctly in MaxScale 1.0. |
| Comments |
| Comment by markus makela [ 2016-02-11 ] |
|
This is most likely related to |
| Comment by Krzysztof Książek [ 2016-02-11 ] |
|
Markus, |
| Comment by markus makela [ 2016-04-15 ] |
|
I can reproduce this by disconnecting a slave while running sysbench, but the cause of the failure isn't a real bug in MaxScale, rather a "feature" of the readwritesplit module. When a SELECT is sent to a slave and the connection to that slave is lost before the result returns, the client is disconnected. This is the only safe thing to do, since data could have been modified between the sending of the query and the problem with the connection. We could add a feature so that reads are retried on an available slave, making slave failures truly transparent to the client. drag0nius If you could run sysbench with --oltp-skip-trx=off, we could confirm that this is caused by a network error from a slave. |
| Comment by Krzysztof Książek [ 2016-04-15 ] |
|
Hmm, frankly, in that case, as long as the SELECT is not part of any transaction, 99% of the time such a request can be (and will be) repeated by the application, so it should be perfectly safe to repeat the query within the proxy. The remaining 1%, where such a request cannot or should not be repeated, is the result of misunderstanding how (not) to use transactional SQL - you cannot expect any particular state of the database if you do not make your changes and checks within a transaction. If you run auto-commit DMLs and SELECTs and make any assumptions about state, you are in deep trouble already. Of course, if the SELECT (or any other type of query, for that matter) is part of a transaction, there's no safe way other than to roll it back and restart from scratch. |
| Comment by Krzysztof Książek [ 2016-04-15 ] |
|
Of course, if there are session variables, user variables etc. set for the connection, then it still may not be safe to re-execute the query in the proxy, since you can't reproduce the exact environment settings on the new connection. That's another story, though. |
| Comment by markus makela [ 2016-06-06 ] |
|
I'm closing this as Not a Bug since this is expected behavior. I've created a new feature request for this functionality in |