[MXS-1893] CHANGE MASTER lost connection on auto_rejoin should retry Created: 2018-06-01  Updated: 2018-08-17  Resolved: 2018-08-17

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 2.2.5
Fix Version/s: 2.2.10

Type: Bug Priority: Major
Reporter: Richard Lane Assignee: Esa Korhonen
Resolution: Fixed Votes: 0
Labels: None

Attachments: Text File maxscale.log    

 Description   

When a node/container comes up and attempts to auto_rejoin as a slave, if the CHANGE MASTER loses its connection (but actually succeeds), the node ends up pointing to the master, but the slave is never started and auto_rejoin is disabled (which is the worst part).

If MaxScale had just gone ahead and run START SLAVE anyway, this would have worked; alternatively, the auto_rejoin should be retried before it is disabled and declared a failure. I really don't know why the connection was lost, but it was a transient event.

The following is an excerpt from maxscale.log (since I couldn't attach it).
Here are the times:
14:13:12: mariadb-0 comes back up and attempts to rejoin the master, but CHANGE MASTER TO ... loses its connection and fails. The CHANGE MASTER actually worked, and the slave already had the new master set up.
14:18:19: I went into the mariadb-0 node, saw that the CHANGE MASTER had worked, and manually ran START SLAVE. The node was then successfully replicating from the local master.
------------------------------------
2018-05-31 14:13:10 notice : Server changed state: mdb-dc1-mariadb-0[192.168.1.218:3306]: server_up. [Down] -> [Running]
2018-05-31 14:13:11 notice : Executed monitor script '/usr/lib/maxscale/maxscale_notify.py --initiator=[192.168.1.218]:3306 --event=server_up --servers=[192.168.1.78]:3306,[192.168.1.218]:3306,[192.168.1.128]:3306 --masters=[192.168.1.78]:3306 --slaves=[192.168.1.128]:3306' on event 'server_up'
2018-05-31 14:13:11 notice : [mariadbmon] Server 'mdb-dc1-mariadb-0' is replicating from a server other than 'mdb-dc1-mariadb-1', redirecting it to 'mdb-dc1-mariadb-1'.
2018-05-31 14:13:12 warning: [mariadbmon] Slave 'mdb-dc1-mariadb-0' redirection failed: 'Lost connection to MySQL server during query'. Query: 'CHANGE MASTER TO ...'.
2018-05-31 14:13:12 error : [mariadbmon] A cluster join operation failed, disabling automatic rejoining. To re-enable, manually set 'auto_rejoin' to 'true' for monitor 'MariaDB-Monitor' via MaxAdmin or the REST API.
2018-05-31 14:18:19 notice : Server changed state: mdb-dc1-mariadb-0[192.168.1.218:3306]: new_slave. [Running] -> [Slave, Running]
2018-05-31 14:18:19 notice : Executed monitor script '/usr/lib/maxscale/maxscale_notify.py --initiator=[192.168.1.218]:3306 --event=new_slave --servers=[192.168.1.78]:3306,[192.168.1.218]:3306,[192.168.1.128]:3306 --masters=[192.168.1.78]:3306 --slaves=[192.168.1.218]:3306,[192.168.1.128]:3306' on event 'new_slave'



 Comments   
Comment by markus makela [ 2018-06-02 ]

As the auto-rejoin operation is "non-destructive", it should be perfectly OK to keep on trying to rejoin servers even if a rejoin fails.

Comment by Richard Lane [ 2018-06-08 ]

I am actually requesting that MaxScale have an option to retry the rejoin operation if any of STOP SLAVE, RESET SLAVE, or CHANGE MASTER TO fails.

Comment by markus makela [ 2018-06-12 ]

As a temporary workaround, adding query_retries=2 and query_retry_timeout=10 under the [maxscale] section should allow automated retrying of these queries.
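
For illustration, the workaround described above might look like the following fragment of the MaxScale configuration file. This is a minimal sketch: only the two parameters and their values come from the comment, any other settings already present in the [maxscale] section are assumed to stay unchanged, and the meaning given in the comments is an interpretation of the parameter names rather than a quotation from the documentation.

```ini
[maxscale]
# Retry interrupted internal queries (such as those issued by the monitor) up to 2 times.
query_retries=2
# Assumed to be the total time limit, in seconds, for the retries of one query.
query_retry_timeout=10
```

MaxScale would presumably need to be restarted for changes to the [maxscale] section to take effect.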

Comment by Esa Korhonen [ 2018-08-17 ]

As of 2.2.10, auto_rejoin is no longer turned off if it fails. This may lead to a situation where it's attempted every loop, but that is quite unlikely. With this and the options mentioned above the rejoin seems quite error-tolerant.
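
As a rough sketch, the relevant part of the monitor definition might look like this. The section name 'MariaDB-Monitor' is taken from the log excerpt in the description; the server list, credentials, and all other monitor settings are deliberately omitted here and are assumed to be configured already.

```ini
[MariaDB-Monitor]
type=monitor
module=mariadbmon
# servers, credentials and other monitor settings omitted (assumed to exist already)
auto_rejoin=true
```

On versions before 2.2.10, a failed rejoin turns this setting off at runtime, and the log message above says it must then be set back to 'true' via MaxAdmin or the REST API.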
