[MXS-1748] Failed rejoin can lead to a slave with read-only OFF Created: 2018-03-28 Updated: 2018-06-21 Resolved: 2018-06-21 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | mariadbmon |
| Affects Version/s: | 2.2.3 |
| Fix Version/s: | 2.2.6, 2.3.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Esa Korhonen | Assignee: | Esa Korhonen |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Description |
|
Currently, any cluster operation that sets read-only=1 disables read-only again if a later stage fails. The errors from Connector/C do not seem to be entirely accurate, as the server may perform the operation anyway. In that case the monitor thinks the operation failed and sets read-only=0, leaving a slave without read-only. This can be confusing for users. MariaDB Backup seems to be involved, as it also manipulates the SQL threads. This needs further investigation, but making the rejoin and related code smarter is probably a good step. Perhaps, on an error, the code should wait a second, reconnect, check the status and try again. |
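The "wait a second, reconnect, check status and try again" idea could look roughly like the sketch below. This is not the monitor's actual code (mariadbmon is written in C++); `run_with_verification` and its callback parameters are hypothetical names used only for illustration, and the sketch assumes that a failed client call surfaces as `ConnectionError`.

```python
import time
from typing import Callable


def run_with_verification(
    attempt: Callable[[], None],
    verify: Callable[[], bool],
    reconnect: Callable[[], None],
    retries: int = 2,
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run attempt(); on error, wait, reconnect and check state before retrying.

    Returns True once verify() confirms the desired state, and False if
    every try fails. All names here are hypothetical.
    """
    for _ in range(retries + 1):
        try:
            attempt()
        except ConnectionError:
            # The client library reported a failure, but the server may have
            # executed the statement anyway, so do not roll back yet.
            sleep(wait_seconds)
            try:
                reconnect()
            except ConnectionError:
                continue  # server unreachable; the next iteration tries again
        try:
            # Trust the observed server state, not the error code.
            if verify():
                return True
        except ConnectionError:
            continue
    return False
```

With this shape, read_only would only be reverted after `run_with_verification` returns False, that is, after the server state has been checked directly rather than inferred from the error.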
| Comments |
| Comment by Massimo [ 2018-03-28 ] |
|
There are many scenarios where a query fails and the status of a slave moves from "Slave, Running" to "Running". Once MaxScale decides to rejoin the server to the cluster and moves the status from "Running" back to "Slave, Running", read_only should be set to ON. The binary backup case, where "stop slave" is required, needs to be handled. The situation should be more or less what happens in Galera when you desync a node. A forced START SLAVE could interfere with the STOP SLAVE issued by the backup (unlikely, but still possible), whereas read_only=ON can usually stay as it is, even when the slave is used for backups. A choice has to be made to allow a single server to be recognised by MaxScale with status "Running" and read_only=OFF. Note that --safe-slave-backup stops only the SQL thread (SQL_THREAD), not both threads. Some topologies keep serving reads from a slave even while it is used for backups, and stop the SQL thread for consistency; the --safe-slave-backup-timeout parameter tells how long to wait before failing. One possibility is to introduce a new per-server variable that marks a node as being used for backup and keeps it in the cluster for a timeout given in seconds, so that read_only remains ON and the status remains "Slave, Running". After that timeout, the server should be moved to status "Running". So this will cover:
Observing this in a MaxScale case, we see the following from show monitors: Server: server-05 |
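The per-server backup timeout proposed in this comment can be sketched as a small decision function. Nothing here exists in MaxScale: `SlaveObservation`, `monitored_status` and `backup_grace_seconds` are hypothetical names, and the sketch assumes the monitor can tell how long the SQL thread has been stopped.

```python
from dataclasses import dataclass


@dataclass
class SlaveObservation:
    """What a monitor pass sees on one server (hypothetical structure)."""
    io_thread_running: bool
    sql_thread_running: bool
    seconds_sql_stopped: float  # 0.0 while the SQL thread is running


def monitored_status(obs: SlaveObservation, backup_grace_seconds: float) -> str:
    """Return the status label the proposal would assign to the server."""
    if not obs.io_thread_running:
        # Both threads stopped or the link is broken: not a backup pause.
        return "Running"
    if obs.sql_thread_running:
        return "Slave, Running"
    if obs.seconds_sql_stopped <= backup_grace_seconds:
        # Only the SQL thread is stopped, as with --safe-slave-backup:
        # keep the server in the cluster during the grace period.
        return "Slave, Running"
    return "Running"
```

In either outcome the proposal leaves read_only=ON; only the status label changes once the grace period has elapsed.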
| Comment by Massimo [ 2018-03-30 ] |
|
Reading the options the monitor has had since 2.2, it may be much simpler to use the script option: make sure that when the event "Running" -> "Slave, Running" happens, the script executes READ_ONLY=ON (in theory READ_ONLY may apply only to new connections, not to active ones, which could in theory still write to the slave). |
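A script hooked to the monitor for this purpose might be structured like the sketch below. It assumes the monitor is configured to call the script with `--event=$EVENT --initiator=$INITIATOR`, that the relevant event is named `new_slave`, and that the initiator is passed as `[IP]:port`; these should be checked against the MaxScale documentation for the version in use. `plan` and `parse_initiator` are hypothetical helpers, and actually sending the statement to the server is left out.

```python
import argparse
import re
from typing import List, Optional, Tuple


def parse_initiator(initiator: str) -> Tuple[str, int]:
    """Split an initiator string of the assumed form "[IP]:port"."""
    match = re.fullmatch(r"\[(.+)\]:(\d+)", initiator)
    if match is None:
        raise ValueError(f"unexpected initiator format: {initiator!r}")
    return match.group(1), int(match.group(2))


def plan(argv: List[str]) -> Optional[Tuple[str, int, str]]:
    """Return (host, port, statement) to run, or None if nothing is needed."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--event", required=True)
    parser.add_argument("--initiator", required=True)
    args = parser.parse_args(argv)
    if args.event != "new_slave":
        # Only react when a server gains slave status (assumed event name).
        return None
    host, port = parse_initiator(args.initiator)
    return host, port, "SET GLOBAL read_only=ON"
```

The caller would then connect to the returned host and port with its client of choice and execute the statement.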
| Comment by Esa Korhonen [ 2018-06-21 ] |
|
The rejoin code no longer sets read_only to OFF on error. I don't know about the rest of the issues raised here. |
| Comment by Esa Korhonen [ 2018-06-21 ] |
|
Closing issue for now. Another issue should be opened for any additional requirements. |