[MXS-1748] Failed rejoin can lead to a slave with read-only OFF Created: 2018-03-28  Updated: 2018-06-21  Resolved: 2018-06-21

Status: Closed
Project: MariaDB MaxScale
Component/s: mariadbmon
Affects Version/s: 2.2.3
Fix Version/s: 2.2.6, 2.3.0

Type: Bug Priority: Major
Reporter: Esa Korhonen Assignee: Esa Korhonen
Resolution: Fixed Votes: 1
Labels: None


 Description   

Currently, any cluster operation which sets read_only=1 will disable read-only if a later stage fails. The errors from Connector/C are not entirely accurate, as the server may perform the operation anyway. In that case the monitor thinks the operation failed and sets read_only=0, leaving a slave without read-only. This can be confusing for users. MariaDB Backup seems to be involved, as it also manipulates the SQL threads. This needs further investigation, but making the rejoin etc. code smarter is probably a good step. Perhaps, on getting an error, the code should wait a second, reconnect, check the status and try again.
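The "wait, reconnect, verify, retry" idea suggested above could be sketched roughly as follows. This is a minimal illustration in Python rather than MaxScale's actual C++ monitor code; the step/check/reconnect callbacks are hypothetical stand-ins for the monitor's real operations:

```python
import time

def run_with_verify(step, check, reconnect, attempts=2, delay=1.0):
    """Run a cluster-operation step; on a connector error, wait,
    reconnect and re-check whether the server actually applied the
    step before treating it as a failure.

    step()      -- executes the operation, may raise on a connector error
    check()     -- returns True if server state shows the step succeeded
    reconnect() -- re-establishes the monitor connection
    """
    for attempt in range(attempts):
        try:
            step()
            return True
        except ConnectionError:
            # The connector may report an error even though the server
            # performed the operation; verify before assuming failure.
            time.sleep(delay)
            reconnect()
            if check():
                return True  # succeeded despite the reported error
    return False
```

With this pattern, a spurious connector error no longer triggers the rollback (read_only=0) path, because the monitor confirms the server's real state first.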



 Comments   
Comment by Massimo [ 2018-03-28 ]

While there are many scenarios where a query fails and the status of a slave moves from "Slave, Running" to "Running", once MaxScale is ready to make the server rejoin the cluster and move the status back from "Running" to "Slave, Running", read_only should be set to ON.

The case of a binary backup, where "STOP SLAVE" is required, needs to be handled. The situation is more or less what happens in Galera when you desync a node.

So forcing START SLAVE could interfere with the STOP SLAVE issued by the backup (remote, but still possible), while read_only=ON can usually be kept as-is even when the slave is being used for the backup.

There is a choice to make in order to be able to have a single server that MaxScale recognises with (Running) status and read_only=OFF.

Note that --safe-slave-backup stops only the SQL thread, not both threads. Some topologies keep using a slave for reads even while it is used for backups, stopping the SQL thread to get a consistent copy; there is a parameter called --safe-slave-backup-timeout that tells how long to wait before failing.
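The wait that --safe-slave-backup performs can be approximated as follows. This is a hedged sketch of the documented behaviour (after stopping the SQL thread, the backup waits for Slave_open_temp_tables to reach zero, up to --safe-slave-backup-timeout); the get_open_temp_tables callback is a hypothetical stand-in for querying SHOW STATUS:

```python
import time

def wait_safe_slave(get_open_temp_tables, timeout=300, poll=1.0):
    """Approximate the --safe-slave-backup wait: with the SQL thread
    stopped, poll Slave_open_temp_tables until it reaches zero or the
    --safe-slave-backup-timeout expires. Returns True if it is safe
    to take the backup."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_open_temp_tables() == 0:
            return True  # no open temp tables; consistent copy possible
        time.sleep(poll)
    return False  # timed out; backup would fail
```

This is exactly the window during which the monitor sees the SQL thread stopped and may misclassify the slave.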

One possibility is to introduce a new per-server variable that marks a node as used for backups and does not move it out of the cluster for a timeout period: read_only would remain ON and the status would remain "Slave, Running". After that timeout, the server would be moved to status "Running". This would cover:

  • the case where STOP SLAVE is performed briefly for a backup on a specific server
  • the case where a node used for backups is still allowed to be part of the cluster and accept reads while the backup runs, so read_only is necessary
  • the node can still take part in master promotion (though it remains to be checked what happens if a master promotion is requested while a backup is running)
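The grace-period behaviour proposed above could be decided roughly like this. A hedged sketch only: backup_grace is the hypothetical new per-server variable suggested in this comment, not an existing MaxScale parameter:

```python
import time

def slave_status(sql_thread_running, stopped_since, backup_grace, now=None):
    """Decide the monitor status for a slave whose SQL thread may have
    been stopped by a backup. Within the grace period the node keeps
    its "Slave, Running" status (and read_only stays ON); once the
    grace period expires it is demoted to plain "Running".

    stopped_since -- monotonic timestamp when the SQL thread stopped,
                     or None if it never stopped
    """
    now = time.monotonic() if now is None else now
    if sql_thread_running:
        return "Slave, Running"
    if stopped_since is not None and now - stopped_since <= backup_grace:
        # Assume a backup briefly stopped the thread; do not demote yet.
        return "Slave, Running"
    return "Running"
```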

Observing this from a MaxScale instance, we see the following.
From list servers:
list servers
Servers.
--------------------------------------------------------------------
Server | Address | Port | Connections | Status
--------------------------------------------------------------------
server-05 | 192.168.1.5 | 3306 | xx | Slave, Running
server-04 | 192.168.1.4 | 3306 | xx | Slave, Running
server-03 | 192.168.1.3 | 3306 | xx | Master, Running

From show monitors:

Server: server-05
Server ID: 5
Read only: NO
Slave configured: YES
Slave IO running: YES
Slave SQL running: YES
Master ID: 3

Comment by Massimo [ 2018-03-30 ]

Reading through the options the monitor has as of 2.2, it may be much simpler to use the option

script

and make sure that when the event "Running" -> "Slave, Running" happens, the script executes SET GLOBAL read_only=ON (in theory read_only can apply to new connections, not to the active ones, which in theory could still write to the slave).
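A minimal sketch of such a monitor script follows. It assumes MaxScale passes the event name via the $EVENT placeholder on the script command line (slave_up and new_slave are used here as the events of interest, an assumption to be checked against the monitor documentation); the execute_sql callable stands in for actually connecting to the initiating server:

```python
import sys

def handle_event(event, execute_sql):
    """If the monitor reports that a server came (back) up as a slave,
    turn read_only on before routers start sending it traffic.
    execute_sql runs one SQL statement on the initiating server."""
    if event in ("slave_up", "new_slave"):
        execute_sql("SET GLOBAL read_only=ON")
        return True
    return False

if __name__ == "__main__":
    # Hypothetical wiring in maxscale.cnf:
    #   script=/usr/local/bin/set_readonly.py $EVENT $INITIATOR
    event = sys.argv[1] if len(sys.argv) > 1 else ""
    acted = handle_event(event, lambda stmt: None)  # stub executor
    print("set read_only=ON" if acted else "no action")
```

Note the caveat from the comment above: read_only does not abort statements already in flight, so a write racing the state change can still slip through.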

Comment by Esa Korhonen [ 2018-06-21 ]

The rejoin code no longer sets read_only to OFF on error. I don't know about the rest of the issues raised here.

Comment by Esa Korhonen [ 2018-06-21 ]

Closing issue for now. Another issue should be opened for any additional requirements.

Generated at Thu Feb 08 04:09:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.