Details
Bug
Status: Closed
Critical
Resolution: Fixed
23.02.14
None
MXS-SPRINT-246, MXS-SPRINT-247
Description
Title was: After "stop slave" times out, check replication status with "show slave status"
When "stop slave" is executed with semisync replication, the replica server tells the primary server that it is stopping replication. If the primary is down, this step can stall, even though the stall should be limited by rpl_semi_sync_slave_kill_conn_timeout. Other possible causes of a stall are an ongoing backup or FTWRL (FLUSH TABLES WITH READ LOCK).
In some cases, this has caused MariaDB Monitor switchover/failover to time out. To deal with this, the monitor should check replication status with "show all slaves status" after "stop slave" has timed out. If the replication status shows that the slave threads have stopped, failover/switchover can proceed.
Further comments from MaxScale team:
Despite several attempts, we cannot reproduce the stop slave timeout with either failover or switchover. Thus, the "fix" to this issue (show slave status query) may not be a valid fix at all. The lack of a stop slave timeout during failover is expected, as the monitor waits until the relay log of the replica is clear before sending the stop slave. When stopping semisynchronous replication, the replica does try to connect to the primary to kill the binlog dump thread. However, as the primary is assumed to be down, this connection should simply fail, after which the replica continues with stopping replication.
Regardless, due to demand, we are adding a MariaDB Monitor setting check_repl_on_stop_slave_timeout that activates the requested behavior. The feature is disabled by default.
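The requested behavior can be illustrated with a minimal Python sketch. This is not the monitor's actual implementation: StopSlaveTimeout, run_stop_slave, fetch_all_slaves_status and the check_on_timeout flag are hypothetical names used only for illustration, with the flag standing in for check_repl_on_stop_slave_timeout. The sketch assumes the status rows expose the Slave_IO_Running and Slave_SQL_Running columns as strings.

```python
class StopSlaveTimeout(Exception):
    """Hypothetical error raised when "stop slave" exceeds its time limit."""


def threads_stopped(status_rows):
    """Return True if every row reports both slave threads as stopped.

    Each row is assumed to be a dict built from one result row of
    "show all slaves status". An empty result means no replication
    connections are configured, which is treated as stopped.
    """
    return all(
        row.get("Slave_IO_Running") == "No"
        and row.get("Slave_SQL_Running") == "No"
        for row in status_rows
    )


def stop_replication(run_stop_slave, fetch_all_slaves_status,
                     check_on_timeout=False):
    """Return True if failover/switchover may proceed.

    run_stop_slave and fetch_all_slaves_status are hypothetical callables
    that run the corresponding queries on the replica.
    """
    try:
        run_stop_slave()
        return True
    except StopSlaveTimeout:
        if not check_on_timeout:
            # Default: a timeout is treated as a failure.
            return False
        # Opt-in: accept the timeout if the threads have stopped anyway.
        return threads_stopped(fetch_all_slaves_status())
```

With check_on_timeout left at False, a timeout always blocks the operation, which matches the feature being disabled by default.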