[MXS-847] server_down event is executed 8 times due to putting server into maintenance mode Created: 2016-08-26  Updated: 2016-08-31  Resolved: 2016-08-30

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 1.4.3
Fix Version/s: 2.0.1

Type: Bug
Priority: Critical
Reporter: Michaël de groot
Assignee: markus makela
Resolution: Duplicate
Votes: 0
Labels: None

Issue Links:
Relates
relates to MXS-845 "Server down" event is re-triggered a... Closed
relates to MXS-846 MMMon: Maintenance mode on slave logs... Closed

 Description   

Hi,

I use mmmon. I am setting the priority to Critical because I have a strong feeling that mrm (MariaDB Replication Manager) has issues with this as well. Perhaps tanj can say whether this is actually the case.

I have a failover script that puts the failing server into maintenance mode to avoid failbacks. Somehow, with this setup, the server_down script is executed ~8 times.

My config file:

# MaxScale documentation on GitHub:
# https://github.com/mariadb-corporation/MaxScale/blob/master/Documentation/Documentation-Contents.md
 
# Global parameters
#
# Complete list of configuration options:
# https://github.com/mariadb-corporation/MaxScale/blob/master/Documentation/Getting-Started/Configuration-Guide.md
 
[maxscale]
#threads=1
threads=4
#log_debug=1
 
# Server definitions
#
# Set the address of the server to the network
# address of a MySQL server.
#
 
[core01]
type=server
address=customer-prod-db-core01
port=3306
protocol=MySQLBackend
masterweight=1
 
[core11]
type=server
address=customer-prod-db-core11
port=3306
protocol=MySQLBackend
masterweight=0
 
[history01]
type=server
address=customer-prod-db-history01
port=3306
protocol=MySQLBackend
 
[history11]
type=server
address=customer-prod-db-history11
port=3306
protocol=MySQLBackend
 
#### MASTER - MASTER - WRITE ####
[Core11 master slave  Monitor]
type=monitor
module=mmmon
servers=core01,core11
user=maxscale
passwd=***
script=/root/replication-scripts/failover-master.sh --event=$EVENT --initiator=$INITIATOR --nodelist=$NODELIST
events=master_down,server_down
monitor_interval=500
# replication_lag_monitor=1 ## Does not work yet in mmmon (or multimaster in mysqlmon) --michael@MariaDB 2016-08-27
# max_slave_replication_lag=5 ## https://jira.mariadb.org/browse/MXS-839
 
[Core01 Master read-write Service]
type=service
router=readconnroute
servers=core01,core11
user=maxscale
passwd=***
router_options=master
 
[Core01 Master read-write Listener]
type=listener
service=Core01 Master read-write Service
protocol=MySQLClient
port=3310
 
##### READ ONLY #####
 
[History01 Read-Only Service]
type=service
router=readconnroute
servers=history01, history11
user=maxscale
passwd=***
# Impossible to use router_option slave because mmmon does not monitor these.
# mysqlmon cannot monitor it because there is a multi master setup causing no master to be selected by mysqlmon and the cluster of 2 slaves getting 'slave from external master' state. --michael@mariadb 2016-08-26
# router_options=slave 
#filters=MyRegexFilter
 
[History01 Read-Only Listener]
type=listener
service=History01 Read-Only Service
protocol=MySQLClient
port=3317
 
##
 
 
[MaxAdmin Service]
type=service
router=cli
 
[MaxAdmin Listener]
type=listener
service=MaxAdmin Service
protocol=maxscaled
port=6603

My script:

#!/bin/bash
# failover_master.sh
 
ARGS=$(getopt -o '' --long 'event:,initiator:,nodelist:' -- "$@")
eval set -- "$ARGS"
 
while true; do
    case "$1" in
        --event)
            shift;
            event=$1
            shift;
        ;;
        --initiator)
            shift;
            initiator=$1
            shift;
        ;;
        --nodelist)
            shift;
            nodelist=$1
            shift;
        ;;
        --)
            shift;
            break;
        ;;
    esac
done
 
candidate=`echo "$nodelist" | awk -F':' '{print $1}'`
maxscale_host=`echo "$initiator" | awk -F'-' '{print $5}'`
maxscale_host=`echo "$maxscale_host" | awk -F':' '{print $1}'`
 
if [ -z "$candidate" ]; then
   echo "ERROR!!! NO candidate master found when failing over $initiator! The system might be down."|wall
   echo "ERROR!!! NO candidate master found! The system might be down."
   exit 0
fi
 
# WORK AROUND for race condition, see https://jira.mariadb.org/browse/MXS-845
currently_in_maintenance=`maxadmin -pmariadb list servers|grep Maintenance|grep $maxscale_host|wc -l`
if [ $currently_in_maintenance =  "0" ]; then
   maxadmin -pmariadb set server $maxscale_host maintenance
   maxadmin -pmariadb clear server $maxscale_host running
else
   echo "This script is not the first one, exiting."|wall
   exit 1
fi
 
# loosen (ACI)D to speedup any lag. 
mysql -u maxscale -p'***' --host=$candidate -e "set global sync_binlog=0; set global innodb_flush_log_at_trx_commit=0;set global innodb_io_capacity=50000;"
 
while true; do
   echo "Waiting until all transactions have been applied on candidate master $candidate..."
   sleep 1
   SLAVESTAT=$(mysql -umaxscale -p'***' --host=$candidate -e "show slave status\G");
   exec_master_pos=`echo "$SLAVESTAT" | grep -w 'Exec_Master_Log_Pos:' | awk '{print $2}';`
   read_master_pos=`echo "$SLAVESTAT" | grep -w 'Read_Master_Log_Pos:' | awk '{print $2}';`
   # Quote the variables so the test behaves correctly on the first pass, when old_read_master_pos is still empty.
   if [ -n "$old_read_master_pos" ] && [ "$old_read_master_pos" != "$read_master_pos" ]; then
      echo "ERROR!!! Old master $initiator still receives transactions after putting it into maintenance! Manual intervention required to make sure the old master is really down."
      echo "ERROR!!! Old master $initiator still receives transactions after putting it into maintenance! Manual intervention required to make sure the old master is really down."|wall
      exit 0
   fi
   # Store the value, not the literal variable name, for comparison on the next pass.
   old_read_master_pos=$read_master_pos
   count=`expr $read_master_pos - $exec_master_pos`
 
   if [ $count -eq 0 ]; then
      mysql -umaxscale -p'***' --host=$candidate -e "set global read_only=OFF; set global sync_binlog=1; set global innodb_flush_log_at_trx_commit=1;SET GLOBAL innodb_io_capacity=200" 
      break;
   fi
done
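
The wait loop in the script subtracts Exec_Master_Log_Pos from Read_Master_Log_Pos. Those two values are offsets into Master_Log_File and Relay_Master_Log_File respectively, so the difference is only meaningful while both fields name the same binary log. A minimal sketch of a stricter check follows; the helper names slave_field and replication_caught_up are hypothetical and not part of the script or of MaxScale, and the status text is assumed to be the output of "show slave status\G".

```shell
#!/bin/bash
# Hypothetical helpers (illustrative names, not part of the original script).

# Print the value of one field ($2) from "SHOW SLAVE STATUS\G" text ($1).
slave_field() {
    printf '%s\n' "$1" | awk -v key="$2:" '$1 == key { print $2; exit }'
}

# Succeed only when the applied position has reached the received position
# and both positions refer to the same master binary log file.
replication_caught_up() {
    local status="$1"
    local read_file exec_file read_pos exec_pos
    read_file=$(slave_field "$status" Master_Log_File)
    exec_file=$(slave_field "$status" Relay_Master_Log_File)
    read_pos=$(slave_field "$status" Read_Master_Log_Pos)
    exec_pos=$(slave_field "$status" Exec_Master_Log_Pos)
    # Missing fields mean the status could not be read: treat as not caught up.
    [ -n "$read_pos" ] && [ -n "$exec_pos" ] || return 1
    [ "$read_file" = "$exec_file" ] && [ "$read_pos" -eq "$exec_pos" ]
}
```

Inside the loop this could replace the expr arithmetic, for example: if replication_caught_up "$SLAVESTAT"; then ...; break; fi. An empty or failed status query is treated as "not caught up" so that the candidate is not promoted on missing data.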



 Comments   
Comment by markus makela [ 2016-08-29 ]

I can reproduce this and it seems to be some sort of a race condition. Setting the server to maintenance shouldn't trigger any scripts as it isn't an external state change.
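
Until the monitor stops re-triggering the event, the script has to tolerate concurrent invocations. The workaround in the script above checks "list servers" and then sets maintenance mode as two separate steps, so several copies started close together can all pass the check. A minimal sketch of an atomic guard follows, assuming all invocations run on the same host with a shared local filesystem; LOCK_ROOT, acquire_failover_lock and release_failover_lock are hypothetical names, not part of MaxScale.

```shell
#!/bin/bash
# Hypothetical guard (illustrative names): only the first invocation for a
# given server acquires the lock; later ones fail and can exit immediately.
LOCK_ROOT="${LOCK_ROOT:-/tmp/failover-locks}"

acquire_failover_lock() {
    local server="$1"
    mkdir -p "$LOCK_ROOT" || return 2
    # mkdir without -p fails if the directory already exists, so the
    # existence check and the creation happen as one atomic step.
    mkdir "$LOCK_ROOT/$server.lock" 2>/dev/null
}

release_failover_lock() {
    rmdir "$LOCK_ROOT/$1.lock" 2>/dev/null
}
```

In the failover script this could run before any maxadmin call, for example: acquire_failover_lock "$maxscale_host" || exit 1. The lock would need to be released, by hand or by a separate recovery step, once the failed server is taken out of maintenance mode.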

Comment by markus makela [ 2016-08-30 ]

This duplicates MXS-845.
