[MXS-847] server_down event is executed 8 times due to putting server into maintenance mode

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Duplicate
Affects Version/s: 1.4.3
Fix Version/s: 2.0.1
Component/s: Core
Labels: None

Description

Hi,

I use mmmon. I set the priority to Critical because I have a strong feeling that mrm (MariaDB replication manager) has issues with this. Perhaps tanj can say whether this is actually the case.

I have a failover script that puts the failing server into maintenance mode to avoid failbacks. Somehow, with this setup, the server_down script is executed ~8 times.

My config file:

# MaxScale documentation on GitHub:
# https://github.com/mariadb-corporation/MaxScale/blob/master/Documentation/Documentation-Contents.md

# Global parameters
# Complete list of configuration options:
# https://github.com/mariadb-corporation/MaxScale/blob/master/Documentation/Getting-Started/Configuration-Guide.md

[maxscale]
#threads=1
threads=4
#log_debug=1

# Server definitions
# Set the address of the server to the network
# address of a MySQL server.

[core01]
type=server
address=customer-prod-db-core01
port=3306
protocol=MySQLBackend
masterweight=1

[core11]
type=server
address=customer-prod-db-core11
port=3306
protocol=MySQLBackend
masterweight=0

[history01]
type=server
address=customer-prod-db-history01
port=3306
protocol=MySQLBackend

[history11]
type=server
address=customer-prod-db-history11
port=3306
protocol=MySQLBackend

#### MASTER - MASTER - WRITE ####

[Core11 master slave  Monitor]
type=monitor
module=mmmon
servers=core01,core11
user=maxscale
passwd=***
script=/root/replication-scripts/failover-master.sh --event=$EVENT --initiator=$INITIATOR --nodelist=$NODELIST
events=master_down,server_down
monitor_interval=500
# replication_lag_monitor=1 ## Does not work yet in mmmon (or multimaster in mysqlmon) --michael@MariaDB 2016-08-27
# max_slave_replication_lag=5 ## https://jira.mariadb.org/browse/MXS-839

[Core01 Master read-write Service]
type=service
router=readconnroute
servers=core01,core11
user=maxscale
passwd=***
router_options=master

[Core01 Master read-write Listener]
type=listener
service=Core01 Master read-write Service
protocol=MySQLClient
port=3310

##### READ ONLY #####

[History01 Read-Only Service]
type=service
router=readconnroute
servers=history01, history11
user=maxscale
passwd=***
# Impossible to use router_option slave because mmmon does not monitor these.
# mysqlmon cannot monitor it because there is a multi master setup causing no master to be selected by mysqlmon and the cluster of 2 slaves getting 'slave from external master' state. --michael@mariadb 2016-08-26
# router_options=slave
#filters=MyRegexFilter

[History01 Read-Only Listener]
type=listener
service=History01 Read-Only Service
protocol=MySQLClient
port=3317

##

[MaxAdmin Service]
type=service
router=cli

[MaxAdmin Listener]
type=listener
service=MaxAdmin Service
protocol=maxscaled
port=6603
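To make the `script=` line above easier to follow: the monitor substitutes `$EVENT`, `$INITIATOR` and `$NODELIST` before launching the script. The sketch below assumes the initiator arrives as `host:port` and the node list as a comma-separated list of `host:port` entries; the function names and sample hostnames are hypothetical, and the rule "server name is the last dash-separated part of the hostname" is only an assumption about this naming scheme.

```shell
#!/bin/bash
# Hypothetical helpers for splitting the values the monitor passes in.
# Assumes "host:port" for the initiator and "host:port,host:port" for the node list.

# Print the host of the first entry in a comma-separated node list.
first_node_host() {
    first=${1%%,*}                # keep the text before the first comma
    printf '%s\n' "${first%%:*}"  # strip the ":port" suffix
}

# Print the host part of an initiator given as "host:port".
initiator_host() {
    printf '%s\n' "${1%%:*}"
}

# Assumed naming rule: the MaxScale server name is the last
# dash-separated field of the hostname (e.g. "...-core01" -> "core01").
server_name_from_host() {
    printf '%s\n' "${1##*-}"
}
```

With these, `server_name_from_host "$(initiator_host "$initiator")"` would give the name to pass to `maxadmin set server`, provided the hostnames really follow that pattern.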

My script:

#!/bin/bash
# failover_master.sh

# Parse the --event, --initiator and --nodelist arguments passed by the monitor.
ARGS=$(getopt -o '' --long 'event:,initiator:,nodelist:' -- "$@")
eval set -- "$ARGS"

while true; do
    case "$1" in
        --event)
            shift;
            event=$1
            shift;
            ;;
        --initiator)
            shift;
            initiator=$1
            shift;
            ;;
        --nodelist)
            shift;
            nodelist=$1
            shift;
            ;;
        --)
            shift;
            break;
            ;;
    esac
done

# The first node in the list becomes the candidate master.
candidate=`echo "$nodelist" | awk -F':' '{print $1}'`
# Derive the MaxScale server name from the initiator's hostname.
maxscale_host=`echo "$initiator" | awk -F'-' '{print $5}'`
maxscale_host=`echo "$maxscale_host" | awk -F':' '{print $1}'`

if [ -z "$candidate" ]; then
   echo "ERROR!!! NO candidate master found when failing over $initiator! The system might be down."|wall
   echo "ERROR!!! NO candidate master found! The system might be down."
   exit 0
fi

# WORK AROUND for race condition, see https://jira.mariadb.org/browse/MXS-845
currently_in_maintenance=`maxadmin -pmariadb list servers|grep Maintenance|grep $maxscale_host|wc -l`
if [ $currently_in_maintenance = "0" ]; then
   maxadmin -pmariadb set server $maxscale_host maintenance
   maxadmin -pmariadb clear server $maxscale_host running
else
   echo "This script is not the first one, exiting."|wall
   exit 1
fi

# Loosen durability (the D in ACID) to speed up applying any lag.
mysql -u maxscale -p'***' --host=$candidate -e "set global sync_binlog=0; set global innodb_flush_log_at_trx_commit=0;set global innodb_io_capacity=50000;"

while true; do
   echo "Waiting until all transactions have been applied on candidate master $candidate..."
   sleep 1
   SLAVESTAT=$(mysql -umaxscale -p'***' --host=$candidate -e "show slave status\G");
   exec_master_pos=`echo "$SLAVESTAT" | grep -w 'Exec_Master_Log_Pos:' | awk '{print $2}'`
   read_master_pos=`echo "$SLAVESTAT" | grep -w 'Read_Master_Log_Pos:' | awk '{print $2}'`

   # If the read position moved since the last iteration, the old master is still sending transactions.
   if [ -n "$old_read_master_pos" ] && [ ! "$old_read_master_pos" = "$read_master_pos" ]; then
      echo "ERROR!!! Old master $initiator still receives transactions after putting it into maintenance! Manual intervention required to make sure the old master is really down."
      echo "ERROR!!! Old master $initiator still receives transactions after putting it into maintenance! Manual intervention required to make sure the old master is really down."|wall
      exit 0
   fi

   # Remember the value (the "$" was missing here, which stored the literal string instead).
   old_read_master_pos=$read_master_pos

   count=`expr $read_master_pos - $exec_master_pos`
   if [ $count -eq 0 ]; then
      # Everything has been applied: make the candidate writable and restore durability settings.
      mysql -umaxscale -p'***' --host=$candidate -e "set global read_only=OFF; set global sync_binlog=1; set global innodb_flush_log_at_trx_commit=1;SET GLOBAL innodb_io_capacity=200"
      break;
   fi
done
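The maintenance-mode workaround in the script checks the state first and sets it afterwards, so several invocations that start close together could all pass the check. As a possible alternative guard, here is a minimal sketch that relies on `mkdir` being atomic: only one caller can create the directory. The lock path and function names are hypothetical, and this does not address why the monitor fires the event repeatedly.

```shell
#!/bin/bash
# Hypothetical lock location; any path writable by the script's user would do.
LOCK_DIR=${LOCK_DIR:-/tmp/failover-master.lock}

# mkdir either creates the directory or fails, so exactly one caller succeeds.
acquire_failover_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null
}

# Remove the lock so that a later failover can run again.
release_failover_lock() {
    rmdir "$LOCK_DIR" 2>/dev/null
}

# Possible use near the top of the failover script:
# if ! acquire_failover_lock; then
#     echo "This script is not the first one, exiting." | wall
#     exit 1
# fi
# trap release_failover_lock EXIT
```

Whether to release the lock on exit, as in the commented usage, or to keep it until an operator clears it depends on whether the repeated events arrive while the first run is still active or only after it has finished.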

Issue Links

relates to

MXS-845 "Server down" event is re-triggered after maintenance mode is repeated

Closed

MXS-846 MMMon: Maintenance mode on slave logs error message every second

Closed

People

Assignee: markus makela

Reporter: Michaël de groot

Votes: 0

Watchers: 2

Dates

Created: 2016-08-26 12:44

Updated: 2016-08-31 14:32

Resolved: 2016-08-30 07:46
