[MXS-656] after upgrade from 1.3 to 1.4, selecting master isn't working as expected
Created: 2016-04-01  Updated: 2016-04-15  Resolved: 2016-04-04

Status: Closed
Project: MariaDB MaxScale
Component/s: galeramon
Affects Version/s: 1.4.0, 1.4.1
Fix Version/s: 1.4.2

Type: Bug
Priority: Major
Reporter: Wesley Schaft
Assignee: markus makela
Resolution: Fixed
Votes: 0
Labels: None
Environment:

CentOS release 6.7 (Final)
kernel: 2.6.32-573.18.1.el6.x86_64
Galera Cluster / MariaDB: 5.5.46-MariaDB-wsrep MariaDB Server, wsrep_25.12.r4f81026



 Description   

We currently have 12 MaxScale (version 1.3.0) servers running behind HAProxy, connected to a Galera Cluster of 4 nodes (1 master, 3 slaves).

I've put 1 server in maintenance, so I could try to update MaxScale to 1.4.1.

Situation before updating (and running 1.3.0):

# maxadmin -pmariadb list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
db-03          | 192.168.120.74  |  3306 |          22 | Master, Synced, Running
db-04          | 192.168.120.95  |  3306 |          23 | Slave, Synced, Running
db-05          | 192.168.120.96  |  3306 |          22 | Slave, Synced, Running
db-06          | 192.168.120.97  |  3306 |          23 | Slave, Synced, Running
-------------------+-----------------+-------+-------------+--------------------

After updating (running 1.4.1):

# maxadmin -pmariadb list servers
Servers.
-------------------+-----------------+-------+-------------+--------------------
Server             | Address         | Port  | Connections | Status
-------------------+-----------------+-------+-------------+--------------------
db-03          | 192.168.120.74  |  3306 |           1 | Slave, Synced, Running
db-04          | 192.168.120.95  |  3306 |           1 | Slave, Synced, Running
db-05          | 192.168.120.96  |  3306 |           1 | Slave, Synced, Running
db-06          | 192.168.120.97  |  3306 |           1 | Master, Synced, Running
-------------------+-----------------+-------+-------------+--------------------

Whatever we do, we can't get db-03 to be Master again. It will always change back to db-06.
db-03 has the lowest wsrep_local_index, so it should be elected as Master, right?

We've noticed a difference in the "Node Id" reported by "show servers": for db-03, the Node Id is 0 on 1.3.0 and -1 on 1.4.1.

On 1.3.0:

# maxadmin -pmariadb show servers | egrep 'Id|Server '
Server 0x25903f0 (db-03)
        Server Version:                 5.5.45-MariaDB-wsrep
        Node Id:                     0
        Master Id:                   -1
Server 0x25902b0 (db-04)
        Server Version:                 5.5.46-MariaDB-wsrep
        Node Id:                     3
        Master Id:                   -1
Server 0x2590190 (db-05)
        Server Version:                 5.5.46-MariaDB-wsrep
        Node Id:                     2
        Master Id:                   -1
Server 0x258ffc0 (db-06)
        Server Version:                 5.5.46-MariaDB-wsrep
        Node Id:                     1
        Master Id:                   -1

On 1.4.1:

# maxadmin -pmariadb show servers | egrep 'Id|Server '
Server 0x87efe0 (db-03)
        Server Version:                 5.5.45-MariaDB-wsrep
        Node Id:                     -1
        Master Id:                   -1
Server 0x87eea0 (db-04)
        Server Version:                 5.5.46-MariaDB-wsrep
        Node Id:                     3
        Master Id:                   -1
Server 0x87ed60 (db-05)
        Server Version:                 5.5.46-MariaDB-wsrep
        Node Id:                     2
        Master Id:                   -1
Server 0x87eb70 (db-06)
        Server Version:                 5.5.46-MariaDB-wsrep
        Node Id:                     1
        Master Id:                   -1

The logfile shows some errors:

2016-04-01 09:50:36   error  : Couldn't find suitable Master from 4 candidates.
2016-04-01 09:50:36   error  : Failed to create new router session for service 'Splitter Service'. See previous errors for more details.

When we downgrade back to 1.3.0, server db-03 is Master again.



 Comments   
Comment by markus makela [ 2016-04-02 ]

You can control which server is the master by using the priority functionality in the Galera monitor: https://mariadb.com/kb/en/mariadb-enterprise/mariadb-maxscale/maxscale-galera-monitor/

The fact that the node ID is different does seem to be a bug of some sort.

Comment by Wesley Schaft [ 2016-04-04 ]

Thank you Markus, with the priority option, server db-03 is seen as Master in MaxScale 1.4.
However, we didn't use that option in MaxScale 1.3 and selecting the Master went OK. Maybe that was just a coincidence?

Comment by markus makela [ 2016-04-04 ]

The node with the lowest index should still be seen as the master. This behavior is not expected and we'll investigate why it happens.

If any information about the cluster or how MaxScale is configured is available, please provide it.
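
To illustrate the expected rule, here is a minimal sketch of lowest-index selection. It is not the galeramon implementation: the `node` struct and the `pick_master` function are hypothetical, and the sketch assumes that nodes with a negative index are skipped. Under that assumption, a node whose index is reported as -1 can never be chosen:

```c
#include <stddef.h>

/* Hypothetical, simplified view of a monitored Galera node. */
struct node
{
    const char *name;
    long node_id; /* wsrep_local_index, or -1 if unknown */
    int synced;   /* non-zero when the node is in the Synced state */
};

/* Return the synced node with the lowest non-negative index,
 * or NULL when no node qualifies. */
static const struct node *pick_master(const struct node *nodes, size_t count)
{
    const struct node *best = NULL;

    for (size_t i = 0; i < count; i++)
    {
        if (!nodes[i].synced || nodes[i].node_id < 0)
        {
            continue; /* assumption: unknown or unsynced nodes are skipped */
        }
        if (best == NULL || nodes[i].node_id < best->node_id)
        {
            best = &nodes[i];
        }
    }
    return best;
}
```

With the indexes reported by 1.3.0 (0, 3, 2, 1) this sketch picks db-03; with those reported by 1.4.1 (-1, 3, 2, 1) it picks db-06.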

Comment by markus makela [ 2016-04-04 ]

A look at the diff between 1.3.0 and 1.4.1 doesn't point to any reason why the indexes would be interpreted differently. Looking at the 1.3.0 code does point out one oddity:

while ((row = mysql_fetch_row(result)))
{
    local_index = strtol(row[1], NULL, 10);
    if ((errno == ERANGE && (local_index == LONG_MAX
                             || local_index == LONG_MIN)) || (errno != 0 && local_index == 0))
    {
        local_index = -1;
    }
    database->server->node_id = local_index;
}
mysql_free_result(result);

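A parsed local_index of 0 is replaced with -1 whenever errno is not 0, and this snippet never resets errno before calling strtol, so a stale error value from an earlier call is enough to trigger it. The correct check would be to inspect the end pointer set by strtol, which marks the last parsed character.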
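
A minimal sketch of such an end-pointer check, assuming a standalone helper (the name `parse_node_index` is hypothetical and is not taken from the MaxScale source, and this is not necessarily how the actual fix was written). It resets `errno` before calling `strtol` and rejects input where no digits were consumed or trailing characters remain:

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a wsrep_local_index string.
 * Returns the parsed index, or -1 if the string is not a valid
 * non-negative decimal number. As with strtol itself, leading
 * whitespace and a leading '+' are accepted. */
static long parse_node_index(const char *str)
{
    char *end = NULL;
    long value;

    if (str == NULL)
    {
        return -1;
    }

    errno = 0; /* clear any stale error before the call */
    value = strtol(str, &end, 10);

    if (end == str || *end != '\0') /* no digits, or trailing characters */
    {
        return -1;
    }
    if (errno == ERANGE || value < 0) /* out of range, or negative */
    {
        return -1;
    }
    return value;
}
```

Because validity is decided from the end pointer rather than from the returned value, an index of "0" parses as 0 even if `errno` held a non-zero value before the call.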

Comment by Wesley Schaft [ 2016-04-04 ]

This is our current, working MaxScale 1.4.1 config (with the passwords removed):

[maxscale]
threads=4
 
[Splitter Service]
type=service
router=readwritesplit
localhost_match_wildcard_host=1
servers=db-03,db-04,db-05,db-06
user=maxscale_user
passwd=******
connection_timeout=3600
weightby=myweight
router_options=slave_selection_criteria=LEAST_GLOBAL_CONNECTIONS
max_slave_connections=1
 
[Splitter Listener]
type=listener
service=Splitter Service
protocol=MySQLClient
port=3306
socket=/tmp/ClusterMaster
 
[db-03]
type=server
address=192.168.120.74
port=3306
protocol=MySQLBackend
priority=1
 
[db-04]
type=server
address=192.168.120.95
port=3306
protocol=MySQLBackend
priority=2
myweight=1
 
[db-05]
type=server
address=192.168.120.96
port=3306
protocol=MySQLBackend
priority=3
myweight=1
 
[db-06]
type=server
address=192.168.120.97
port=3306
protocol=MySQLBackend
priority=4
myweight=1
 
[Galera Monitor]
type=monitor
module=galeramon
disable_master_failback=0
servers=db-03,db-04,db-05,db-06
user=maxscale_user
passwd=******
use_priority=true
 
[CLI]
type=service
router=cli
 
[CLI Listener]
type=listener
service=CLI
protocol=maxscaled
address=localhost
port=6603

Comment by markus makela [ 2016-04-04 ]

Fixed in commit 358c1946a75c7e3a9d7e4740d98ec43f4000ce44
