[MXS-38] bugzillaId-688: Incorrect backend state leads to confusing and untraced client errors Created: 2015-01-08  Updated: 2016-02-04  Resolved: 2016-02-04

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 1.0.3
Fix Version/s: 1.3.0

Type: Bug Priority: Minor
Reporter: Kolbe Kegel (Inactive) Assignee: Massimiliano Pinto (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

All OS



 Description   

This is imported from bugzilla item:
http://bugs.mariadb.com/show_bug.cgi?id=688

Description Kolbe Kegel 2015-01-08 22:44:19 UTC
I shut down my "monitor" manually, stopped the mysql service on one of the nodes, and left that node marked as "master". It appears that I can connect to the cluster through MaxScale, but my session is ended right away, and on my first attempt to execute a query I get "MySQL server has gone away".

 
[root@db3 ~]# mysql -h 192.168.30.38 -P 4006 -u maxuser -pmaxpwd
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 2140
Server version: 10.0.14-MariaDB-Enterprise-Cluster
 
Copyright (c) 2000, 2014, Oracle, SkySQL Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [test]>
 
MariaDB [test]> select @@hostname;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    2140
Current database: test
 
ERROR 2013 (HY000): Lost connection to MySQL server during query
MariaDB [test]>
 
 

Nothing is logged in this situation in the MaxScale error log or message log, and nothing relevant seems to be written to the trace log either:

 
2015-01-08 14:35:35   Servers and router connection counts:
2015-01-08 14:35:35   current operations : 0 in         192.168.30.33:3306 RUNNING MASTER
2015-01-08 14:35:35   current operations : 0 in         192.168.30.32:3306 RUNNING SLAVE
2015-01-08 14:35:35   current operations : 0 in         192.168.30.31:3306 RUNNING JOINED
2015-01-08 14:35:35   Selected RUNNING MASTER in        192.168.30.33:3306
2015-01-08 14:35:35   Selected RUNNING SLAVE in         192.168.30.32:3306
2015-01-08 14:35:35   Started RW Split Router client session [125] for 'maxuser' from 192.168.30.38
2015-01-08 14:35:36   [125]  Stopped RW Split Router client session [125]
2015-01-08 14:35:36   Servers and router connection counts:
2015-01-08 14:35:36   current operations : 0 in         192.168.30.33:3306 RUNNING MASTER
2015-01-08 14:35:36   current operations : 0 in         192.168.30.32:3306 RUNNING SLAVE
2015-01-08 14:35:36   current operations : 0 in         192.168.30.31:3306 RUNNING JOINED
2015-01-08 14:35:36   Selected RUNNING MASTER in        192.168.30.33:3306
2015-01-08 14:35:36   Selected RUNNING SLAVE in         192.168.30.32:3306
2015-01-08 14:35:36   Started RW Split Router client session [126] for 'maxuser' from 192.168.30.38
2015-01-08 14:35:36   [126]  > Autocommit: [enabled], trx is [not open], cmd: COM_QUERY, type: QUERY_TYPE_SYSVAR_READ, stmt: select @@version_comment limit 1
2015-01-08 14:35:36   [126]  Route query to slave       192.168.30.32:3306 <
2015-01-08 14:35:37   [126]  Stopped RW Split Router client session [126]
 

I think it would be good if this kind of anomaly were noted in the error log as well as the trace log. A failed attempt to connect to a server is a serious problem, since it means that the server state in MaxScale doesn't match reality.

Ideally, MaxScale could also give the client a better error message; at present the connection is simply severed without any information at all.



 Comments   
Comment by Dipti Joshi (Inactive) [ 2015-03-10 ]

This is the comment history imported from bugzilla:

Comment 1 Kolbe Kegel 2015-01-08 22:52:10 UTC
Here's the state of the servers I have set inside of MaxScale:

MaxScale> show servers
Server 0x39c48d0 (server1)
        Server:                         192.168.30.31
        Status:                         Synced, Running
        Protocol:                       MySQLBackend
        Port:                           3306
        Server Version:                 10.0.14-MariaDB-certified-wsrep-log
        Node Id:                        0
        Master Id:                      -1
        Repl Depth:                     0
        Number of connections:          76
        Current no. of conns:           0
        Current no. of operations:      0
Server 0x39c47c0 (server2)
        Server:                         192.168.30.32
        Status:                         Slave, Synced, Running
        Protocol:                       MySQLBackend
        Port:                           3306
        Server Version:                 10.0.14-MariaDB-certified-wsrep-log
        Node Id:                        1
        Master Id:                      -1
        Repl Depth:                     0
        Number of connections:          49
        Current no. of conns:           0
        Current no. of operations:      0
Server 0x39c45e0 (server3)
        Server:                         192.168.30.33
        Status:                         Master, Synced, Running
        Protocol:                       MySQLBackend
        Port:                           3306
        Server Version:                 10.0.14-MariaDB-certified-wsrep-log
        Node Id:                        2
        Master Id:                      -1
        Repl Depth:                     0
        Number of connections:          126
        Current no. of conns:           38
        Current no. of operations:      0

And here is the "real" state of the services:

[root@max1 ~]# for s in db{1..3}; do printf "$s: "; ssh "$s" service mysql status; done
db1:  SUCCESS! MySQL running (843)
db2:  SUCCESS! MySQL running (27165)
db3:  ERROR! MySQL is not running

Comment 2 Mark Riddoch 2015-02-13 10:22:18 UTC
If the incorrect state persists this is a bug in the Galera Monitor

Comment 3 Kolbe Kegel 2015-02-13 16:52:17 UTC
There's no bug in the monitor. Note that 'I shutdown my "monitor" manually' to reproduce this situation.

Since the monitors' functionality is limited, there are situations in which a user may need to work without a monitor and inform MaxScale manually about the state of the servers. This bug relates to a problem found when using that workflow.
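For reference, the manual workflow described above is done through MaxAdmin's "set server"/"clear server" commands. A minimal sketch follows; it assumes a running MaxScale with a default MaxAdmin setup, and the server names and exact status keywords accepted may vary by version:

# With no monitor running, tell MaxScale about server states by hand.
maxadmin set server server1 master
maxadmin set server server2 slave
# Take the stopped node out of rotation (illustrative; "maintenance" is one
# status bit MaxAdmin can set manually).
maxadmin set server server3 maintenance
# Verify the states MaxScale now holds.
maxadmin show servers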

Comment by markus makela [ 2015-05-04 ]

I tried this with MaxScale 1.1 and the behavior is as expected. If the servers are down but MaxScale still sees them as running, all connections hang. This only happens when there are no monitors and the server states are set manually. I could not reproduce the disconnecting behavior when server states are set manually.

I did notice that if a connection is made to a server that MaxScale sees as running but that is in reality down, the connection hangs. If the server is then manually set into a failed state through MaxAdmin, current connections to that server aren't severed, even though they should be.

Comment by markus makela [ 2016-02-04 ]

Based on the testing done with 1.1, MaxScale is working as intended.

Generated at Thu Feb 08 03:56:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.