[MXS-511] ReadWriteSplit router won't choose node as master and logs confusing "RUNNING MASTER" error message Created: 2016-01-04  Updated: 2016-02-15  Resolved: 2016-02-15

Status: Closed
Project: MariaDB MaxScale
Component/s: readwritesplit
Affects Version/s: 1.2.1
Fix Version/s: 1.4.0

Type: Bug Priority: Minor
Reporter: Geoff Montee (Inactive) Assignee: markus makela
Resolution: Fixed Votes: 1
Labels: None


 Description   

In some cases, MaxScale will not choose a node as master, and it logs a confusing error message that says it can't do so because the node is "RUNNING MASTER".

2015-12-30 15:06:46   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.

This error message is confusing. Can it be clarified?

This was seen in the following context:

A user is using MaxScale's ReadWriteSplit router along with a 3-node Galera Cluster.

At one point, the cluster's master hung. At that point, MaxScale could no longer connect to the master.

MaxScale log:

2015-12-30 14:32:51   Error : Monitor was unable to connect to server 10.10.10.130:3306 : "Can't connect to MySQL server on '10.10.10.130' (4)"
2015-12-30 14:32:51   Error : server 10.10.10.130:3306 lost the master status. Readwritesplit service can't locate the master. Client sessions will be closed.

The user brought up the hung node shortly after.

Galera Cluster log:

2015-12-30 15:05:39 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2015-12-30 15:05:39 0 [Note] /usr/sbin/mysqld (mysqld 5.6.26-74.0-56-log) starting as process 5932 ...
2015-12-30 15:05:39 5932 [Note] WSREP: Read nil XID from storage engines, skipping position init
2015-12-30 15:05:39 5932 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/libgalera_smm.so'
2015-12-30 15:05:39 5932 [Note] WSREP: wsrep_load(): Galera 3.12.2(rf3e626d) by Codership Oy <info@codership.com> loaded successfully.

The node needed a state transfer, but it was able to synchronize with the group fairly quickly:

Galera Cluster log:

2015-12-30 15:05:39 5932 [Warning] WSREP: Gap in state sequence. Need state transfer.
2015-12-30 15:05:39 5932 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.10.10.130' --datadir '/home/galera/data/' --defaults-file '/home/galera/conf/my.cnf' --defaults-group-suffix '' --parent '5932' --binlog '/home/galera/logdata/mysql-bin' '
WSREP_SST: [INFO] Streaming with xbstream (20151230 15:05:40.004)
WSREP_SST: [INFO] Using socat as streamer (20151230 15:05:40.006)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20151230 15:05:40.077)
2015-12-30 15:05:40 5932 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.10.10.130:4444/xtrabackup_sst//1
2015-12-30 15:05:40 5932 [Note] WSREP: REPL Protocols: 7 (3, 2)
2015-12-30 15:05:40 5932 [Note] WSREP: Service thread queue flushed.
2015-12-30 15:05:40 5932 [Note] WSREP: Assign initial position for certification: 2604394632, protocol version: 3
2015-12-30 15:05:40 5932 [Note] WSREP: Service thread queue flushed.
2015-12-30 15:05:40 5932 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.10.10.130:4568
2015-12-30 15:05:40 5932 [Note] WSREP: Member 0.0 (node2) requested state transfer from '*any*'. Selected 1.0 (node3)(SYNCED) as donor.
2015-12-30 15:05:40 5932 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 2604394640)
2015-12-30 15:05:40 5932 [Note] WSREP: Requesting state transfer: success, donor: 1
2015-12-30 15:05:41 5932 [Note] WSREP: 1.0 (node3): State transfer to 0.0 (node2) complete.
2015-12-30 15:05:41 5932 [Note] WSREP: Member 1.0 (node3) synced with group.
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20151230 15:05:41.583)
WSREP_SST: [INFO] Galera co-ords from recovery: aed16fff-961b-11e3-b31d-ebb207beff83:2604252415 (20151230 15:05:41.585)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20151230 15:05:41.587)
WSREP_SST: [INFO] Removing the sst_in_progress file (20151230 15:05:41.588)
2015-12-30 15:05:41 5932 [Note] WSREP: SST complete, seqno: 2604252415
2015-12-30 15:05:41 5932 [Note] Plugin 'FEDERATED' is disabled.
2015-12-30 15:05:41 5932 [Warning] Using unique option prefix innodb_read_ahead instead of innodb-read-ahead-threshold is deprecated and will be removed in a future release. Please use the full name instead.
2015-12-30 15:05:41 7f1ed49e8720 InnoDB: Warning: Using innodb_locks_unsafe_for_binlog is DEPRECATED. This option may be removed in future releases. Please use READ COMMITTED transaction isolation level instead, see http://dev.mysql.com/doc/refman/5.6/en/set-transaction.html.
2015-12-30 15:05:41 5932 [Note] InnoDB: Using atomics to ref count buffer pool pages
2015-12-30 15:05:41 5932 [Note] InnoDB: The InnoDB memory heap is disabled
2015-12-30 15:05:41 5932 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-12-30 15:05:41 5932 [Note] InnoDB: Memory barrier is not used
2015-12-30 15:05:41 5932 [Note] InnoDB: Compressed tables use zlib 1.2.7
2015-12-30 15:05:41 5932 [Note] InnoDB: Using Linux native AIO
2015-12-30 15:05:41 5932 [Note] InnoDB: Using CPU crc32 instructions
2015-12-30 15:05:41 5932 [Note] InnoDB: Initializing buffer pool, size = 208.0G
2015-12-30 15:05:42 5932 [Note] WSREP: (243d391e, 'tcp://0.0.0.0:4567') turning message relay requesting off
2015-12-30 15:05:51 5932 [Note] InnoDB: Completed initialization of buffer pool
2015-12-30 15:05:53 5932 [Note] InnoDB: Highest supported file format is Barracuda.
2015-12-30 15:05:56 5932 [Note] InnoDB: 128 rollback segment(s) are active.
2015-12-30 15:05:56 5932 [Note] InnoDB: Waiting for purge to start
2015-12-30 15:05:56 5932 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.26-74.0 started; log sequence number 2911365201969
2015-12-30 15:05:56 7ee2b24bf700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
2015-12-30 15:05:56 5932 [Note] RSA private key file not found: /home/galera/data//private_key.pem. Some authentication plugins will not work.
2015-12-30 15:05:56 5932 [Note] RSA public key file not found: /home/galera/data//public_key.pem. Some authentication plugins will not work.
2015-12-30 15:05:56 5932 [Note] Server hostname (bind-address): '0.0.0.0'; port: 3306
2015-12-30 15:05:56 5932 [Note]   - '0.0.0.0' resolves to '0.0.0.0';
2015-12-30 15:05:56 5932 [Note] Server socket created on IP: '0.0.0.0'.
2015-12-30 15:05:56 5932 [ERROR] /usr/sbin/mysqld: Table './mysql/user' is marked as crashed and should be repaired
2015-12-30 15:05:56 5932 [Warning] Checking table:   './mysql/user'
2015-12-30 15:05:56 5932 [ERROR] 1 client is using or hasn't closed the table properly
2015-12-30 15:05:56 5932 [ERROR] /usr/sbin/mysqld: Table './mysql/db' is marked as crashed and should be repaired
2015-12-30 15:05:56 5932 [Warning] Checking table:   './mysql/db'
2015-12-30 15:05:56 5932 [ERROR] 1 client is using or hasn't closed the table properly
2015-12-30 15:05:56 5932 [Note] Event Scheduler: Loaded 0 events
2015-12-30 15:05:56 5932 [Note] WSREP: Signalling provider to continue.
2015-12-30 15:05:56 5932 [Note] WSREP: Initialized wsrep sidno 3
2015-12-30 15:05:56 5932 [Note] WSREP: SST received: aed16fff-961b-11e3-b31d-ebb207beff83:2604252415
2015-12-30 15:05:56 5932 [Note] WSREP: Receiving IST: 142217 writesets, seqnos 2604252415-2604394632
2015-12-30 15:05:56 5932 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.26-74.0-56-log'  socket: '/home/galera/run/mysqld.sock'  port: 3306  Percona XtraDB Cluster (GPL), Release rel74.0, Revision 624ef81, WSREP version 25.12, wsrep_25.12
2015-12-30 15:06:41 5932 [Note] WSREP: IST received: aed16fff-961b-11e3-b31d-ebb207beff83:2604394632
2015-12-30 15:06:41 5932 [Note] WSREP: 0.0 (node2): State transfer from 1.0 (node3) complete.
2015-12-30 15:06:41 5932 [Note] WSREP: Shifting JOINER -> JOINED (TO: 2604395006)
2015-12-30 15:06:41 5932 [Note] WSREP: Member 0.0 (node2) synced with group.
2015-12-30 15:06:41 5932 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 2604395009)
2015-12-30 15:06:41 5932 [Note] WSREP: Synchronized with group, ready for connections
2015-12-30 15:35:16 7ee2b24bf700 InnoDB: Buffer pool(s) load completed at 151230 15:35:16

After the node synchronized with the group, I would expect MaxScale to fail back to the old master. However, MaxScale did not choose the node as master.

MaxScale log:

2015-12-30 15:06:46   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:49   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:50   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:51   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:52   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:53   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:55   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:55   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:58   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:06:59   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:07:00   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:07:01   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:07:03   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:07:03   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:07:04   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 15:07:52   Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.
2015-12-30 19:05:46   MaxScale received signal SIGTERM. Exiting.
2015-12-30 19:05:46   MaxScale is shut down.

After MaxScale was restarted, it had no problem choosing the node as master.



 Comments   
Comment by markus makela [ 2016-01-05 ]

The node can't be chosen as the master because MaxScale doesn't see it as a synced node. This is expected if the node isn't synced, but if the node is synced and MaxScale still sees it as out-of-sync, it seems like a possible bug, especially since a restart fixes it.

It is also likely that this is related to MXS-472 and is fixed in 1.3.0.

Comment by Geoff Montee (Inactive) [ 2016-01-05 ]

Thanks for the reply, markus makela!

The node can't be chosen as the master because MaxScale doesn't see it as a synced node.

If that's what the error message is supposed to mean, should we change the error message to reflect that?

e.g. change it from this:

Error : Server at 10.10.10.130:3306 should be master but is RUNNING MASTER instead and can't be chosen to master.

to something more like this:

Error : Server at 10.10.10.130:3306 should be master, but it is not synced, so it can't be chosen to be master.

or does the error message have to be somewhat generic to support multiple types of monitors?

This is expected if the node isn't synced but if the node is synced and MaxScale still sees it as out-of-sync it seems like a possible bug, especially if a restart fixes it.

The node was synced. You can see in the Galera Cluster error log snippet that it changed status to SYNCED at 15:06:41.

It is also likely that this is related to MXS-472 and is fixed in 1.3.0.

That does look like it could be related.

Comment by Johan Wikman [ 2016-02-01 ]

Downgrading to minor since the root cause should have been fixed by https://mariadb.atlassian.net/browse/MXS-472.

The unclear message should be cleared up.

Comment by markus makela [ 2016-02-15 ]

Updated the function so that it will always generate consistent output.
