[MXS-4738] The fact that disable_master_failback does not work with root_node_as_master is not documented Created: 2023-09-04  Updated: 2023-09-11  Resolved: 2023-09-11

Status: Closed
Project: MariaDB MaxScale
Component/s: Documentation
Affects Version/s: 23.02.3
Fix Version/s: 2.5.29, 6.4.11, 22.08.9, 23.02.5

Type: Bug Priority: Minor
Reporter: Presnickety Assignee: markus makela
Resolution: Done Votes: 0
Labels: None
Environment:

RHEL v8.2
VMware vCenter v7
MariaDB v10.7 (3 node Galera cluster)
MaxScale v23.02.3 (one instance of each per MariaDB VM for redundancy)


Attachments: PNG File MXS-4738-01.PNG     PNG File MXS-4738-02.PNG     PNG File MXS-4738-03.PNG     PNG File MXS-4738-04.PNG     PNG File MXS-4738-05.PNG     PNG File MXS-4738-06.PNG     PNG File MXS-4738-07.PNG     PNG File MXS-4738-08.PNG     Text File maxscale-server-1.txt     Text File maxscale-server-2.txt     Text File maxscale-server-3.txt    

 Description   

The fact that disable_master_failback works against what root_node_as_master tries to accomplish is not documented, and the two parameters can be used together without any warning.


Original title: Maxscale | Multiple instances in a cluster go out of sync.

Original description:
Hi there,

We have multiple instances of MaxScale running in the same DB cluster. Occasionally the maxctrl administrative client reports an inconsistency in states amongst the nodes, e.g. MaxScale #1 reports server #1 as the Master, while MaxScale #2 & #3 report server #2 as Master, etc. Could you suggest the steps to correct these inconsistencies?

Thanks.



 Comments   
Comment by markus makela [ 2023-09-04 ]

It's probably because of disable_master_failback=true in the configuration; it shouldn't be used together with root_node_as_master=true.

I think that the root_node_as_master=true parameter could forbid the use of disable_master_failback=true, as they're aimed at different goals. root_node_as_master is used to keep a consistent view of the cluster across multiple MaxScale instances, whereas disable_master_failback is used to minimize the amount of changes in the node states, i.e. don't move the Master label unless the node goes down. For now I'll update the documentation to clearly state that although the two can be used together, it is generally not a good idea to do so.
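For multiple MaxScale instances that should all agree on which node is Master, the monitor section would keep root_node_as_master enabled and leave master failback active. A minimal sketch of such a galeramon section (the monitor and server names here are placeholders, not taken from this ticket):

```ini
# galeramon monitor for a 3-node Galera cluster, shared by all MaxScale instances.
[Galera-Monitor]
type=monitor
module=galeramon
servers=db-01,db-02,db-03
user=maxscale
password=maxscale
# Every instance labels the Galera root node (wsrep_local_index = 0) as Master,
# so all MaxScale instances pick the same node.
root_node_as_master=true
# Leave failback enabled: disable_master_failback=true would pin the Master
# label to the previously chosen node and defeat the consistent view above.
disable_master_failback=false
```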

Comment by Presnickety [ 2023-09-04 ]

Hello Markus,
We've set root_node_as_master=false. Will observe for a few days and report back.
Thanks.

Comment by markus makela [ 2023-09-04 ]

If you want the MaxScale instances to always pick the same node, you should disable disable_master_failback instead of root_node_as_master.

Comment by Presnickety [ 2023-09-04 ]

Hi Markus,

Yes, we will try this. After the previous reconfiguration we got the same issue. Please see the attached pics. MaxScale also threw the following error across the three nodes:

2023-09-04 22:30:27 notice : Started replicating from [10.195.241.81]:3306 at GTID '1-1-79197812'
2023-09-04 22:30:27 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.

The error message disappeared after restarting all 3 load balancers.

Thanks.

Comment by markus makela [ 2023-09-05 ]

Looking at the server status, it looks like disable_master_failback was still in use. The Master Stickiness status is only shown if that parameter is enabled.

Comment by Presnickety [ 2023-09-06 ]

Hi Markus,

Yes, we've set disable_master_failback=false. We still see connection timeouts & CPU soft lockups, but so far the load balancers have not crashed. Increased Galera replication timeouts have helped; nodes have not dropped from the cluster as yet:

[Galera-Monitor-VIEXH]
type=monitor
module=galeramon
servers=viexh-session-usage-mdb-01,viexh-session-usage-mdb-02,viexh-session-usage-mdb-03
user=maxscale
password=maxscale
available_when_donor=true
backend_connect_timeout=2s
backend_connect_attempts=3
backend_read_timeout=2s
backend_write_timeout=2s
#disable_master_failback=true
disable_master_failback=false
disable_master_role_setting=false
##disk_space_threshold=/data10:80
##disk_space_check_interval=60000ms
journal_max_age=172800s
monitor_interval=1000ms
root_node_as_master=true
#root_node_as_master=false
set_donor_nodes=true
use_priority=true

[Read-Write-Split]
type=service
router=readwritesplit
servers=viexh-session-usage-mdb-01,viexh-session-usage-mdb-02,viexh-session-usage-mdb-03
user=maxscale
password=maxscale
connection_keepalive=30s
connection_timeout=540s
#disable_sescmd_history=true
master_accept_reads=true
#master_failure_mode=fail_instantly
#master_failure_mode=error_on_write
master_failure_mode=fail_on_write
master_reconnection=true
causal_reads=fast
#causal_reads=none
#transaction_replay=true
transaction_replay=false
max_connections=2048
retry_failed_reads=false

Thanks.

Comment by Presnickety [ 2023-09-09 ]

Hi Markus,

Failover tests with the write master are showing much better results, and the apps are experiencing fewer connection problems. However, we sometimes see the following messages appear repeatedly in the logs once any node that has left the cluster rejoins it. Could you suggest where the GTID values below come from if they are not available in the binlogs? The messages appear even if a node has been out of the cluster for only a few minutes, with binlog_expire_logs_seconds=10800 (3 hrs):

MariaDB logs

2023-09-08 23:00:02 13254 [Note] Start binlog_dump to slave_server(3), pos(, 0), using_gtid(1), gtid('1-1-341797714')
2023-09-08 23:00:03 13255 [Note] Start binlog_dump to slave_server(2), pos(, 0), using_gtid(1), gtid('1-1-341693771')
2023-09-08 23:00:04 13256 [Note] Start binlog_dump to slave_server(1), pos(, 0), using_gtid(1), gtid('')
2023-09-08 23:00:07 13257 [Note] Start binlog_dump to slave_server(3), pos(, 0), using_gtid(1), gtid('1-1-341797714')
2023-09-08 23:00:08 13258 [Note] Start binlog_dump to slave_server(2), pos(, 0), using_gtid(1), gtid('1-1-341693771')
2023-09-08 23:00:09 13259 [Note] Start binlog_dump to slave_server(1), pos(, 0), using_gtid(1), gtid('')
2023-09-08 23:00:12 13260 [Note] Start binlog_dump to slave_server(3), pos(, 0), using_gtid(1), gtid('1-1-341797714')
2023-09-08 23:00:13 13261 [Note] Start binlog_dump to slave_server(2), pos(, 0), using_gtid(1), gtid('1-1-341693771')
2023-09-08 23:00:14 13262 [Note] Start binlog_dump to slave_server(1), pos(, 0), using_gtid(1), gtid('')
2023-09-08 23:00:17 13263 [Note] Start binlog_dump to slave_server(3), pos(, 0), using_gtid(1), gtid('1-1-341797714')
2023-09-08 23:00:18 13264 [Note] Start binlog_dump to slave_server(2), pos(, 0), using_gtid(1), gtid('1-1-341693771')
2023-09-08 23:00:19 13265 [Note] Start binlog_dump to slave_server(1), pos(, 0), using_gtid(1), gtid('')
2023-09-08 23:00:22 13266 [Note] Start binlog_dump to slave_server(3), pos(, 0), using_gtid(1), gtid('1-1-341797714')
2023-09-08 23:00:23 13267 [Note] Start binlog_dump to slave_server(2), pos(, 0), using_gtid(1), gtid('1-1-341693771')
2023-09-08 23:00:24 13268 [Note] Start binlog_dump to slave_server(1), pos(, 0), using_gtid(1), gtid('')
2023-09-08 23:00:27 13269 [Note] Start binlog_dump to slave_server(3), pos(, 0), using_gtid(1), gtid('1-1-341797714')
2023-09-08 23:00:28 13271 [Note] Start binlog_dump to slave_server(2), pos(, 0), using_gtid(1), gtid('1-1-341693771')

Maxscale logs

2023-09-08 23:00:04 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-08 23:00:07 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-08 23:00:08 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-08 23:00:09 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.

We see the following repeated messages after DB02 reboots and rejoins the cluster:

Maxscale logs

2023-09-09 11:47:34 notice : 'viexh-session-usage-mdb-02' sent version string '10.7.1-MariaDB-log'. Detected type: 'MariaDB', version: 10.7.1.
2023-09-09 11:47:34 notice : Server 'viexh-session-usage-mdb-02' charset: latin1_swedish_ci
2023-09-09 11:47:34 warning: [galeramon] Invalid 'wsrep_local_index' on server 'viexh-session-usage-mdb-02': 18446744073709551615
2023-09-09 11:47:34 notice : Server changed state: viexh-session-usage-mdb-02[10.195.241.80:3306]: server_up. [Down] -> [Running]
2023-09-09 11:47:37 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:47:37 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:47:42 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:47:42 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:47:47 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:47:47 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:47:52 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:47:52 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:47:57 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:47:57 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:02 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:02 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:07 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:07 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:12 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:12 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:17 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:17 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:22 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:22 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:25 notice : Server changed state: viexh-session-usage-mdb-02[10.195.241.80:3306]: new_slave. [Running] -> [Slave, Synced, Running]
2023-09-09 11:48:27 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:27 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:32 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:32 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:37 notice : Started replicating from [10.195.241.79]:3306 at GTID ''
2023-09-09 11:48:37 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged.
2023-09-09 11:48:42 notice : Started replicating from [10.195.241.79]:3306 at GTID ''

Thanks.

Comment by markus makela [ 2023-09-11 ]

Those errors are related to the kafkacdc service and the fact that you have commented out the gtid parameter. If no locally known GTID position is present, the router starts from the beginning of the history (an empty GTID) in the hope that it'll be able to consume all events. Once it has processed some events, it knows which GTID to continue from.
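If the intent is to resume from a known position rather than replay from the start of the binlog history, the kafkacdc service's gtid parameter can be set explicitly. A minimal sketch; the broker and topic values are placeholders, and the parameter names should be verified against the kafkacdc documentation for your MaxScale version:

```ini
[Kafka-CDC]
type=service
router=kafkacdc
servers=db-01
user=maxscale
password=maxscale
# Kafka broker list and target topic (placeholders).
bootstrap_servers=kafka-01:9092
topic=cdc-events
# Start streaming from this GTID instead of the beginning of the history;
# if left unset, the router begins at an empty GTID ('') after losing its
# stored position. Value here is one of the GTIDs seen in the logs above.
gtid=1-1-341797714
```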

Generated at Thu Feb 08 04:30:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.