[MXS-4738] The fact that disable_master_failback does not work with root_node_as_master is not documented Created: 2023-09-04 Updated: 2023-09-11 Resolved: 2023-09-11 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Documentation |
| Affects Version/s: | 23.02.3 |
| Fix Version/s: | 2.5.29, 6.4.11, 22.08.9, 23.02.5 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Presnickety | Assignee: | markus makela |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Environment: |
RHEL v8.2 |
||
| Attachments: |
|
| Description |
|
The fact that disable_master_failback goes against what root_node_as_master tries to accomplish is not documented, and the two parameters can be used together without any warnings. Original title: Maxscale | Multiple instances in a cluster go out of sync. Original description: We have multiple instances of MaxScale running in the same DB cluster. Occasionally the maxctrl administrative client reports an inconsistency in states amongst the nodes, e.g. MaxScale #1 reports server #1 as the Master, while MaxScale #2 & #3 report server #2 as Master, etc. Could you suggest the steps to correct these inconsistencies? Thanks. |
| Comments |
| Comment by markus makela [ 2023-09-04 ] |
|
It's probably because of disable_master_failback=true in the configuration; it shouldn't be used together with root_node_as_master=true. I think that the root_node_as_master=true parameter could forbid the use of disable_master_failback=true, as they're aimed at different goals: root_node_as_master is used to keep a consistent view of the cluster across multiple MaxScale instances, whereas disable_master_failback is used to minimize the amount of changes in the node states, i.e. don't move the Master label unless the node goes down. For now I'll update the documentation to clearly state that while the two can be used together, it is generally not a good idea to do so. |
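|
The conflicting combination described above might look like the following Galera monitor section; this is a hedged sketch, with section, server, and credential names as placeholders:

```ini
# Hypothetical galeramon monitor section; names and credentials are placeholders.
[Galera-Monitor]
type=monitor
module=galeramon
servers=server1,server2,server3
user=maxuser
password=maxpwd
# Aims for a consistent Master choice across all MaxScale instances...
root_node_as_master=true
# ...while this keeps the Master label on the current node until it goes
# down, which works against the setting above.
disable_master_failback=true
```
|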
| Comment by Presnickety [ 2023-09-04 ] |
|
Hello Markus, |
| Comment by markus makela [ 2023-09-04 ] |
|
If you want the MaxScale instances to always pick the same node, you should disable disable_master_failback instead of root_node_as_master. |
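|
Following that suggestion, the monitor section would keep root_node_as_master enabled and leave disable_master_failback at its default of false; a sketch with placeholder names:

```ini
# Hypothetical galeramon monitor section; names and credentials are placeholders.
[Galera-Monitor]
type=monitor
module=galeramon
servers=server1,server2,server3
user=maxuser
password=maxpwd
# Every MaxScale instance derives the Master label from the same cluster
# view, so all instances pick the same node.
root_node_as_master=true
disable_master_failback=false
```
|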
| Comment by Presnickety [ 2023-09-04 ] |
|
Hi Markus, Yes, we will try this. After the previous reconfiguration we got the same issue. Please see attached pics. MaxScale also threw the following message across the three nodes: 2023-09-04 22:30:27 notice : Started replicating from [10.195.241.81]:3306 at GTID '1-1-79197812' The message disappeared after restarting all 3 load balancers. Thanks. |
| Comment by markus makela [ 2023-09-05 ] |
|
Looking at the server status, it looks like disable_master_failback was still in use. The Master Stickiness label is only shown if that parameter is enabled. |
| Comment by Presnickety [ 2023-09-06 ] |
|
Hi Markus, Yes, we've set disable_master_failback=false. We still see connection timeouts & CPU soft lockups, but so far the load balancers have not crashed. Increased Galera replication timeouts have helped; nodes have not dropped from the cluster as yet. [Galera-Monitor-VIEXH] [Read-Write-Split] Thanks. |
| Comment by Presnickety [ 2023-09-09 ] |
|
Hi Markus, Failover tests with the write master are showing much better results, and the apps are experiencing fewer connection problems; however, we sometimes see the following messages appear repeatedly in the logs once a node that has left the cluster rejoins it. Could you suggest where the GTID values below come from if they are not available in the binlogs? The messages appear even if a node has been out of the cluster for only a few minutes, with binlog_expire_logs_seconds=10800 (3 hrs); MariaDB logs 2023-09-08 23:00:02 13254 [Note] Start binlog_dump to slave_server(3), pos(, 0), using_gtid(1), gtid('1-1-341797714') Maxscale logs 2023-09-08 23:00:04 error : Failed to read replicated event: 1236, Could not find GTID state requested by slave in any binlog files. Probably the slave state is too old and required binlog files have been purged. We see the following repeated messages after DB02 reboots and rejoins the cluster; Maxscale logs 2023-09-09 11:47:34 notice : 'viexh-session-usage-mdb-02' sent version string '10.7.1-MariaDB-log'. Detected type: 'MariaDB', version: 10.7.1. Thanks. |
| Comment by markus makela [ 2023-09-11 ] |
|
Those errors are related to the kafkacdc service and the fact that you have commented out the gtid parameter. If no locally known GTID position is present, the router starts from the beginning of the history (an empty GTID) in the hopes that it'll be able to consume all events. Once it has processed some events, it'll know which GTID to continue from. |
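|
A kafkacdc service with an explicit gtid parameter might look like the sketch below; all names, addresses, and the GTID value are placeholders, not taken from the reporter's configuration:

```ini
# Hypothetical kafkacdc service section; names, addresses, and the GTID
# value are placeholders for illustration only.
[Kafka-CDC]
type=service
router=kafkacdc
servers=server1
user=maxuser
password=maxpwd
bootstrap_servers=kafka1:9092
topic=my-cdc-topic
# With gtid set (and not commented out), the router starts from this
# position instead of attempting to replay the entire binlog history.
gtid=1-2-3
```
|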