[MDEV-25851] Cluster on asymmetric IP addresses fails IST/SST Created: 2021-06-03 Updated: 2022-06-16 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.5.10 |
| Fix Version/s: | 10.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Zenon Panoussis | Assignee: | Seppo Jaakola |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Description |
|
I have a 3-node WAN cluster. The nodes are connected to each other by VPN without a central server, so as to avoid a single point of failure. Instead, each node is an OpenVPN server with one other node as its client. It follows that each VPN server uses its own 10.x subnet and each node has two tap interfaces. The result is this:
It works fine with other similar software (e.g. OpenLDAP multimaster and GlusterFS), but it confuses Galera because wsrep_node_address is a global that takes a single value. If unset, it will use the address of eth0, which would be wrong here. But if it is set to one of the node's two VPN endpoints, that value gets passed to all the other nodes, and then I have, e.g., node B trying to contact node C on 10.z.0.1, and the cluster breaks. It should be noted that this is a problem on the sender/donor side, not on the recipient/joiner side. In the same example, if node C needs syncing and node B is the most advanced, B will try to contact C on 10.z.0.1, which is permanently unreachable for B. Hence, setting wsrep_node_incoming_address is of no use. This affects all multihomed cluster nodes and defeats the benefits of multihoming. The simple solution would be to make wsrep_node_address(es) multi-value. |
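The reachability problem described above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the report does not say which node is the client of which server, so the ring below (B is client of A, C of B, A of C) and the `.2` client addresses are hypothetical; only 10.z.0.1 as an address of node C comes from the report. All names (`TUNNELS`, `can_reach`, etc.) are illustrative.

```python
# Hypothetical tunnels: (server, server_addr, client, client_addr).
# Who is whose client and the ".2" client addresses are assumptions.
TUNNELS = [
    ("A", "10.x.0.1", "B", "10.x.0.2"),
    ("B", "10.y.0.1", "C", "10.y.0.2"),
    ("C", "10.z.0.1", "A", "10.z.0.2"),
]
NODES = ["A", "B", "C"]


def addresses_of(node):
    """Return the two VPN addresses that `node` owns (one per tap interface)."""
    owned = []
    for server, server_addr, client, client_addr in TUNNELS:
        if node == server:
            owned.append(server_addr)
        if node == client:
            owned.append(client_addr)
    return owned


def can_reach(src, addr):
    """True if `addr` is the far end of a tunnel that `src` takes part in."""
    for server, server_addr, client, client_addr in TUNNELS:
        if src == server and addr == client_addr:
            return True
        if src == client and addr == server_addr:
            return True
    return False


def peers_reaching(node, addr):
    """List the other nodes that can reach `node` on `addr`."""
    return [p for p in NODES if p != node and can_reach(p, addr)]


# Every address a node owns is reachable from exactly one peer, so no
# single wsrep_node_address value can work for both other nodes.
for node in NODES:
    for addr in addresses_of(node):
        print(node, addr, "reachable from", peers_reaching(node, addr))
```

In this model, whichever of its two addresses node C advertises, one of the two peers cannot use it, which is the situation the report describes for B and 10.z.0.1.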
| Comments |
| Comment by Zenon Panoussis [ 2021-06-03 ] | ||
Alternatively, the peer addresses could be taken from wsrep_cluster_address instead of from wsrep_node_address. This would allow stating multiple addresses for the same node, without creating backward compatibility problems by changing wsrep_node_address to multi-value, and without requiring a new wsrep_node_addresses to work around those problems. TBH, I have struggled to understand why wsrep_node_address even exists, and have failed to come up with any valid reason. The obvious reason would be to limit Galera's binding to a single address, but this is not the case:
It binds on all interfaces anyway (this could possibly be a separate bug, or a misunderstanding of wsrep_node_address on my part). So then what exactly does wsrep_node_address control, that cannot be adequately controlled by wsrep_cluster_address? | ||
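As a rough illustration of the suggestion to take peer addresses from wsrep_cluster_address: each node's own value of that variable already lists its peers as seen from that node, so a per-node peer list could be derived from it. The sketch below only parses a `gcomm://` address list; the function name `parse_cluster_address` and the example value are hypothetical, and this is not how Galera itself handles the variable.

```python
def parse_cluster_address(value):
    """Split a gcomm:// cluster address into its list of peer addresses.

    Anything after '?' (provider options) is ignored in this sketch.
    """
    scheme, sep, rest = value.partition("://")
    if scheme != "gcomm" or not sep:
        raise ValueError("expected a gcomm:// address, got %r" % value)
    hosts, _, _options = rest.partition("?")
    return [h.strip() for h in hosts.split(",") if h.strip()]


# Hypothetical value on node B: its peers A and C, as reachable from B.
peers_seen_from_b = parse_cluster_address("gcomm://10.x.0.1,10.y.0.2")
print(peers_seen_from_b)
```

Because the list is local to each node, node B could list C under the address on which B can reach it, while node A lists C under a different one.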
| Comment by Zenon Panoussis [ 2022-06-16 ] | ||
|
One more thing: the reason OpenLDAP multimaster and GlusterFS work with my setup is that they use hostnames instead of IP addresses. This makes it trivial to maintain a different /etc/hosts on each node, pointing to the correct address to use for each of the other two nodes. |
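The per-node /etc/hosts approach can be sketched as follows. The tunnel layout, the `.2` client addresses, and the hostnames `node-a`, `node-b`, `node-c` are all hypothetical; the point is only that the same hostname resolves to a different address depending on which node does the lookup.

```python
# Hypothetical tunnels: (server, server_addr, client, client_addr).
TUNNELS = [
    ("A", "10.x.0.1", "B", "10.x.0.2"),
    ("B", "10.y.0.1", "C", "10.y.0.2"),
    ("C", "10.z.0.1", "A", "10.z.0.2"),
]
# Hypothetical hostnames, identical on every node.
HOSTNAMES = {"A": "node-a", "B": "node-b", "C": "node-c"}


def hosts_entries(node):
    """Return /etc/hosts-style lines that `node` would use for its peers."""
    lines = []
    for server, server_addr, client, client_addr in TUNNELS:
        if node == client:
            # Reach the server on its end of the shared tunnel.
            lines.append(f"{server_addr}\t{HOSTNAMES[server]}")
        elif node == server:
            # Reach the client on its end of the shared tunnel.
            lines.append(f"{client_addr}\t{HOSTNAMES[client]}")
    return lines


for node in ("A", "B", "C"):
    print(f"# /etc/hosts fragment on node {node}")
    print("\n".join(hosts_entries(node)))
```

Here `node-c` maps to 10.z.0.1 on node A but to 10.y.0.2 on node B, which is exactly what a single global IP address cannot express.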