Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not a Bug
Affects Version/s: 10.1.37, 10.2.21, 10.3.12
Fix Version/s: None
Environment: Debian 9 and CentOS 7
Description
[galera]
bind-address=0.0.0.0
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address='gcomm://1.2.3.4, 4.5.6.7, 7.8.9.10'
wsrep_cluster_name='someclustername'
wsrep_node_address='1.2.3.4'
wsrep_node_name='somenodename'
wsrep_sst_method=rsync
wsrep_sst_donor='4.5.6.7'

binlog_format=row
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
log-error=/var/log/mysqld.log
When the config is as shown above, 1.2.3.4 will fail to join the cluster. If I change the order of the gcomm:// list to gcomm://4.5.6.7, 1.2.3.4, 7.8.9.10 or gcomm://7.8.9.10, 4.5.6.7, 1.2.3.4, it will join or rejoin.
Removing wsrep_node_address and/or wsrep_node_name and/or wsrep_sst_donor does not make a difference.
The resulting error in the log is as follows:
[Note] WSREP: wsrep_sst_grab()
[Note] WSREP: Start replication
[Note] WSREP: Setting initial position to 159ff08a-120a-11e9-9f4c-020a73527abb:10
[Note] WSREP: protonet asio version 0
[Note] WSREP: Using CRC-32C for message checksums.
[Note] WSREP: backend: asio
[Note] WSREP: gcomm thread scheduling priority set to other:0
[Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
[Note] WSREP: restore pc from disk failed
[Note] WSREP: GMCast version 0
[Warning] WSREP: Failed to resolve tcp:// 4.5.6.7:4567
[Warning] WSREP: Failed to resolve tcp:// 7.8.9.10:4567
[Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
[Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
[Note] WSREP: EVS version 0
[Note] WSREP: gcomm: connecting to group 'somegroupname', peer '1.2.3.4:, 4.5.6.7:, 7.8.9.10:'
[Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') connection established to b8a1af13 tcp://1.2.3.4:4567
[Warning] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') address 'tcp://1.2.3.4:4567' points to own listening address, blacklisting
[Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') connection to peer b8a1af13 with addr tcp://1.2.3.4:4567 timed out, no messages seen in PT3S
[Warning] WSREP: no nodes coming from prim view, prim not possible
...
[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
	 at gcomm/src/pc.cpp:connect():158
[ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
[ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'someclustername' at 'gcomm://1.2.3.4, 4.5.6.7, 7.8.9.10': -110 (Connection timed out)
[ERROR] WSREP: gcs connect failed: Connection timed out
[ERROR] WSREP: wsrep::connect(gcomm://1.2.3.4, 4.5.6.7, 7.8.9.10) failed: 7
These are all WAN IPs that are geographically separated. I can definitely connect to the other nodes from this node and vice versa, all verified with telnet 4.5.6.7 4567 etc. I have verified that the other two nodes are clustered and in SYNCED and PRIMARY state.
There is no SELinux on any node. All three IPs trust each other, so no ports are blocked between them. I have also tested with no firewall at all as a sanity check. There is no NAT. These are standard KVM VMs with public IP addresses on the main interface. All I have to do is change the order of the IP addresses in gcomm://... and it starts working. I can then go to another node in the cluster and recreate the same problem there.
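For example, on node 1.2.3.4 the only change that makes it join is the peer order in the cluster address (same three addresses, with the local node no longer listed first); the ordering below is one such working variant, not a literal copy of my config:

```ini
[galera]
# Reordered so the local node's own IP (1.2.3.4) is not first in the list.
wsrep_cluster_address='gcomm://7.8.9.10, 4.5.6.7, 1.2.3.4'
```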
I see this on v10.2 and v10.3 using the current stable releases. I am pretty sure I saw it happening on v10.1 some time ago. The MariaDB documentation states that listing all of the cluster's IPs in the gcomm:// address, including the local node's public IP, is the recommended configuration.
So just by changing the order of the IPs in gcomm:// and nothing else, I get the following successful log:
[Warning] WSREP: Failed to resolve tcp:// 1.2.3.4:4567
[Warning] WSREP: Failed to resolve tcp:// 4.5.6.7:4567
[Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
[Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
[Note] WSREP: EVS version 0
[Note] WSREP: gcomm: connecting to group 'someclustername', peer '7.8.9.10:, 1.2.3.4:, 4.5.6.7:'
[Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') connection established to fa45e8bc tcp://7.8.9.10:4567
[Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') connection established to d30a7122 tcp://4.5.6.7:4567
[Note] WSREP: declaring d30a7122 at tcp://4.5.6.7 stable
[Note] WSREP: declaring fa45e8bc at tcp://7.8.9.10:4567 stable
[Note] WSREP: Node d30a7122 state prim
...
[Note] WSREP: Synchronized with group, ready for connections
It looks to me like it gives up trying to connect to the other nodes after the local node's IP is blacklisted (which is considered normal according to the documentation). So the workaround is to put the local node's public IP at the end of the gcomm:// list.
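The workaround can be sketched as a small helper (hypothetical code, not part of MariaDB or Galera tooling) that rewrites a gcomm:// URL so the local node's own address is listed last; it also drops the whitespace after commas seen in my original config, on the assumption that a space-free list is the safer form:

```python
def local_node_last(gcomm_url, local_ip):
    """Reorder a gcomm:// peer list so local_ip appears last.

    Hypothetical helper illustrating the workaround from this report.
    Also strips whitespace around each peer address.
    """
    prefix = "gcomm://"
    if not gcomm_url.startswith(prefix):
        raise ValueError("expected a gcomm:// URL")
    peers = [p.strip() for p in gcomm_url[len(prefix):].split(",") if p.strip()]
    # Keep the relative order of the other peers; move the local IP to the end.
    others = [p for p in peers if p != local_ip]
    mine = [p for p in peers if p == local_ip]
    return prefix + ",".join(others + mine)

print(local_node_last("gcomm://1.2.3.4, 4.5.6.7, 7.8.9.10", "1.2.3.4"))
# gcomm://4.5.6.7,7.8.9.10,1.2.3.4
```

Running the same helper on each node (with that node's own IP as `local_ip`) would give every node a list where its own address, which gets blacklisted anyway, is tried last.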