MariaDB Server / MDEV-18210

Node does not join or rejoin the Galera cluster when it is the first IP in gcomm:// and there is a space after the comma


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 10.1.37, 10.2.21, 10.3.12
    • Fix Version/s: N/A
    • Component/s: Galera
    • Labels: None
    • Environment: Debian 9 and CentOS 7

    Description

      [galera]
      bind-address=0.0.0.0
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address='gcomm://1.2.3.4, 4.5.6.7, 7.8.9.10'
      wsrep_cluster_name='someclustername'
      wsrep_node_address='1.2.3.4'
      wsrep_node_name='somenodename'
      wsrep_sst_method=rsync
      wsrep_sst_donor='4.5.6.7'
       
      binlog_format=row
      default_storage_engine=InnoDB
      innodb_autoinc_lock_mode=2
      innodb_flush_log_at_trx_commit=0
      log-error=/var/log/mysqld.log
      

      When the config is as shown above, 1.2.3.4 fails to join the cluster. If I change the order in gcomm:// to gcomm://4.5.6.7, 1.2.3.4, 7.8.9.10 or gcomm://7.8.9.10, 4.5.6.7, 1.2.3.4, it joins or rejoins.
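
      For reference, the reordering is the only change I make; a sketch of the working variant, with everything else in the [galera] section above unchanged:

      wsrep_cluster_address='gcomm://7.8.9.10, 4.5.6.7, 1.2.3.4'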

      Removing wsrep_node_address and/or wsrep_node_name and/or wsrep_sst_donor does not make a difference.

      The resulting error in the log is as follows.

      [Note] WSREP: wsrep_sst_grab()
      [Note] WSREP: Start replication
      [Note] WSREP: Setting initial position to 159ff08a-120a-11e9-9f4c-020a73527abb:10
      [Note] WSREP: protonet asio version 0
      [Note] WSREP: Using CRC-32C for message checksums.
      [Note] WSREP: backend: asio
      [Note] WSREP: gcomm thread scheduling priority set to other:0
      [Warning] WSREP: access file(/var/lib/mysql//gvwstate.dat) failed(No such file or directory)
      [Note] WSREP: restore pc from disk failed
      [Note] WSREP: GMCast version 0
      [Warning] WSREP: Failed to resolve tcp:// 4.5.6.7:4567
      [Warning] WSREP: Failed to resolve tcp:// 7.8.9.10:4567
      [Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      [Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      [Note] WSREP: EVS version 0
      [Note] WSREP: gcomm: connecting to group 'somegroupname', peer '1.2.3.4:, 4.5.6.7:, 7.8.9.10:'
      [Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') connection established to b8a1af13 tcp://1.2.3.4:4567
      [Warning] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') address 'tcp://1.2.3.4:4567' points to own listening address, blacklisting
      [Note] WSREP: (b8a1af13, 'tcp://0.0.0.0:4567') connection to peer b8a1af13 with addr tcp://1.2.3.4:4567 timed out, no messages seen in PT3S
      [Warning] WSREP: no nodes coming from prim view, prim not possible
      .
      .
      .
      [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
      	 at gcomm/src/pc.cpp:connect():158
      [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():209: Failed to open backend connection: -110 (Connection timed out)
      [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1458: Failed to open channel 'someclustername' at 'gcomm://1.2.3.4, 4.5.6.7, 7.8.9.10': -110 (Connection timed out)
      [ERROR] WSREP: gcs connect failed: Connection timed out
      [ERROR] WSREP: wsrep::connect(gcomm://1.2.3.4, 4.5.6.7, 7.8.9.10) failed: 7
      

      These are all WAN IPs that are geographically separated. I can definitely connect to the other nodes from this node and vice versa, all verified by telnet 4.5.6.7 4567 etc. I have verified that the other 2 nodes are clustered and in SYNCED and PRIMARY state.
      There is no SELinux on any of the nodes. All 3 IPs trust each other, so no ports are blocked between them. I have also tested with no firewall at all as a sanity check. There is no NAT. These are standard KVM VMs with public IP addresses on the main interface. All I have to do is change the order of the IP addresses in gcomm://... and it starts working. I can then go to another node in the cluster and recreate the same problem there.

      I see this on v10.2 and v10.3 using the current stable releases, and I am pretty sure I saw it on v10.1 some time ago. The MariaDB documentation states that listing all of the cluster's IPs in gcomm://, including the local node's public IP, is the recommended config.

      So, just by changing the order of the IPs in gcomm:// and nothing else, I get the following successful log:

      [Warning] WSREP: Failed to resolve tcp:// 1.2.3.4:4567
      [Warning] WSREP: Failed to resolve tcp:// 4.5.6.7:4567
      [Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
      [Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
      [Note] WSREP: EVS version 0
      [Note] WSREP: gcomm: connecting to group 'someclustername', peer '7.8.9.10:, 1.2.3.4:, 4.5.6.7:'
      [Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') connection established to fa45e8bc tcp://7.8.9.10:4567
      [Note] WSREP: (53854fe3, 'tcp://0.0.0.0:4567') connection established to d30a7122 tcp://4.5.6.7:4567
      [Note] WSREP: declaring d30a7122 at tcp://4.5.6.7 stable
      [Note] WSREP: declaring fa45e8bc at tcp://7.8.9.10:4567 stable
      [Note] WSREP: Node d30a7122 state prim
      .
      .
      .
      [Note] WSREP: Synchronized with group, ready for connections
      

      It looks to me like it gives up trying to connect to the other nodes after the local node's IP is blacklisted (which is considered normal according to the documentation). So the way around this is to put the local node's public IP at the end of the gcomm:// list.
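
      The "Failed to resolve tcp:// 4.5.6.7:4567" warnings above (note the space after tcp://) also suggest that the space after each comma is carried into the peer address itself, so presumably writing the list without any spaces would avoid the problem as well. A sketch, assuming nothing else in the [galera] section changes:

      wsrep_cluster_address='gcomm://1.2.3.4,4.5.6.7,7.8.9.10'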


    People

      Assignee: Julius Goryavsky (sysprg)
      Reporter: sam stein (sman)
      Votes: 0
      Watchers: 1

