[MDEV-33356] Galera cluster down when one DB node rebooted when arbitrator on RHEL8 Created: 2024-02-01  Updated: 2024-02-02

Status: Open
Project: MariaDB Server
Component/s: Galera, Galera Arbitrator garbd
Affects Version/s: 10.6.16
Fix Version/s: None

Type: Bug Priority: Major
Reporter: William Wong Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Redhat 8 on VMware


Attachments: Text File node2-errorlog.txt    

 Description   

Our environment is a 3-node Galera cluster: 2 DB nodes + 1 arbitrator.

We started running the arbitrator on Red Hat 8. Since then, rebooting one DB node brings the whole cluster down. This does not happen when the arbitrator runs on Red Hat 7.

DB node version: MariaDB 10.6.16 on Red Hat 7/8
Arbitrator version: 26.4.14 or 26.4.16 on Red Hat 8

TCP connections between the nodes, as shown by netstat (Galera port 18301):

    ┌-- arbitrator <-┐
    V                |
DB node 1  <---  DB node 2 
 
DB node 1:
[root@t1vdbs-gcisdba-el8aa22-1-01 ~]# netstat -an | grep 18301 | grep "\.27:" | grep 18301 | grep ESTABLISHED
tcp        0      0 172.25.213.27:18301     172.25.223.27:41579     ESTABLISHED
tcp        0      0 172.25.213.27:18301     172.24.134.27:40817     ESTABLISHED
 
DB node 2:
[root@t2vdbs-gcisdba-el8aa22-2-01 errorlog]# netstat -an | grep 18301 | grep "\.27:" | grep 18301 | grep ESTABLISHED
tcp        0      0 172.25.223.27:60023     172.24.134.27:18301     ESTABLISHED
tcp        0      0 172.25.223.27:41579     172.25.213.27:18301     ESTABLISHED
 
arbitrator:
[si00chw@t1vdbs-gcissc-witness03d witness]$ netstat -an | grep 18301 | grep "\.27:" | grep 18301 | grep ESTABLISHED
tcp        0      0 172.24.134.27:18301     172.25.223.27:60023     ESTABLISHED
tcp        0      0 172.24.134.27:40817     172.25.213.27:18301     ESTABLISHED
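
The direction of each connection in the diagram can be derived from the netstat output above. The following is a minimal sketch, assuming that the side using port 18301 is the listener and the side using an ephemeral port is the initiator. The function name `connection_directions` and the `SAMPLE` constant are illustrative only; the sample lines are copied from the three outputs above.

```python
def connection_directions(netstat_text, port=18301):
    """Return a sorted list of (initiator_ip, listener_ip) pairs.

    Assumption: the endpoint on `port` is the listener, the other
    endpoint (ephemeral port) is the side that opened the connection.
    """
    pairs = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expected columns: proto, recv-q, send-q, local, remote, state
        if len(fields) != 6 or fields[0] != "tcp" or fields[5] != "ESTABLISHED":
            continue
        local_ip, local_port = fields[3].rsplit(":", 1)
        remote_ip, remote_port = fields[4].rsplit(":", 1)
        if local_port == str(port) and remote_port != str(port):
            pairs.add((remote_ip, local_ip))   # remote side connected to us
        elif remote_port == str(port) and local_port != str(port):
            pairs.add((local_ip, remote_ip))   # we connected to the remote side
    return sorted(pairs)


# netstat lines from DB node 1, DB node 2 and the arbitrator, as reported above
SAMPLE = """\
tcp        0      0 172.25.213.27:18301     172.25.223.27:41579     ESTABLISHED
tcp        0      0 172.25.213.27:18301     172.24.134.27:40817     ESTABLISHED
tcp        0      0 172.25.223.27:60023     172.24.134.27:18301     ESTABLISHED
tcp        0      0 172.25.223.27:41579     172.25.213.27:18301     ESTABLISHED
tcp        0      0 172.24.134.27:18301     172.25.223.27:60023     ESTABLISHED
tcp        0      0 172.24.134.27:40817     172.25.213.27:18301     ESTABLISHED
"""
```

With 172.25.213.27 as DB node 1, 172.25.223.27 as DB node 2 and 172.24.134.27 as the arbitrator, `connection_directions(SAMPLE)` yields three distinct connections: arbitrator to DB node 1, DB node 2 to arbitrator, and DB node 2 to DB node 1, which matches the diagram.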

When we reboot the guest OS of DB node 1:

  • DB node 2 detects that node 1 is down (see attached file node2-errorlog.txt)
  • the arbitrator logs nothing about node 1 going down
  • DB node 2 becomes isolated and the DB cluster goes down
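
For reference, a simplified sketch of the majority rule that would explain why DB node 2 stops serving if it loses contact with both peers at the same time. This is an assumption about the cause, not a confirmed diagnosis. It assumes equal member weights and an ungraceful departure; the real Galera quorum calculation also accounts for `pc.weight` and for members that left cleanly. The function name `keeps_quorum` is hypothetical.

```python
def keeps_quorum(reachable_members, previous_members):
    """True if the reachable group is a strict majority of the
    previous primary component (equal weights assumed)."""
    return 2 * reachable_members > previous_members


# DB node 2 still sees the arbitrator: 2 of 3 members, majority kept.
with_arbitrator = keeps_quorum(2, 3)

# DB node 2 sees neither node 1 nor the arbitrator: 1 of 3, majority lost.
alone = keeps_quorum(1, 3)
```

Under these assumptions, the observed outage is what would be expected if DB node 2 lost its connection to the arbitrator at the moment DB node 1 went away.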

We used the OS "nc" command to investigate further.

  • nc from the arbitrator to DB node 2: stays "connected" while DB node 1 reboots.
  • nc from DB node 2 to the arbitrator: changes from "connected" to "connection refused" when we reboot DB node 1. The refusal is returned immediately rather than after a timeout, so we believe the firewall is not blocking the port.
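
To capture exactly when the state changes during the reboot, the nc checks above could be repeated once per second with timestamps. The following is a minimal sketch, assuming Python 3 is available on the hosts. The names `probe` and `watch` are illustrative only. Like nc, each probe opens and immediately closes a TCP connection to the target port.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Classify one TCP connection attempt.

    "refused" means the peer answered with a reset (nothing listening),
    "timeout" means no answer at all (e.g. packets dropped on the way).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


def watch(host, port, count=60, interval=1.0):
    """Print a timestamped probe result `count` times, `interval` seconds apart."""
    for _ in range(count):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host}:{port} {probe(host, port)}")
        time.sleep(interval)
```

For example, running `watch("172.24.134.27", 18301)` on DB node 2 while DB node 1 reboots would show the second at which "connected" turns into "refused", which can then be matched against the timestamps in the arbitrator log and node2-errorlog.txt.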

Kindly advise what we can do to further troubleshoot this case.

