[MDEV-33356] Galera cluster down when one DB node rebooted when arbitrator on RHEL8 - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Incomplete
Affects Version/s: 10.6.16
Fix Version/s: N/A
Component/s: Galera, Galera Arbitrator garbd
Labels: None
Environment: Red Hat 8 on VMware

Description

Our environment consists of 3-node Galera clusters: 2 DB nodes + 1 arbitrator.

We started running the arbitrator on Red Hat 8 and found that rebooting one DB node brings the cluster down. This does not happen when the arbitrator runs on Red Hat 7.

DB node version : MariaDB 10.6.16 on redhat 7/8
arbitrator version : 26.4.14 or 26.4.16 on redhat 8

TCP interconnections as reported by netstat (Galera port 18301):

        ┌-- arbitrator <-┐
        V                |
    DB node 1  <---  DB node 2

DB node 1:

[root@t1vdbs-gcisdba-el8aa22-1-01 ~]# netstat -an | grep 18301 | grep "\.27:" | grep 18301 | grep ESTABLISHED

tcp        0      0 172.25.213.27:18301     172.25.223.27:41579     ESTABLISHED

tcp        0      0 172.25.213.27:18301     172.24.134.27:40817     ESTABLISHED

DB node 2:

[root@t2vdbs-gcisdba-el8aa22-2-01 errorlog]# netstat -an | grep 18301 | grep "\.27:" | grep 18301 | grep ESTABLISHED

tcp        0      0 172.25.223.27:60023     172.24.134.27:18301     ESTABLISHED

tcp        0      0 172.25.223.27:41579     172.25.213.27:18301     ESTABLISHED

arbitrator:

[si00chw@t1vdbs-gcissc-witness03d witness]$ netstat -an | grep 18301 | grep "\.27:" | grep 18301 | grep ESTABLISHED

tcp        0      0 172.24.134.27:18301     172.25.223.27:60023     ESTABLISHED

tcp        0      0 172.24.134.27:40817     172.25.213.27:18301     ESTABLISHED
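
The direction of each connection in the diagram can be read off the netstat output: the side whose local port is 18301 accepted the connection, and the side whose remote port is 18301 initiated it. A minimal sketch of that reading, using the two lines captured on DB node 2; the `classify` helper is hypothetical and assumes the standard `netstat -an` column layout shown above:

```python
GALERA_PORT = 18301  # Galera group communication port used in this report


def classify(line, port=GALERA_PORT):
    """Return (direction, local_ip, remote_ip) for one netstat TCP line."""
    fields = line.split()
    # Columns: proto, recv-q, send-q, local address, foreign address, state
    local_ip, local_port = fields[3].rsplit(":", 1)
    remote_ip, remote_port = fields[4].rsplit(":", 1)
    if int(local_port) == port:
        return ("inbound", local_ip, remote_ip)    # this host accepted
    if int(remote_port) == port:
        return ("outbound", local_ip, remote_ip)   # this host initiated
    return ("unrelated", local_ip, remote_ip)


# The two ESTABLISHED lines captured on DB node 2
node2_lines = [
    "tcp        0      0 172.25.223.27:60023     172.24.134.27:18301     ESTABLISHED",
    "tcp        0      0 172.25.223.27:41579     172.25.213.27:18301     ESTABLISHED",
]

for line in node2_lines:
    print(classify(line))
```

Both connections on DB node 2 classify as outbound, which matches the diagram: node 2 initiated the connections to the arbitrator and to DB node 1.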

When we reboot the guest OS of DB node 1:

DB node 2 detects that node 1 is down (see attached file node2-errorlog.txt)
the arbitrator logs nothing about node 1 going down
DB node 2 becomes isolated and the DB cluster goes down
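
The isolation of DB node 2 is consistent with Galera's quorum rule: a partition stays primary only if it holds more than half of the voting weight of the previous membership. A minimal sketch of that arithmetic, assuming every member (including the arbitrator) carries the default weight of 1 and ignoring graceful departures; the function and member names are illustrative and not part of Galera:

```python
def has_quorum(reachable, previous_members):
    """Simplified majority check with every member weighted 1.

    A partition keeps quorum only if it contains strictly more than
    half of the members of the previous primary component.
    """
    return 2 * len(reachable) > len(previous_members)


members = {"node1", "node2", "arbitrator"}

# Expected behaviour: node 1 reboots, node 2 still sees the arbitrator.
print(has_quorum({"node2", "arbitrator"}, members))  # True

# Reported behaviour: node 2 loses both node 1 and the arbitrator.
print(has_quorum({"node2"}, members))  # False
```

Under these assumptions, node 2 can survive the reboot of node 1 only while its link to the arbitrator stays up, so the lost node 2 to arbitrator connection is the part that needs explaining.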

We used the OS "nc" command to investigate further.

nc from the arbitrator to DB node 2: stays "connected" when we reboot DB node 1.
nc from DB node 2 to the arbitrator: changes from "connected" to "connection refused" when we reboot DB node 1. The refusal is returned immediately rather than after a timeout, so we believe the firewall is open.
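
The distinction drawn above between an immediate "connection refused" and a timeout can also be checked with a small script run repeatedly during the reboot. This is only a sketch: the `probe` helper and its 3-second timeout are assumptions for illustration, not part of Galera or garbd.

```python
import socket


def probe(host, port, timeout=3.0):
    """Attempt one TCP connection and report how it ended."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The peer answered with a reset: nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # No answer at all: typical of packets being silently dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, for example "no route to host".
        return f"error: {exc}"
```

For example, calling `probe("172.24.134.27", 18301)` from DB node 2 once per second while DB node 1 reboots would show the moment the result changes from "connected" to "refused".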

Kindly advise what we can do to further troubleshoot this case.

Attachments

node2-errorlog.txt (14 kB, 2024-02-01 18:09)

Activity

Seppo Jaakola added a comment - 2024-07-04 09:15

The node 2 error log starts at 2024-01-31 18:31:34, and at that point the arbitrator already had connectivity issues. The first messages are:

2024-01-31 18:31:34 0 [Note] WSREP: declaring ea914695-81a4 at ssl://172.24.134.27:18301 stable
2024-01-31 18:31:34 0 [Note] WSREP: forgetting d4131753-91df (ssl://172.25.213.27:18301)

After that, node 2 loses its connections to both node 1 and the arbitrator, and cannot re-establish them before the end of the error log sample.

How is node 1 restarted: is it a full container/server reboot or just a MariaDB server restart?

The Galera release notes include this comment about installing on Red Hat 8:

In order to install Galera package on RHEL 8, MySQL and MariaDB modules need
to be disabled first with `dnf -y module disable mysql mariadb`.

Please check if this was carried out in your installation.

Does this problem always happen when node 1 or node 2 is restarted, or is it intermittent?

Please attach the related logs from node 1, node 2 and the arbitrator covering the complete period of the networking problems.

Jan Lindström added a comment - 2024-07-15 07:48

This could be a duplicate of MDEV-33495, which was fixed in Galera library 26.4.18.

People

Assignee: Seppo Jaakola

Reporter: William Wong

Votes: 0

Watchers: 5

Dates

Created: 2024-02-01 18:10

Updated: 2024-09-05 22:23

Resolved: 2024-08-16 06:43
