[MDEV-33349] Crash when a new node attempts to join the Galera cluster - Jira

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 11.2.2
Fix Version/s: 11.4
Component/s: Galera SST
Labels:
- galera

Description

The Kubernetes operator for MariaDB provisions MariaDB clusters by creating containers one at a time, waiting until the `wsrep_ready` status variable reports `ON` on each node before creating the next one. This ensures that only one node attempts to join the cluster at any given time.
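The one-at-a-time provisioning described above can be sketched roughly as follows. This is not the operator's actual code: `provision_sequentially`, `start_node`, and `is_ready` are hypothetical names, and the readiness check is assumed to wrap a query such as `SHOW GLOBAL STATUS LIKE 'wsrep_ready'` against the node.

```python
import time
from typing import Callable, Iterable, List


def provision_sequentially(
    nodes: Iterable[str],
    start_node: Callable[[str], None],
    is_ready: Callable[[str], bool],
    timeout_s: float = 300.0,
    poll_interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Start nodes one at a time, waiting for each to report ready.

    `start_node` and `is_ready` are hypothetical hooks (assumptions): the
    first would create the container, the second would check that the
    node's wsrep_ready status variable is ON.
    """
    joined: List[str] = []
    for node in nodes:
        start_node(node)
        deadline = clock() + timeout_s
        # Block here so that only one node is joining at any given time.
        while not is_ready(node):
            if clock() >= deadline:
                raise TimeoutError(f"{node} was not ready after {timeout_s}s")
            sleep(poll_interval_s)
        joined.append(node)
    return joined
```

Passing `sleep` and `clock` as parameters is only a convenience for exercising the loop without real delays.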

When a new node joins the cluster, Galera performs an SST: it chooses an existing node as the donor and transfers the donor's state to the new node in order to initialize it. We have seen this process fail intermittently, both when bootstrapping the cluster and when a node goes down. When it fails, Kubernetes restarts the unhealthy container and creates a new one, which means the SST is retried. This repeats until the container reaches a healthy state and the node is part of the cluster.

Here are the configuration files for each node:

mariadb-galera-0

[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2
# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1
# Node configuration
wsrep_node_address="mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local"
wsrep_node_name="mariadb-galera-0"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"

mariadb-galera-1

[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2
# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1
# Node configuration
wsrep_node_address="mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local"
wsrep_node_name="mariadb-galera-1"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"

mariadb-galera-2

[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2
# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1
# Node configuration
wsrep_node_address="mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_node_name="mariadb-galera-2"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"
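The three files above differ only in `wsrep_node_address` and `wsrep_node_name`. As an illustration of that, here is a minimal sketch that renders the same settings from a node index. The helper names `node_fqdn` and `render_config` are hypothetical and are not how the operator necessarily generates its configuration; the values are copied from the files above.

```python
from typing import List

# Values copied from the configuration files in this report.
BASE_NAME = "mariadb-galera"
SERVICE_DOMAIN = "mariadb-galera-internal.default.svc.cluster.local"


def node_fqdn(index: int) -> str:
    """DNS name of the node with the given ordinal index."""
    return f"{BASE_NAME}-{index}.{SERVICE_DOMAIN}"


def render_config(index: int, node_count: int = 3) -> str:
    """Render the [mariadb] section for one node (illustrative helper)."""
    members = ",".join(node_fqdn(i) for i in range(node_count))
    lines: List[str] = [
        "[mariadb]",
        "bind-address=0.0.0.0",
        "default_storage_engine=InnoDB",
        "binlog_format=row",
        "innodb_autoinc_lock_mode=2",
        "# Cluster configuration (identical on every node)",
        "wsrep_on=ON",
        "wsrep_provider=/usr/lib/galera/libgalera_smm.so",
        f'wsrep_cluster_address="gcomm://{members}"',
        "wsrep_cluster_name=mariadb-operator",
        "wsrep_slave_threads=1",
        "# Node configuration (the only per-node part)",
        f'wsrep_node_address="{node_fqdn(index)}"',
        f'wsrep_node_name="{BASE_NAME}-{index}"',
        'wsrep_sst_method="mariabackup"',
        'wsrep_sst_auth="<user>:<password>"',
    ]
    return "\n".join(lines) + "\n"
```

Calling `render_config(0)`, `render_config(1)`, and `render_config(2)` yields the settings for the three nodes listed above.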

Note that the cluster address uses DNS names, which resolve to the IP addresses of the containers, and that a container is assigned a new IP every time Kubernetes restarts it.
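To check whether a restart changed the address behind a DNS name, something like the following sketch could be run between join attempts. It is a diagnostic aid only, not part of Galera or the operator: `parse_cluster_address`, `resolve_ipv4`, and `changed_hosts` are hypothetical helpers, and the parser assumes bare hostnames without ports or options, as in the configuration above.

```python
import socket
from typing import Callable, Dict, List


def parse_cluster_address(address: str) -> List[str]:
    """Split a gcomm:// address into hostnames.

    Assumes bare hostnames (no ports or options), as in this report.
    """
    prefix = "gcomm://"
    if not address.startswith(prefix):
        raise ValueError(f"not a gcomm address: {address!r}")
    return [h.strip() for h in address[len(prefix):].split(",") if h.strip()]


def resolve_ipv4(host: str) -> List[str]:
    """Return the sorted IPv4 addresses that the DNS name resolves to."""
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def changed_hosts(
    previous: Dict[str, List[str]],
    hosts: List[str],
    resolve: Callable[[str], List[str]] = resolve_ipv4,
) -> Dict[str, List[str]]:
    """Return hosts whose resolved addresses differ from `previous`."""
    changed: Dict[str, List[str]] = {}
    for host in hosts:
        current = resolve(host)
        if previous.get(host) != current:
            changed[host] = current
    return changed
```

Comparing snapshots taken before and after a container restart shows which names now point at a different IP.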

The log files attached to this issue show the crash occurring after a node went down and then repeatedly attempted to rejoin the cluster. After roughly 30 minutes, the node finally managed to join.

I also tried using rsync instead of mariabackup as the SST method, but it did not help.

Attachments

- mariadb-galera-0-no-k8s-probes.log (71 kB, 2024-02-05 10:11)
- mariadb-galera-1-donor.log (2 kB, 2024-02-02 13:36)
- mariadb-galera-1-no-k8s-probes.log (72 kB, 2024-02-05 10:11)
- mariadb-galera-2.log (18 kB, 2024-02-01 09:32)
- mariadb-galera-2-joiner.log (0.8 kB, 2024-02-02 13:36)
- mariadb-galera-2-no-k8s-probes.log (18 kB, 2024-02-05 10:11)
- mariadb-galera-2-recovered.log (22 kB, 2024-02-01 09:32)
- Screenshot from 2024-02-05 11-07-25.png (103 kB, 2024-02-05 10:13)

People

Assignee: Seppo Jaakola

Reporter: Martin Montes

Votes: 0

Watchers: 6

Dates

Created: 2024-02-01 09:16

Updated: 2025-10-02 18:36
