Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Version: 11.2.2
Description
The Kubernetes operator for MariaDB provisions MariaDB clusters by creating containers one at a time, waiting for each node's `wsrep_ready` status variable to become ON before creating the next. This ensures that only one node attempts to join the cluster at any given time.
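To make the provisioning order concrete, here is a minimal sketch of the gating logic described above. It is not the operator's actual code: `start_node` and `read_wsrep_ready` are hypothetical callables standing in for "create the container" and "read the value returned by `SHOW STATUS LIKE 'wsrep_ready'` on that node", and the timeout values are arbitrary.

```python
import time
from typing import Callable, Iterable, List


def wait_until_ready(read_wsrep_ready: Callable[[], str],
                     timeout_s: float = 300.0,
                     poll_interval_s: float = 2.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the node reports wsrep_ready=ON or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        if read_wsrep_ready().strip().upper() == "ON":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_interval_s)


def provision_sequentially(nodes: Iterable[str],
                           start_node: Callable[[str], None],
                           read_wsrep_ready: Callable[[str], str],
                           **wait_kwargs) -> List[str]:
    """Start nodes one at a time; the next node starts only after the
    previous one is ready, so at most one node is joining at any moment."""
    started: List[str] = []
    for node in nodes:
        start_node(node)
        # Hypothetical readiness probe for the node that was just started.
        if not wait_until_ready(lambda: read_wsrep_ready(node), **wait_kwargs):
            raise TimeoutError(f"{node} did not report wsrep_ready=ON")
        started.append(node)
    return started
```

With this gating, a node whose SST keeps failing blocks the nodes after it, which matches the behaviour reported here during bootstrap.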
When a new node joins the cluster, Galera performs an SST: it chooses an existing node as donor and transfers the donor's state to the new node in order to initialize it. We have seen this process fail intermittently, both when bootstrapping the cluster and when a node goes down. When it fails, Kubernetes restarts the unhealthy container and the new container retries the SST. This repeats until the container reaches a healthy state and the node is part of the cluster.
These are the configuration files for each node:
mariadb-galera-0
[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2

# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1

# Node configuration
wsrep_node_address="mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local"
wsrep_node_name="mariadb-galera-0"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"
mariadb-galera-1
[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2

# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1

# Node configuration
wsrep_node_address="mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local"
wsrep_node_name="mariadb-galera-1"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"
mariadb-galera-2
[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2

# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://mariadb-galera-0.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-1.mariadb-galera-internal.default.svc.cluster.local,mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1

# Node configuration
wsrep_node_address="mariadb-galera-2.mariadb-galera-internal.default.svc.cluster.local"
wsrep_node_name="mariadb-galera-2"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"
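The three files differ only in `wsrep_node_address` and `wsrep_node_name`. As an illustration, the sketch below renders the same configuration from a node index. The function name, its parameters and their defaults are assumptions made for this example and are not taken from the operator's source. The `<user>:<password>` placeholder is kept as-is.

```python
# Template shared by all nodes; only the two {node_*} fields vary per node.
NODE_TEMPLATE = """[mariadb]
bind-address=0.0.0.0
default_storage_engine=InnoDB
binlog_format=row
innodb_autoinc_lock_mode=2

# Cluster configuration
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_address="gcomm://{cluster_address}"
wsrep_cluster_name=mariadb-operator
wsrep_slave_threads=1

# Node configuration
wsrep_node_address="{node_address}"
wsrep_node_name="{node_name}"
wsrep_sst_method="mariabackup"
wsrep_sst_auth="<user>:<password>"
"""


def node_fqdn(index: int, name: str, service: str, namespace: str) -> str:
    """DNS name of one node, following the pattern used in the files above."""
    return f"{name}-{index}.{service}.{namespace}.svc.cluster.local"


def render_node_config(index: int,
                       replicas: int = 3,
                       name: str = "mariadb-galera",
                       service: str = "mariadb-galera-internal",
                       namespace: str = "default") -> str:
    """Render the configuration file for the node with the given index."""
    members = [node_fqdn(i, name, service, namespace) for i in range(replicas)]
    return NODE_TEMPLATE.format(
        cluster_address=",".join(members),
        node_address=members[index],
        node_name=f"{name}-{index}",
    )
```

Calling `render_node_config(1)` yields the `mariadb-galera-1` file shown above.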
It is important to note that the cluster address uses DNS names, which resolve to the IPs of the containers, and that a container gets a new IP assigned every time Kubernetes restarts it.
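When reproducing the issue, it can help to check what each name in the cluster address resolves to at a given moment, since the IPs change across restarts. The following is a small diagnostic sketch, not part of the operator. The helper names are made up for this example, and it assumes the address lists bare host names without ports, as in the files above.

```python
import socket
from typing import Callable, Dict, List, Optional


def parse_cluster_address(value: str) -> List[str]:
    """Extract the member host names from a gcomm:// cluster address."""
    value = value.strip().strip('"')
    prefix = "gcomm://"
    if not value.startswith(prefix):
        raise ValueError("expected an address starting with gcomm://")
    # Drop any provider options appended after '?'.
    hosts = value[len(prefix):].split("?", 1)[0]
    return [h.strip() for h in hosts.split(",") if h.strip()]


def resolve_members(hosts: List[str],
                    resolver: Callable[[str], str] = socket.gethostbyname
                    ) -> Dict[str, Optional[str]]:
    """Map each host name to its current IPv4 address, or None if the
    name does not resolve right now (e.g. while the container restarts)."""
    resolved: Dict[str, Optional[str]] = {}
    for host in hosts:
        try:
            resolved[host] = resolver(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            resolved[host] = None
    return resolved
```

Running `resolve_members(parse_cluster_address(...))` before and after a restart shows which members changed IP or were temporarily unresolvable.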
The log files attached to this issue show the crash happening after a node went down and kept attempting to rejoin the cluster. After roughly 30 minutes, it finally managed to join.
I also tried rsync instead of mariabackup as the SST method, but it did not help.