[MDEV-25166] Galera node is bootstrapping but it should not Created: 2021-03-16  Updated: 2021-12-23  Resolved: 2021-12-23

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.9
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Oli Sennhauser Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Debian 10, n.a.



 Description   

When I start mariadbd normally with

systemctl start mariadb

it forms its own cluster, i.e. it bootstraps, instead of joining the other 2 nodes of the cluster. This should NOT happen. Relevant config:

  # grep -r wsrep *
    mariadb.conf.d/60-galera.cnf:# See the examples of server wsrep.cnf files in /usr/share/mysql
    mariadb.conf.d/60-galera.cnf:wsrep_on = ON
    mariadb.conf.d/60-galera.cnf:wsrep_provider = /usr/lib/galera/libgalera_smm.so
    mariadb.conf.d/60-galera.cnf:wsrep_cluster_name = "MariaDB Galera Cluster"
    mariadb.conf.d/60-galera.cnf:wsrep_cluster_address = gcomm://192.168.56.103,192.168.56.133,192.168.56.134
    mariadb.conf.d/60-galera.cnf:wsrep_node_address = 192.168.56.133
    mariadb.conf.d/60-galera.cnf:wsrep_sst_method = rsync

I already had this 24 hours ago with a completely different cluster after upgrading to 10.5.9 (Ubuntu 20.04). This one was a fresh install. I have never seen this symptom before, so I assume it was introduced with 10.5.9. It does not seem too difficult to reproduce, but I do not know yet exactly how.

After rebooting the machine the problem disappeared by itself and the node joined the cluster. So I assume it has to do with the variables that wsrep_new_cluster sets or no longer removes.
I think it is related to a hang of wsrep_new_cluster that happened in an earlier attempt: after I killed wsrep_new_cluster, the variables were somehow still there (globally).

A pretty evil surprise, because it also happens on an already running system and the bug is introduced with the upgrade.

If you have a hint on how to reproduce it, I can try. I will keep this test system for a while...



 Comments   
Comment by Oli Sennhauser [ 2021-03-17 ]

I am getting closer to the problem. It happened again this morning:
If the node is bootstrapped, hangs, and I kill it (kill -9), a subsequent start leads to this situation. So the killed bootstrap somehow sets an env variable and the subsequent start thinks it should still bootstrap... A reboot of the machine solved the problem.

Comment by Jan Lindström (Inactive) [ 2021-12-23 ]

You need to clean up your environment before trying to start your node again.
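
A minimal sketch of what such a cleanup might look like. It rests on assumptions not stated in this report: that the bootstrap helper passes its flag to the unit through a systemd manager environment variable, and that this variable is named _WSREP_NEW_CLUSTER, as in the packaged MariaDB unit file. The function names are hypothetical. Verify the variable name against your own unit file before relying on it.

```shell
#!/bin/sh
# Sketch only. Assumption: the bootstrap helper sets the systemd manager
# environment variable _WSREP_NEW_CLUSTER and normally unsets it again after
# the service has started. If the helper is killed, the variable may remain.

# Reads NAME=value lines on stdin (the format printed by
# `systemctl show-environment`) and succeeds if the bootstrap
# variable is present with a non-empty value.
has_stale_bootstrap_flag() {
    grep -q '^_WSREP_NEW_CLUSTER=.'
}

# Removes the leftover variable so that the next normal start
# uses wsrep_cluster_address instead of bootstrapping.
clear_stale_bootstrap_flag() {
    if systemctl show-environment | has_stale_bootstrap_flag; then
        systemctl unset-environment _WSREP_NEW_CLUSTER
        echo "cleared _WSREP_NEW_CLUSTER"
    else
        echo "no stale bootstrap flag found"
    fi
}
```

If these assumptions hold, calling clear_stale_bootstrap_flag as root before "systemctl start mariadb" should have the same effect as the reboot described above, since the systemd manager environment does not survive a reboot.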
