Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6.14
Fix Version/s: None
Environment:
Kubernetes 1.26.9
docker.io/bitnami/mariadb-galera:10.6.14-debian-11-r0
bitnami helm galeracluster v7.0.1
3 node cluster
proxysql directing all write statements to one node
Description
I am running 27 Galera clusters on Kubernetes. Sporadically, I see an issue during rolling updates. Yesterday, for instance, I added two new labels to the StatefulSets and Galera pods of one cluster, which Kubernetes rolls out as a rolling update.
First, pod galeracluster-2 was restarted without problems; 40 seconds later it was in sync again.
Then pod galeracluster-1 was restarted. But at the point where the IST would normally happen, mysqld crashed with signal 11, and a full SST was started, taking 10 minutes.
Finally, galeracluster-0 was restarted and back in sync within 40 seconds.
The segfault on pod galeracluster-1 causes the pod to restart once more, but it then does not sync via IST; it uses an SST instead, which takes 10 minutes for this cluster. In some of my bigger clusters SSTs take up to an hour, which is quite annoying, so I would like to find out whether I can reduce the odds of an SST to a minimum. Imagine updating 27 Galera clusters and having to wait an hour every now and then. During my update session yesterday I had only one segmentation fault, but I have had sessions where 4-5 pods went into a full SST.
Unfortunately, I can't reproduce this behavior on demand. It just happens every now and then, on different clusters and in different pods.
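One generic way to lower the odds of a full SST after a restart (independent of fixing the segfault itself) is to enlarge the Galera write-set cache, so a rejoining node stays within the IST window longer, and to let the cache be recovered after a crash. A sketch, assuming the Helm chart allows passing extra my.cnf options; the size value is illustrative and should be tuned to your write volume:

```ini
# Illustrative my.cnf fragment for the Galera nodes.
# gcache.size bounds how far behind a rejoining node may be
# while still qualifying for IST instead of a full SST;
# gcache.recover=yes tries to reuse the cache after a crash.
[galera]
wsrep_provider_options="gcache.size=2G;gcache.recover=yes"
```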
The provided logfile was exported from Kibana, so you'll have to read it from the bottom up; however, rows from the same microsecond appear in the "correct" (chronological) order. This makes analyzing it a bit tricky.
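For what it's worth, the export can be restored to chronological order while preserving the intra-microsecond ordering by reversing timestamp groups rather than individual lines. A minimal sketch, assuming each line starts with a fixed-width timestamp prefix (the length is an assumption about the export format):

```python
import itertools

def reorder_kibana_export(lines, ts_len):
    """Kibana exports newest-first, but rows sharing the same
    microsecond are already oldest-first within their group.
    Reverse the groups, keeping the order inside each group.
    ts_len is the assumed fixed timestamp prefix length."""
    groups = [list(g) for _, g in
              itertools.groupby(lines, key=lambda l: l[:ts_len])]
    return [line for grp in reversed(groups) for line in grp]
```

A plain `tac` would not work here, because it would also flip the already-correct order of lines sharing the same microsecond.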
Please let me know if you require further information.