[MDEV-26869] Mariadb going to non-primary after one node leaves the cluster while doing host shutdown. Created: 2021-10-20  Updated: 2023-12-11

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.2.32, 10.4
Fix Version/s: 10.3

Type: Bug Priority: Major
Reporter: Jasvinder singh kwatra Assignee: Seppo Jaakola
Resolution: Unresolved Votes: 2
Labels: None

Attachments: File mariadb-server-0_ucp.pcap     Text File mariadb-server-0_ucp_mariadb-8585e9731ab17bd8f1a781791b2e68dc8c745ad46e1004c62d5a4f1e8aaf4aeb.log     File mariadb-server-1_ucp.pcap     Text File mariadb-server-1_ucp_mariadb-40e922811cd78c0b043e285e0b19c7433b547c039c53ec2afe2a3ff14f973a3f.log     File mariadb-server-2-config.rtf     Text File mariadb-server-2_ucp_mariadb-e300cc37746fd3f75a69d7b07d01b3703552208b3e9b78d8b80d25360a858015.log     File mariadb-server-config.rtf     File mariadb-server1-config.rtf    

 Description   

We are seeing this issue in a MariaDB Galera cluster deployed on Kubernetes. When the host running one of the Galera nodes loses power, the other two Galera nodes go non-primary. We have verified that the network between the nodes remaining in the cluster was stable on port 4567 throughout. After enabling debug logging, the only difference I could find between the runs where the cluster survives and the runs where it fails is that, in the failing runs, the install message is never exchanged between the nodes.
This is reproducible on both 10.2 and 10.4, with both Galera 3 and Galera 4.
In the good scenario, where the cluster survives, this message is exchanged:

2021-10-15T19:29:22.390596315Z stderr F 2021-10-15 19:29:22,390 - OpenStack-Helm Mariadb - INFO - b'2021-10-15 19:29:22 140341999027968 [Note] [Debug] WSREP: gcomm/src/pc_proto.cpp:handle_install():1103: cd32f6ad handle install from a4cb7bc7 pcmsg{ type=INSTALL, seq=0, flags= 0, node_map {\ta4cb7bc7,prim=1,un=0,last_seq=58,last_prim=view_id(PRIM,a4cb7bc7,19),to_seq=245997,weight=1,segment=0'

In the bad scenario, where the cluster goes non-primary, this message is logged instead:

2021-10-15T19:29:54.4610095Z stderr F 2021-10-15 19:29:54,460 - OpenStack-Helm Mariadb - INFO - b'2021-10-15 19:29:54 140712551864064 [Warning] WSREP: gcomm/src/evs_proto.cpp:handle_install_timer():690: evs::proto(be28c9b9, GATHER, view_id(REG,17423d3f,11)) install timer expired'
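
To compare many runs quickly, the two log excerpts above can be told apart mechanically. The following is only a rough sketch: the function name `classify_view_change` and its return labels are made up for illustration, and the match strings are simply substrings copied from the two excerpts in this report, not an exhaustive list of Galera messages.

```python
# Substrings copied from the two log excerpts in this report.
INSTALL_HANDLED = "handle_install()"       # pc_proto.cpp, good scenario
INSTALL_TIMEOUT = "install timer expired"  # evs_proto.cpp, bad scenario


def classify_view_change(lines):
    """Label a debug log as 'install-timeout', 'install-received' or 'unknown'.

    A timeout anywhere in the log wins, because that is the symptom
    reported for the runs that end up non-primary.
    """
    saw_install = False
    for line in lines:
        if INSTALL_TIMEOUT in line:
            return "install-timeout"
        if INSTALL_HANDLED in line and "type=INSTALL" in line:
            saw_install = True
    return "install-received" if saw_install else "unknown"


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python classify.py mariadb-server-0.log
    with open(sys.argv[1], errors="replace") as handle:
        print(classify_view_change(handle))
```

This only works on logs captured with Galera debug logging enabled, since the `handle_install()` line is a `[Debug]` message.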

If two of the three nodes survive, the cluster should stay primary. This issue is delaying our production readiness testing and defeats the purpose of clustering in the first place.
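
The expectation above rests on the usual majority rule: with equal weights, the surviving partition should stay primary when it holds more than half of the previous membership. A minimal sketch of that arithmetic follows. The helper name `has_majority` is hypothetical, and it assumes the lost node left without a graceful goodbye, as in a power failure.

```python
def has_majority(previous_size: int, surviving_size: int) -> bool:
    """True if the survivors are strictly more than half of the old view.

    Integer arithmetic avoids float rounding: s > p / 2  <=>  2 * s > p.
    """
    return 2 * surviving_size > previous_size


# Three-node cluster from this report, one node lost to a power failure.
print(has_majority(previous_size=3, surviving_size=2))  # True: 2 of 3
# For contrast, a two-node cluster losing one node has no majority.
print(has_majority(previous_size=2, surviving_size=1))  # False: 1 of 2
```

By this rule the two remaining nodes have quorum, which is why the non-primary state looks like a failure to complete the membership change rather than a genuine loss of quorum.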

I have attached the debug logs and packet captures (pcap), which support the conclusion that this was not a network issue.



 Comments   
Comment by Peter2121 [ 2022-10-12 ]

We are facing the same issue.
MariaDB 10.4.26, Galera 26.4.12 on FreeBSD 13.0.
The cluster was working correctly on MariaDB 10.3.35 and Galera 25.3.37.
On disconnect of any node we get "WSREP: no install message received" and cluster becomes NON-PRIMARY.

Comment by Prachi Jain [ 2023-02-09 ]

We are facing the same issue. We have 8 nodes in total, 4 in each datacenter; 7 of the 8 nodes have weight 1 and the 8th has weight 0. If one of the nodes in the standby DC goes down for a rebuild, the entire cluster drops to non-primary state. We see the same "install timer expired" error message reported by others above. We are on MariaDB 10.5.18 and Galera 3.
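
For the weighted topology described in this comment, the quorum arithmetic can be checked with a short sketch. It assumes the weighted quorum rule as given in the Galera documentation: the new component is primary when (sum of previous weights minus the weights of nodes that left gracefully) / 2 is less than the sum of the weights of the current members. The node names `node1` to `node8` are hypothetical, and the node is assumed to drop without a graceful leave.

```python
def weighted_quorum(weights, left_gracefully, current_members):
    """Apply the weighted quorum rule to one membership change.

    weights: node name -> weight for the previous primary component.
    left_gracefully: nodes known to have left cleanly.
    current_members: nodes in the new component.
    """
    previous = sum(weights.values())
    left = sum(weights[n] for n in left_gracefully)
    current = sum(weights[n] for n in current_members)
    return (previous - left) / 2 < current


# Seven nodes with weight 1 and one node with weight 0 (names are made up).
weights = {f"node{i}": 1 for i in range(1, 8)}
weights["node8"] = 0

# Drop each node in turn without a graceful leave and check the survivors.
for lost in weights:
    survivors = [n for n in weights if n != lost]
    print(lost, weighted_quorum(weights, [], survivors))
```

Under these assumptions every single-node loss leaves the survivors with a weight of at least 6 out of 7, well above the 3.5 threshold, so quorum arithmetic alone would not explain the non-primary state.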

Generated at Thu Feb 08 09:48:34 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.