Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.2.32, 10.4 (EOL)
Fix Version/s: None
Description
We are facing an issue with a MariaDB Galera cluster deployed in Kubernetes. When the node hosting one of the Galera members loses power, the other two Galera nodes go non-Primary. We have verified that the network between the surviving nodes remained stable on port 4567. After enabling the debug logs, the only difference I could see between the runs where this does not happen and the runs where it does is that the install message is never exchanged between the nodes.
This is reproducible in both 10.2 and 10.4, with both Galera 3 and Galera 4.
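For anyone trying to reproduce this, one way to turn on the wsrep/provider debug output seen in the log lines below is roughly the following (a sketch only; wsrep_debug takes ON/OFF in 10.2 but a level such as SERVER in recent 10.4, and the provider debug option may need to go into the config file instead on some versions):

SET GLOBAL wsrep_debug = ON;                        -- 10.2 boolean form
-- SET GLOBAL wsrep_debug = 'SERVER';               -- 10.4.3+ takes a level instead
SET GLOBAL wsrep_provider_options = 'debug=yes';    -- Galera provider (gcomm/evs/pc) debug messages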
In the good scenario, where the cluster did not die, this message was exchanged:
2021-10-15T19:29:22.390596315Z stderr F 2021-10-15 19:29:22,390 - OpenStack-Helm Mariadb - INFO - b'2021-10-15 19:29:22 140341999027968 [Note] [Debug] WSREP: gcomm/src/pc_proto.cpp:handle_install():1103: cd32f6ad handle install from a4cb7bc7 pcmsg{ type=INSTALL, seq=0, flags= 0, node_map {\ta4cb7bc7,prim=1,un=0,last_seq=58,last_prim=view_id(PRIM,a4cb7bc7,19),to_seq=245997,weight=1,segment=0'
In the bad scenario, the install message is missing and only the install timer expiry is logged:
2021-10-15T19:29:54.4610095Z stderr F 2021-10-15 19:29:54,460 - OpenStack-Helm Mariadb - INFO - b'2021-10-15 19:29:54 140712551864064 [Warning] WSREP: gcomm/src/evs_proto.cpp:handle_install_timer():690: evs::proto(be28c9b9, GATHER, view_id(REG,17423d3f,11)) install timer expired'
If two nodes survive, the cluster should remain Primary. This issue is delaying our production readiness testing and defeats the purpose of clustering in the first place.
I have attached the debug logs and a packet capture supporting the argument that this was not a network issue.
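For context, a non-Primary component can be recovered manually with Galera's pc.bootstrap provider option, but needing that on every single-node power failure defeats the purpose of the cluster. A rough sketch, run on one of the two surviving nodes:

SHOW STATUS LIKE 'wsrep_cluster_status';                  -- reports 'non-Primary' after the failure
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=yes';   -- force the current component back to Primary
SHOW STATUS LIKE 'wsrep_cluster_status';                  -- should now report 'Primary'
SHOW STATUS LIKE 'wsrep_cluster_size';                    -- expected to be 2 with two survivors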