[MDEV-26869] Mariadb going to non-primary after one node leaves the cluster while doing host shutdown. Created: 2021-10-20 Updated: 2023-12-11 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.2.32, 10.4 |
| Fix Version/s: | 10.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jasvinder singh kwatra | Assignee: | Seppo Jaakola |
| Resolution: | Unresolved | Votes: | 2 |
| Labels: | None |
| Attachments: |
|
| Description |
|
We are facing this issue in a MariaDB Galera cluster deployed in Kubernetes. When the host running one of the Galera nodes has a power failure, the other two Galera nodes go non-primary. We have verified that the network between the nodes that remained in the cluster was stable on port 4567. After enabling debug logging, the only difference I could find between the runs where the problem does not occur and the runs where it does is that, in the failing case, the install message is never exchanged between the nodes.

Good node scenario (install message is handled):

021-10-15T19:29:22.390596315Z stderr F 2021-10-15 19:29:22,390 - OpenStack-Helm Mariadb - INFO - b'2021-10-15 19:29:22 140341999027968 [Note] [Debug] WSREP: gcomm/src/pc_proto.cpp:handle_install():1103: cd32f6ad handle install from a4cb7bc7 pcmsg{ type=INSTALL, seq=0, flags= 0, node_map {\ta4cb7bc7,prim=1,un=0,last_seq=58,last_prim=view_id(PRIM,a4cb7bc7,19),to_seq=245997,weight=1,segment=0'

Bad node scenario (install timer expires):

2021-10-15T19:29:54.4610095Z stderr F 2021-10-15 19:29:54,460 - OpenStack-Helm Mariadb - INFO - b'2021-10-15 19:29:54 140712551864064 [Warning] WSREP: gcomm/src/evs_proto.cpp:handle_install_timer():690: evs::proto(be28c9b9, GATHER, view_id(REG,17423d3f,11)) install timer expired'

If two of the three nodes survive, the cluster should remain primary. This issue is delaying our production readiness testing and defeats the purpose of clustering in the first place. I have attached the debug logs and a pcap supporting the argument that this was not a network issue. |
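The comparison described above (install message handled versus install timer expired) can be automated when going through large debug logs. The following is a minimal sketch, assuming the log lines keep the same shape as the two excerpts quoted in the description; the regular expressions and the `classify_line` helper are illustrative names, not part of Galera or MariaDB.

```python
import re

# Patterns derived only from the two log excerpts quoted in the description.
# They are assumptions about the log format, not an official Galera format.
INSTALL_RECEIVED = re.compile(
    r"handle_install\(\):\d+: (\w+) handle install from (\w+)"
)
INSTALL_TIMEOUT = re.compile(
    r"handle_install_timer\(\):\d+: evs::proto\((\w+), (\w+), .*\) install timer expired"
)


def classify_line(line):
    """Return a tuple describing an install-related event, or None.

    ("install_received", receiver_id, sender_id) for a handled install message,
    ("install_timer_expired", node_id, evs_state) for an expired install timer.
    """
    match = INSTALL_RECEIVED.search(line)
    if match:
        return ("install_received", match.group(1), match.group(2))
    match = INSTALL_TIMEOUT.search(line)
    if match:
        return ("install_timer_expired", match.group(1), match.group(2))
    return None


def install_events(lines):
    """Collect all install-related events from an iterable of log lines."""
    return [event for event in map(classify_line, lines) if event is not None]
```

Running `install_events` over the log of each surviving node shows at a glance whether that node ever handled an install message or only hit the install timer.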
| Comments |
| Comment by Peter2121 [ 2022-10-12 ] |
|
We are facing the same issue. |
| Comment by Prachi Jain [ 2023-02-09 ] |
|
We are facing the same issue. We have 8 nodes in total, 4 in each datacenter; 7 of the 8 nodes are set to weight 1 and the 8th node to weight 0. If one of the nodes in the standby DC goes down for a rebuild, the entire cluster drops to the non-primary state. We see the same "install timer expired" error message as reported by others above. We are on MariaDB 10.5.18 and galera3. |
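For reference, the quorum arithmetic behind the expectation in this report can be sketched as follows. This assumes the weighted quorum rule given in the Galera documentation (the sum of weights of the present members must exceed half of the sum of weights of the previous primary component, after subtracting members that left gracefully). The node names and the `has_quorum` helper are hypothetical, and the sketch covers only the arithmetic, not the membership protocol step in which the install timer expires.

```python
def has_quorum(prev_weights, present, left_gracefully=()):
    """Weighted quorum check as described in the Galera documentation.

    prev_weights:    dict of node name -> weight in the last primary component
    present:         names of the nodes still reachable
    left_gracefully: names of the nodes that left with a clean shutdown
    """
    total_prev = sum(prev_weights.values())
    total_left = sum(prev_weights[name] for name in left_gracefully)
    total_present = sum(prev_weights[name] for name in present)
    return (total_prev - total_left) / 2 < total_present


# Original report: 3 nodes of weight 1, one lost to a power failure.
three_nodes = {"node-a": 1, "node-b": 1, "node-c": 1}
survivors = ["node-a", "node-b"]
three_node_ok = has_quorum(three_nodes, survivors)

# Comment above: 8 nodes, seven with weight 1 and one with weight 0.
# Which datacenter holds the weight-0 node is not stated; the names are made up.
eight_nodes = {f"node-{i}": 1 for i in range(1, 8)}
eight_nodes["node-8"] = 0
remaining = [name for name in eight_nodes if name != "node-5"]
eight_node_ok = has_quorum(eight_nodes, remaining)
```

In both cases the check evaluates to true, so the weight arithmetic alone would not explain the move to non-primary; this matches the reporters' expectation that the surviving nodes should stay primary.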