[MDEV-27410] Galera cluster hangs after one node reboots Created: 2022-01-03  Updated: 2022-01-17

Status: Open
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.5.12
Fix Version/s: None

Type: Bug Priority: Major
Reporter: PhilJing Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS 7.5
MariaDB 10.5.12
Galera provider version: 26.4.9



 Description   

Happy new year!

We have a Galera cluster with 3 nodes in production, and we recently upgraded from 10.3.10 to 10.5.12.
We use Keepalived + HAProxy to access the DB cluster.

Occasionally, after one node reboots, the Galera cluster hangs: SELECT statements still work, but UPDATE and DELETE statements cannot complete.
The application gets the error "Lock wait timeout exceeded; try restarting transaction", and there are no errors in the MySQL logs. I can run SELECT statements in a terminal, but any UPDATE or DELETE statement hangs.
wsrep_local_state_comment is "Synced" on all 3 nodes. wsrep_last_committed is static on all 3 nodes and no longer advances, and the value on one node is lower than on the other two.
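
For reference, the symptom above (all nodes "Synced" while the commit counter stands still) could be checked with something like the following minimal sketch. It assumes the wsrep status values of each node have already been collected twice, a few seconds apart and while writes were being attempted, into plain dictionaries; the function name and the sample numbers are hypothetical and not taken from the real cluster.

```python
def is_stalled(first_sample, second_sample):
    """Return True if every node reports "Synced" in the second sample
    and no node's wsrep_last_committed advanced between the samples."""
    for node, after in second_sample.items():
        before = first_sample[node]
        if after["wsrep_local_state_comment"] != "Synced":
            return False
        if int(after["wsrep_last_committed"]) > int(before["wsrep_last_committed"]):
            return False
    return True


# Hypothetical status values for illustration only.
sample_t0 = {
    "node1": {"wsrep_local_state_comment": "Synced", "wsrep_last_committed": "1000"},
    "node2": {"wsrep_local_state_comment": "Synced", "wsrep_last_committed": "1000"},
    "node3": {"wsrep_local_state_comment": "Synced", "wsrep_last_committed": "990"},
}
sample_t1 = {node: dict(values) for node, values in sample_t0.items()}

print(is_stalled(sample_t0, sample_t1))
```

An idle cluster would look the same, so the check only means something while the application is actually trying to write.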

To recover the cluster, it works to either restart the MariaDB instance whose wsrep_last_committed differs from the others, or restart the whole cluster with --wsrep-new-cluster.
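
Picking the instance to restart can be sketched as below, assuming the wsrep_last_committed value of each node has been read into a dictionary. The function name and the numbers are hypothetical; this only illustrates the comparison, not a tested recovery procedure.

```python
def lagging_nodes(last_committed):
    """Return the nodes whose wsrep_last_committed is below the highest value seen."""
    highest = max(last_committed.values())
    return sorted(node for node, seqno in last_committed.items() if seqno < highest)


# Hypothetical values: node3 is behind the other two.
print(lagging_nodes({"node1": 1000, "node2": 1000, "node3": 990}))
```

With these sample values the function reports node3, which corresponds to the instance that would be restarted in the first recovery option.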

The hang is much more likely when one node is powered off and another node is then rebooted.
Any advice would be appreciated, thanks!



 Comments   
Comment by PhilJing [ 2022-01-04 ]

Update:
After a lot of tests, I found the case is much easier to reproduce when I reboot node1, which is the only non-backup server in HAProxy and also the NTP server for the other 2 nodes.
The HAProxy configuration looks like this:

server node1 ip1 check inter 2000 rise 2 fall 5
server node2 ip2 check inter 2000 rise 2 fall 5 backup
server node3 ip3 check inter 2000 rise 2 fall 5 backup

I don't know if it is related.
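
As a rough aid for reading the check settings above: HAProxy marks a server down after `fall` consecutive failed checks and up again after `rise` consecutive successful ones, with checks `inter` milliseconds apart. The sketch below estimates the resulting windows. It is an approximation that ignores check timeouts and any fastinter/downinter settings, and the function name is hypothetical.

```python
def detection_times_ms(inter_ms, rise, fall):
    """Approximate time for HAProxy to mark a server down or up again."""
    return {
        "mark_down_ms": inter_ms * fall,  # consecutive failed checks needed
        "mark_up_ms": inter_ms * rise,    # consecutive successful checks needed
    }


# Values from the server lines above: inter 2000, rise 2, fall 5.
print(detection_times_ms(2000, 2, 5))
```

With these settings node1 would be treated as up for roughly 10 seconds after it stops answering checks, and would take traffic again roughly 4 seconds after it starts answering.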

Comment by PhilJing [ 2022-01-10 ]

I have run a lot of tests recently and will post any information that might be useful.

Comment by PhilJing [ 2022-01-17 ]

I found that some DDL (TRUNCATE) runs periodically, once every hour. I don't know if it is related. More tests are needed...

Generated at Thu Feb 08 09:52:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.