We updated our MariaDB Galera Cluster to 10.5.13 last week. Since then, we have hit the following issue every time we try to switch the "master" node.
When we switch traffic from one node to another while the service is under medium load (15,000-20,000 q/s), the new node freezes completely.
The wsrep status on that node still looks like that of a normal cluster member (Synced, with 5 IPs listed), but the other members exclude it from the quorum.
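For anyone who wants to spot this mismatch automatically, here is a minimal sketch of how the per-node views could be compared. It assumes you have already collected `SHOW GLOBAL STATUS LIKE 'wsrep_%'` from each node into a dict. The node names and values below are made up for illustration, and it is an assumption that the healthy members report a smaller `wsrep_cluster_size` than the frozen node. The function name `find_disagreeing_nodes` is hypothetical.

```python
from collections import Counter


def find_disagreeing_nodes(status_by_node):
    """Return nodes whose wsrep_cluster_size differs from the majority view."""
    sizes = Counter(s["wsrep_cluster_size"] for s in status_by_node.values())
    majority_size, _ = sizes.most_common(1)[0]
    return sorted(
        node
        for node, s in status_by_node.items()
        if s["wsrep_cluster_size"] != majority_size
    )


# Hypothetical snapshot: node3 still believes it is Synced in a 5-node
# cluster, while the other four members have dropped it from their view.
views = {
    "node1": {"wsrep_local_state_comment": "Synced", "wsrep_cluster_size": "4"},
    "node2": {"wsrep_local_state_comment": "Synced", "wsrep_cluster_size": "4"},
    "node3": {"wsrep_local_state_comment": "Synced", "wsrep_cluster_size": "5"},
    "node4": {"wsrep_local_state_comment": "Synced", "wsrep_cluster_size": "4"},
    "node5": {"wsrep_local_state_comment": "Synced", "wsrep_cluster_size": "4"},
}

suspects = find_disagreeing_nodes(views)
print(suspects)
```

Comparing only the cluster size is a deliberate simplification. Checking `wsrep_incoming_addresses` or `wsrep_cluster_status` in the same way would be a natural extension.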
The log is filled, in an infinite loop, with the following message:
InnoDB: WSREP: BF lock wait long for trx:11255701331 query: INSERT INTO
The same 7-10 unique INSERTs are repeated in the log.
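To count how many distinct transactions and statements are actually involved, something like the following sketch can be run over the error log. The regular expression is based only on the message quoted above. The real log lines may carry a timestamp or log-level prefix, so the pattern is searched for anywhere in the line. The function name `summarize_bf_waits` is hypothetical, and the sample lines just repeat the quoted message.

```python
import re
from collections import Counter

# Based on the message format quoted above; searched anywhere in the line
# so that any timestamp or log-level prefix is ignored.
BF_WAIT = re.compile(r"WSREP: BF lock wait long for trx:(\S+) query: (.*)")


def summarize_bf_waits(lines):
    """Count 'BF lock wait long' messages per trx id and per query text."""
    by_trx = Counter()
    by_query = Counter()
    for line in lines:
        match = BF_WAIT.search(line)
        if match is None:
            continue  # not a BF lock wait message
        by_trx[match.group(1)] += 1
        by_query[match.group(2).strip()] += 1
    return by_trx, by_query


# Sample input: the quoted message repeated, plus one unrelated line.
sample = [
    "InnoDB: WSREP: BF lock wait long for trx:11255701331 query: INSERT INTO",
    "InnoDB: WSREP: BF lock wait long for trx:11255701331 query: INSERT INTO",
    "some unrelated log line",
]

by_trx, by_query = summarize_bf_waits(sample)
print(len(by_trx), "distinct trx ids;", len(by_query), "distinct queries")
```

For a real log, pass an open file object instead of `sample`, since the function accepts any iterable of lines.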
The whole cluster freezes until we shut down mysqld on the "bad" node with a regular service shutdown: /usr/local/etc/rc.d/mysql-server onestop.
When we stop the node with the regular shutdown procedure, the node is excluded from the cluster and service operations continue as normal. However, mysqld on that node keeps printing
the queries noted above in an infinite loop. The only way to stop mysqld on the "bad" node is kill -9.
What is specific here is that the query (INSERT) and the target table are always the same. We have a heavy INSERT load on this table and one daemon that processes the data in the background. If we stop the daemon before switching nodes, there are no issues.