[MDEV-32261] Galera Cluster does not mark lagging node as non-primary, wsrep_local_state_comment shows synced status. Entire cluster hangs with TOI. Created: 2023-09-27 Updated: 2023-09-27 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.6.11 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major |
| Reporter: | PITTA NEELIMA | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Prod |
||
| Attachments: |
|
| Description |
|
Galera Cluster does not mark a lagging node as non-primary; wsrep_local_state_comment still shows Synced, and the entire cluster hangs with TOI.

We have a 3-node Galera cluster at the primary site. There is another 3-node Galera cluster at a DR site, with binlog replication from node 1 (the master node) of the primary cluster to node 1 of the DR cluster. In wsrep_provider_options, node 1 has pc.weight set to 2, node 2 has it set to 1, and node 3 has it set to 0.

We have observed that sometimes one of the nodes (even one with pc.weight = 1 or 0) lags behind the rest of the cluster: it shows a wsrep_last_committed value lower than the other two nodes and a high wsrep_local_recv_queue value, but it is still NOT marked as a non-Primary component. The other nodes wait on the lagging node, and all 3 nodes hang: transactions wait forever either on commit or on "acquiring total order isolation" (sometimes due to a TRUNCATE that is not the original offender).

Surprisingly, wsrep_cluster_status is shown as Primary on all nodes, wsrep_cluster_size shows 3, wsrep_local_state_comment shows "Synced" on all nodes, and all nodes report wsrep_ready=yes and wsrep_connected=yes. On the lagging node, wsrep_local_recv_queue is greater than 1, but wsrep_last_committed remains frozen. No errors are shown in the mysqld log. The issue is not resolved unless we bounce the problematic node and, in some cases, the entire cluster.

Also, DML replication (especially deletes and updates) across cluster nodes is very slow: a delete of 10k rows takes 2 minutes and an update takes 4 minutes to sync across all the nodes. We tried higher values for evs.send_window and wsrep_slave_threads, but performance did not change. All the servers involved have 4 CPUs and 32 GB RAM. RTT between nodes is under 0.3 ms.

My.cnf values for 1st node -
Attaching the session logs taken from Galera nodes and the mysqld log files. |
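
For anyone trying to catch this state before the cluster hangs, the symptom described above (a node that reports Primary/Synced while its wsrep_last_committed trails the others and its wsrep_local_recv_queue stays above 1) can be checked mechanically. The following is only a minimal sketch, not something from the reporter's setup: the `NodeStatus` type, the `find_silent_laggards` function, the thresholds, and the sample numbers are all hypothetical, and the values are assumed to have been collected from `SHOW GLOBAL STATUS LIKE 'wsrep_%'` on each node by some other means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStatus:
    """One snapshot of the wsrep status variables named in this report."""
    name: str
    cluster_status: str   # wsrep_cluster_status, e.g. "Primary"
    state_comment: str    # wsrep_local_state_comment, e.g. "Synced"
    last_committed: int   # wsrep_last_committed
    recv_queue: int       # wsrep_local_recv_queue


def find_silent_laggards(
    nodes: List[NodeStatus],
    max_seqno_gap: int = 0,    # hypothetical threshold, tune per workload
    max_recv_queue: int = 1,   # the report saw values > 1 on the stuck node
) -> List[str]:
    """Return names of nodes that look healthy but are behind with a backlog."""
    if not nodes:
        return []
    # The most advanced node defines where the cluster "should" be.
    head = max(n.last_committed for n in nodes)
    laggards = []
    for n in nodes:
        looks_healthy = (
            n.cluster_status == "Primary"
            and n.state_comment.lower() == "synced"
        )
        behind = head - n.last_committed > max_seqno_gap
        backlog = n.recv_queue > max_recv_queue
        if looks_healthy and behind and backlog:
            laggards.append(n.name)
    return laggards


if __name__ == "__main__":
    # Made-up numbers for illustration only, not taken from the attached logs.
    snapshot = [
        NodeStatus("node1", "Primary", "Synced", 1000, 0),
        NodeStatus("node2", "Primary", "Synced", 1000, 0),
        NodeStatus("node3", "Primary", "Synced", 940, 57),
    ]
    print(find_silent_laggards(snapshot))
```

A single snapshot cannot show that wsrep_last_committed is frozen; taking two snapshots some seconds apart and comparing the value on the flagged node would be needed to confirm that part of the symptom.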