[MDEV-38843] WSREP: BF applier failed on a node causing complete Cluster lockup

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.6.15
Fix Version/s: 10.6.28, 10.11.19, 11.4.13, 11.8.9, 13.0.2
Component/s: Galera
Labels:
- analysis_needed

Bug Category: Can result in hang or crash
Release Note Summary:

Fixed a rare Galera cluster lockup where, after a write set could not be applied on one node, that node would silently stay in the cluster instead of stepping out. New write sets kept arriving but could not be processed, leaving the cluster unable to make progress until the node was killed and restarted. The node now reports the apply failure to the cluster, loses the consistency check, and is evicted automatically so it can rejoin via state transfer.
Sprint: Q3/2026 Replic. Maintenance

Description

A few weeks ago we had a critical event on our 3-node Galera cluster, where node1 hit a transaction apply error.

If I understand correctly, node1 should have been evicted and gone uninitialized so that node2 and node3 could continue as normal; however, this did not happen.

Node1 stayed part of the cluster as a Primary and then caused all commits on all nodes to hang, because it did not certify or apply any further write sets.

I also saw two wsrep threads shown as Killed in the node1 process list; from my research, these appear to have been killed internally by MariaDB.

Also, the wsrep applier threads were missing on node1. We run 8 of them on each node, and they were present on node2 and node3.

On a hunch I stopped the mariadb service on node1, which released all hanging commits and allowed node2 and node3 to apply transactions normally.
In the end I had to kill the mariadb process on node1, as it would not shut down cleanly.

This is a big concern for us, as it caused more than an hour of downtime.

Related syslog entries:

node1:
Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1317, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 14582761987)
Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [ERROR] Slave SQL: Error executing row event: 'Query execution was interrupted', Internal MariaDB error code: 1317
Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 1317, seqno 14582761987
After this there were only normal log entries, with no indication of node1 attempting to leave the cluster.
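For anyone who wants to watch for this condition, the node1 warning above can be picked out of syslog with a small script. The following is only a minimal sketch: the regular expression is written against the exact wording of the line quoted in this report, and the function name `find_apply_failures` is hypothetical. Other MariaDB or Galera versions may word the message differently.

```python
import re

# Pattern written against the "apply failed" warning quoted in this report.
# The message wording is an assumption for any other server version.
APPLY_FAILED = re.compile(
    r"WSREP: Event (?P<event_no>\d+) (?P<event_type>\S+) apply failed: "
    r"(?P<errno>\d+), seqno (?P<seqno>\d+)"
)


def find_apply_failures(lines):
    """Return one dict per 'apply failed' warning found in the given log lines."""
    failures = []
    for line in lines:
        match = APPLY_FAILED.search(line)
        if match:
            failures.append({
                "event_type": match.group("event_type"),
                "errno": int(match.group("errno")),
                "seqno": int(match.group("seqno")),
            })
    return failures


# The node1 line from this report, used as sample input.
sample = [
    "Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [Warning] "
    "WSREP: Event 3 Write_rows_v1 apply failed: 1317, seqno 14582761987",
]

for failure in find_apply_failures(sample):
    print(failure["event_type"], failure["errno"], failure["seqno"])
```

A match here only says that an applier hit an error. Whether the node then left the cluster still has to be checked separately, which is exactly what did not happen in this incident.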

node2:
Jan 24 10:38:29 localhost mariadbd[2502]: 2026-01-24 10:38:29 0 [Warning] WSREP: Failed to report last committed 94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)
After this there were only normal log entries, with no indication of node1 attempting to leave the cluster or of node2 seeing anything wrong.

node3:
Jan 24 10:38:29 db3 mariadbd[2579]: 2026-01-24 10:38:29 0 [Warning] WSREP: Failed to report last committed 94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)
After this there were only normal log entries, with no indication of node1 attempting to leave the cluster or of node3 seeing anything wrong.
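To relate the entries across nodes, the sequence numbers can be compared directly. The sketch below parses the "Failed to report last committed" warning from node3 and computes how far it is past the seqno that failed to apply on node1. The regular expression and the function name `parse_last_committed` are illustrative assumptions based only on the lines quoted above.

```python
import re

# Pattern written against the warning quoted from node2 and node3.
LAST_COMMITTED = re.compile(
    r"WSREP: Failed to report last committed "
    r"(?P<uuid>[0-9a-f-]{36}):(?P<seqno>\d+), (?P<err>-?\d+)"
)


def parse_last_committed(line):
    """Return (cluster_uuid, seqno, error_code), or None if the line does not match."""
    match = LAST_COMMITTED.search(line)
    if match is None:
        return None
    return match.group("uuid"), int(match.group("seqno")), int(match.group("err"))


# Seqno of the write set that failed to apply on node1 (from the node1 log above).
node1_failed_seqno = 14582761987

node3_line = (
    "Jan 24 10:38:29 db3 mariadbd[2579]: 2026-01-24 10:38:29 0 [Warning] "
    "WSREP: Failed to report last committed "
    "94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)"
)

cluster_uuid, seqno, err = parse_last_committed(node3_line)
gap = seqno - node1_failed_seqno
print(f"last committed is {gap} write sets past the failed seqno, error {err}")
```

With the values from this report the gap is 8 write sets, logged 7 seconds after the apply failure on node1.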

Issue Links

duplicates

MDEV-33204 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1317, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: NNNN)

Open

People

Assignee: Hemant Dangi

Reporter: Stephan Vos

Votes: 1

Watchers: 5

Dates

Created: 2026-02-16 09:28

Updated: 2026-05-29 08:15

Resolved: 2026-05-28 12:16

Time Tracking

Estimated: 3.4d

Remaining:

Logged: 1d 4.5h
