Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 10.6.15
Description
A few weeks ago we had a critical incident on our 3-node Galera Cluster in which node1 hit a transaction apply error.
If I understand correctly, node1 should have been evicted and gone into an uninitialized state so that node2 and node3 could continue as normal. This did not happen.
Node1 stayed in the cluster as a Primary member and then caused all commits on all nodes to hang, because it did not certify or apply any further write sets.
I also saw two wsrep threads shown as Killed in the node1 process list. From my research, these seem to have been killed internally by MariaDB.
In addition, the wsrep applier threads were not present on node1. We have 8 of them on each node, and they were present on node2 and node3.
On a hunch I stopped the mariadb service on node1, which released all hanging commits and allowed node2 and node3 to apply transactions as normal.
In the end I had to kill the mariadb process on node1, as it would not shut down cleanly.
This is a big concern for us as it caused more than an hour of downtime.
Related syslog entries:
node1:
Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1317, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 14582761987)
Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [ERROR] Slave SQL: Error executing row event: 'Query execution was interrupted', Internal MariaDB error code: 1317
Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 1317, seqno 14582761987
After this there are only normal log entries, with no indication of node1 attempting to leave the cluster.
node2:
Jan 24 10:38:29 localhost mariadbd[2502]: 2026-01-24 10:38:29 0 [Warning] WSREP: Failed to report last committed 94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)
After this there are only normal log entries, with no indication of node1 attempting to leave the cluster or of node2 seeing anything wrong.
node3:
Jan 24 10:38:29 db3 mariadbd[2579]: 2026-01-24 10:38:29 0 [Warning] WSREP: Failed to report last committed 94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)
After this there are only normal log entries, with no indication of node1 attempting to leave the cluster or of node3 seeing anything wrong.
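
To make the three excerpts easier to line up, here is a minimal Python sketch that pulls the sequence numbers out of the WSREP messages quoted above and computes how far node2 and node3 had progressed past the write set that failed on node1. The helper name `extract_seqno` and the regular expression are my own illustration, not part of MariaDB or Galera, and the pattern is only assumed to cover the message shapes seen in this report.

```python
import re

# Assumed pattern (illustrative only): matches "seqno 123" / "seqno: 123"
# as well as "<cluster-uuid>:123", the shapes that appear in this report.
SEQNO_RE = re.compile(
    r"(?:seqno:?\s+|[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}:)(\d+)"
)

def extract_seqno(message):
    """Return the first sequence number found in a WSREP log message, or None."""
    match = SEQNO_RE.search(message)
    return int(match.group(1)) if match else None

# Message portions copied from the syslog entries quoted in this report.
node1_msg = "WSREP: Event 3 Write_rows_v1 apply failed: 1317, seqno 14582761987"
node2_msg = (
    "WSREP: Failed to report last committed "
    "94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)"
)

failed = extract_seqno(node1_msg)          # write set that failed to apply on node1
last_committed = extract_seqno(node2_msg)  # last committed seqno logged by node2/node3
gap = last_committed - failed

print(f"failed={failed} last_committed={last_committed} gap={gap}")
```

The gap only shows how many write sets node2 and node3 had committed beyond the failing one at the time they logged the timeout. It does not by itself explain why the commits hung.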