  1. MariaDB Server
  2. MDEV-38843

WSREP: BF applier failed on a node causing complete Cluster lockup


Details

    • Can result in hang or crash
      Fixed a rare Galera cluster lockup where, after a write set could not be applied on one node, that node would silently stay in the cluster instead of stepping out. New write sets kept arriving but could not be processed, leaving the cluster unable to make progress until the node was killed and restarted. The node now reports the apply failure to the cluster, loses the consistency check, and is evicted automatically so it can rejoin via state transfer.
    • Q3/2026 Replic. Maintenance

    Description

      A few weeks ago, we had a critical event on our 3-node Galera cluster in which node1 hit a transaction apply error.

      If I understand correctly, node1 should have been evicted and become uninitialized so that node2 and node3 could continue as normal; however, this did not happen.

      Node1 stayed part of the cluster as Primary and then caused all commits on all nodes to hang, because it no longer certified or applied any further write sets.
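
A minimal sketch of how the condition described above (a node that still reports Primary but has stopped applying) might be detected from outside. It assumes two samples of `SHOW GLOBAL STATUS LIKE 'wsrep_%'` taken a few seconds apart and passed in as plain dicts. The function name `looks_stalled` is hypothetical, and the assumption that the receive queue is non-empty during such a stall is not confirmed by this report.

```python
def looks_stalled(before, after):
    """Heuristic: node is Primary, has queued write sets, but applied nothing.

    `before` and `after` map wsrep status variable names to their values,
    sampled a few seconds apart (values may be strings, as returned by SQL).
    """
    primary = after.get("wsrep_cluster_status") == "Primary"
    # wsrep_last_committed should advance while the cluster takes writes.
    no_progress = int(after["wsrep_last_committed"]) == int(before["wsrep_last_committed"])
    # Assumption: a stalled applier leaves write sets waiting in the queue.
    backlog = int(after.get("wsrep_local_recv_queue", 0)) > 0
    return primary and no_progress and backlog
```

This is only a heuristic: an idle cluster also shows no progress, which is why the sketch additionally requires a non-empty receive queue before flagging the node.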

      I also saw two wsrep threads shown as Killed in the node1 process list; from my research, these seem to have been killed internally by MariaDB.

      In addition, the wsrep applier threads were missing on node1. We have 8 of them on each node, and they were present on node2 and node3.
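
The applier-thread observation above could be turned into a simple check, sketched here under stated assumptions. It takes rows from `SHOW PROCESSLIST` as dicts and counts those whose `State` begins with `wsrep applier`. That state text is an assumption and may differ between MariaDB versions; the helper names are hypothetical, and the expected count of 8 comes from this report.

```python
def count_appliers(processlist_rows):
    """Count process list rows that look like wsrep applier threads."""
    return sum(
        1
        for row in processlist_rows
        # Assumption: applier threads report a State starting with this text.
        if (row.get("State") or "").lower().startswith("wsrep applier")
    )


def missing_appliers(processlist_rows, expected=8):
    """Return how many applier threads are absent compared with `expected`."""
    return max(expected - count_appliers(processlist_rows), 0)
```

On a healthy node the result would be 0; on a node in the state described in this report it would equal the configured applier count.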

      On a hunch, I stopped the mariadb service on node1, which released all hanging commits and allowed node2 and node3 to apply transactions normally.
      In the end, I had to kill the mariadb process on node1 because it would not shut down cleanly.

      This is a big concern for us, as it caused more than an hour of downtime.

      Related syslog entries:

      node1:
      Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1317, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 14582761987)
      Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [ERROR] Slave SQL: Error executing row event: 'Query execution was interrupted', Internal MariaDB error code: 1317
      Jan 24 10:38:22 db1 mariadbd[2641237]: 2026-01-24 10:38:22 14 [Warning] WSREP: Event 3 Write_rows_v1 apply failed: 1317, seqno 14582761987
      After this there were only normal log entries, with no indication of node1 attempting to leave the cluster.
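
Since node1 logged the apply failure only at Warning level and then carried on, scanning the log for that line is one way to catch it. The sketch below parses the `apply failed` line in the format shown above. The regular expression is derived from this single sample and may not cover other versions or event types; `parse_apply_failure` is a hypothetical helper.

```python
import re

# Pattern derived from the sample line in this report (an assumption about
# the general format, not a documented log grammar).
APPLY_FAILED = re.compile(
    r"WSREP: Event (?P<event>\d+) (?P<type>\S+) apply failed: "
    r"(?P<errno>\d+), seqno (?P<seqno>\d+)"
)


def parse_apply_failure(line):
    """Return event number, event type, error code and seqno, or None."""
    m = APPLY_FAILED.search(line)
    if m is None:
        return None
    return {
        "event": int(m["event"]),
        "type": m["type"],
        "errno": int(m["errno"]),
        "seqno": int(m["seqno"]),
    }
```

Applied to the third node1 line above, this yields error code 1317 and seqno 14582761987, which an alerting rule could act on.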

      node2:
      Jan 24 10:38:29 localhost mariadbd[2502]: 2026-01-24 10:38:29 0 [Warning] WSREP: Failed to report last committed 94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)
      After this there were only normal log entries, with no indication of node1 attempting to leave the cluster or of node2 seeing anything wrong.

      node3:
      Jan 24 10:38:29 db3 mariadbd[2579]: 2026-01-24 10:38:29 0 [Warning] WSREP: Failed to report last committed 94a81217-9350-11e9-a666-bae2f92ef610:14582761995, -110 (Connection timed out)
      After this there were only normal log entries, with no indication of node1 attempting to leave the cluster or of node3 seeing anything wrong.
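
To correlate the node2/node3 warnings with the node1 failure, the sketch below extracts the cluster UUID, seqno, and error code from a `Failed to report last committed` line and computes how far that seqno lies past the one that failed on node1. The pattern is derived from the two sample lines above and the helper names are hypothetical; what the gap means for Galera's internals is not stated in this report.

```python
import re

# Pattern derived from the node2/node3 sample lines (an assumption).
REPORT_FAILED = re.compile(
    r"WSREP: Failed to report last committed "
    r"(?P<uuid>[0-9a-f-]+):(?P<seqno>\d+), (?P<err>-?\d+)"
)


def parse_report_failure(line):
    """Return (cluster_uuid, seqno, error_code), or None if no match."""
    m = REPORT_FAILED.search(line)
    if m is None:
        return None
    return m["uuid"], int(m["seqno"]), int(m["err"])


def seqno_gap(failed_seqno, line):
    """Seqnos between the failed write set and the reported last committed."""
    parsed = parse_report_failure(line)
    return None if parsed is None else parsed[1] - failed_seqno
```

With the values in this report, the seqno that node2 and node3 could not report (14582761995) is 8 past the write set that failed on node1 (14582761987), and the warnings appear 7 seconds after the node1 failure.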

            People

              hemantdangi Hemant Dangi
              stephanvos Stephan Vos
              Votes: 1
              Watchers: 5


                Time Tracking

                  Estimated: 3.4d
                  Remaining: 0d
                  Logged: 1d 4.5h
