MariaDB Server / MDEV-38260

Galera: IST Stalls on New Node When Donor is an Active Asynchronous Slave


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.6, 11.4
    • Fix Version/s: 10.11
    • Component/s: None
    • Labels: None
    • Sprint: Q1/2026 Galera Development, Q2/2026 Galera Maintenance

    Description

      We have been struggling with a problem that has been seen at least twice recently.

      A joining node stalls in the IST phase if the donor in the cluster is an active asynchronous replication slave (replicating from another Galera cluster, for what it's worth).
      Desyncing the node, and hence disabling flow control, only frees the donor/async slave; it does not unblock the joiner's IST.
      Stopping replication on the donor unblocks the joiner.

      Factual Observations

      1. Consistent Stall Point: GDB stack traces from two independent occurrences of the issue show that a critical thread on the joining node hangs indefinitely inside the function
      galera::Monitor<galera::ReplicatorSMM::CommitOrder>::enter. This is the single, consistent point of failure observed.

      2. Specific Trigger Condition: The hang only occurs when a new node performs an IST from a donor that is also an active asynchronous replication slave. The issue does not occur under a regular client workload on the donor.

      3. Definitive Workaround: The issue is reliably avoided if asynchronous replication is stopped on the donor node (STOP SLAVE) before the new node initiates its join.

      4. Flow Control Is Not the Cause: We have confirmed that disabling flow control on the joining node via SET GLOBAL wsrep_desync=1 does not prevent the hang. This rules out network back-pressure as the root cause.

      5. IST Data Availability: We have confirmed that the historical writesets from the IST stream are already present in the joining node's local Galera cache (gcache). This shows that the donor is not withholding data and that the problem is
      internal to the joiner's processing logic.

      Confident Conclusions (Based on Evidence)

      1. The Issue is a Deadlock/Livelock in Commit Ordering: The function galera::Monitor<galera::ReplicatorSMM::CommitOrder>::enter is, by design, a synchronization point to enforce Galera's strict, sequential transaction commit
      order based on the global sequence number (seqno). The indefinite stall within this function indicates a deadlock or livelock (resource starvation) where the applier thread is waiting for a condition that can never be met.
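      The monitor's behavior can be sketched as follows. This is a minimal illustration, not Galera's actual implementation: enter() blocks until every earlier seqno has left, so a single predecessor that never commits stalls all later transactions indefinitely.

      ```cpp
      #include <condition_variable>
      #include <iostream>
      #include <mutex>
      #include <thread>
      #include <vector>

      // Seqno-gated commit monitor: enter(n) blocks until seqno n - 1 has left.
      class CommitMonitor {
          std::mutex mtx_;
          std::condition_variable cv_;
          long last_left_ = 0;  // highest seqno that has finished committing
      public:
          void enter(long seqno) {
              std::unique_lock<std::mutex> lk(mtx_);
              cv_.wait(lk, [&] { return last_left_ == seqno - 1; });
          }
          void leave(long seqno) {
              std::lock_guard<std::mutex> lk(mtx_);
              last_left_ = seqno;   // mark committed, wake later waiters
              cv_.notify_all();
          }
      };

      int main() {
          CommitMonitor mon;
          std::vector<std::thread> appliers;
          // Start appliers out of order; commits still happen in seqno order.
          for (long seqno : {3, 1, 2}) {
              appliers.emplace_back([&mon, seqno] {
                  mon.enter(seqno);
                  std::cout << "commit " << seqno << "\n";  // serialized by monitor
                  mon.leave(seqno);
              });
          }
          for (auto& t : appliers) t.join();
      }
      ```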

      2. The Problem is Internal to the Joining Node: Since the required historical writesets are already cached on the joining node, the failure is not due to a network issue or a fault in the donor's sending logic. The joining node
      has the data it needs but is logically incapable of processing it under these specific conditions.

      3. Async Replication Workload is the Trigger: The asynchronous slave workload is fundamentally different from a standard multi-client workload. It injects a relentless, high-throughput, and serialized stream of "live"
      transactions into the cluster. This specific type of workload is the necessary trigger for the deadlock, as a regular workload does not cause the same failure. The bug is not about load in general, but about this specific
      kind of incoming transaction stream.

      4. Slave Transactions Follow a Specific Code Path: Source code analysis confirms that transactions originating from a slave thread are correctly identified as non-local and are routed through the commit_order_enter_remote
      function path, which leads directly to the stalled monitor observed in the GDB traces.
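      Schematically, the routing looks like this (names and structure are a hypothetical sketch, not Galera's real code): a transaction applied by a slave thread is flagged non-local and takes the remote entry path into the commit-order monitor.

      ```cpp
      #include <iostream>

      struct Trx { bool is_local; long seqno; };

      // Hypothetical stand-ins for the two commit-order entry paths.
      void commit_order_enter_local(const Trx& t)  { std::cout << "local enter "  << t.seqno << "\n"; }
      void commit_order_enter_remote(const Trx& t) { std::cout << "remote enter " << t.seqno << "\n"; }

      void commit_order_enter(const Trx& t) {
          // Slave-thread transactions are non-local, so they take the remote path,
          // which leads to the monitor where the GDB traces show the stall.
          if (t.is_local) commit_order_enter_local(t);
          else            commit_order_enter_remote(t);
      }

      int main() {
          commit_order_enter({false, 42});  // async-slave originated transaction
          commit_order_enter({true,  43});  // local client transaction
      }
      ```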

      In summary, the bug is a resource contention deadlock within the joining node's applier-thread pool. It is triggered when the pool is saturated by "live" transactions from an async-slave donor, which block while waiting for their
      turn in the commit-order queue. This leaves no available threads to process the precedent-setting historical IST transactions that are already in the local cache, which are the very transactions that would clear the commit-order
      queue.
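      The starvation scenario above can be simulated directly. The sketch below assumes a FIFO work queue and a pool of two applier threads (invented for illustration, not Galera's real scheduler): both appliers pick up "live" slave transactions (seqnos 2 and 3) queued ahead of the cached IST writeset (seqno 1), block in commit-order enter waiting for seqno 1, and no applier remains to apply it. A bounded wait is used here so the demonstration reports the stall instead of hanging forever.

      ```cpp
      #include <chrono>
      #include <condition_variable>
      #include <deque>
      #include <iostream>
      #include <mutex>
      #include <thread>
      #include <vector>

      int main() {
          std::mutex mtx;
          std::condition_variable cv;
          long last_left = 0;                  // highest committed seqno
          bool stalled = false;
          // FIFO queue: live slave trxs (2, 3) arrived ahead of IST writeset (1).
          std::deque<long> queue = {2, 3, 1};

          auto applier = [&] {
              for (;;) {
                  long seqno;
                  {
                      std::lock_guard<std::mutex> lk(mtx);
                      if (queue.empty()) return;
                      seqno = queue.front();
                      queue.pop_front();
                  }
                  std::unique_lock<std::mutex> lk(mtx);
                  // Commit-order enter: wait for seqno - 1 (bounded here so the
                  // demo terminates; the real monitor would wait forever).
                  if (!cv.wait_for(lk, std::chrono::milliseconds(200),
                                   [&] { return last_left == seqno - 1; })) {
                      stalled = true;
                      std::cout << "applier stalled at seqno " << seqno << "\n";
                      return;
                  }
                  last_left = seqno;
                  cv.notify_all();
              }
          };

          std::vector<std::thread> pool;
          for (int i = 0; i < 2; ++i) pool.emplace_back(applier);  // pool of 2
          for (auto& t : pool) t.join();
          if (stalled) std::cout << "IST writeset 1 never applied\n";
      }
      ```

      With a pool of three appliers (or with the live stream stopped, as in the STOP SLAVE workaround), seqno 1 gets a thread, commits, and the queue drains normally.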

      GDB stack traces from two different instances of the problem are available on request.


          People

            Seppo Jaakola
            Claudio Nanni
            Votes: 1
            Watchers: 9
