MariaDB Server / MDEV-38260

Galera: IST Stalls on New Node When Donor is an Active Asynchronous Slave


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.6, 11.4
    • Fix Version/s: 10.11
    • Component/s: None
    • Labels: None
    • Sprint: Q1/2026 Galera Development, Q2/2026 Galera Maintenance

    Description

      We have been struggling with a problem that has been seen at least twice recently.

      A joining node stalls in the IST phase if the donor in the cluster is an active asynchronous replication slave (replicating from another Galera cluster, for what it's worth).
      Desyncing the node, and hence disabling flow control, only frees the donor/async slave; it does not unblock the joiner's IST.
      Stopping replication on the donor unblocks the joiner.

      Factual Observations

      1. Consistent Stall Point: GDB stack traces from two independent occurrences of the issue show that a critical thread on the joining node hangs indefinitely inside the function
      galera::Monitor<galera::ReplicatorSMM::CommitOrder>::enter. This is the single, consistent point of failure observed.

      2. Specific Trigger Condition: The hang only occurs when a new node performs an IST from a donor that is also an active asynchronous replication slave. The issue does not occur under a regular client workload on the donor.

      3. Definitive Workaround: The issue is reliably avoided if asynchronous replication is stopped on the donor node (STOP SLAVE) before the new node initiates its join.

      4. Flow Control Is Not the Cause: We have confirmed that disabling flow control on the joining node via SET GLOBAL wsrep_desync=1 does not prevent the hang. This rules out network back-pressure as the root cause.

      5. IST Data Availability: We have confirmed that the historical writesets from the IST stream are already present in the joining node's local Galera cache (gcache). This shows that the donor is not withholding data and that the problem is
      internal to the joiner's processing logic.

      Confident Conclusions (Based on Evidence)

      1. The Issue is a Deadlock/Livelock in Commit Ordering: The function galera::Monitor<galera::ReplicatorSMM::CommitOrder>::enter is, by design, a synchronization point to enforce Galera's strict, sequential transaction commit
      order based on the global sequence number (seqno). The indefinite stall within this function indicates a deadlock or livelock (resource starvation) where the applier thread is waiting for a condition that can never be met.
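      The monitor's behavior can be sketched as follows. This is a minimal illustration, not Galera's actual implementation: enter() blocks until every earlier seqno has left, so a single predecessor that never commits stalls all later transactions indefinitely.

      ```cpp
      #include <condition_variable>
      #include <iostream>
      #include <mutex>
      #include <thread>
      #include <vector>

      // Seqno-gated commit monitor: enter(n) blocks until seqno n - 1 has left.
      class CommitMonitor {
          std::mutex mtx_;
          std::condition_variable cv_;
          long last_left_ = 0;  // highest seqno that has finished committing
      public:
          void enter(long seqno) {
              std::unique_lock<std::mutex> lk(mtx_);
              cv_.wait(lk, [&] { return last_left_ == seqno - 1; });
          }
          void leave(long seqno) {
              std::lock_guard<std::mutex> lk(mtx_);
              last_left_ = seqno;   // mark committed, wake later waiters
              cv_.notify_all();
          }
      };

      int main() {
          CommitMonitor mon;
          std::vector<std::thread> appliers;
          // Start appliers out of order; commits still happen in seqno order.
          for (long seqno : {3, 1, 2}) {
              appliers.emplace_back([&mon, seqno] {
                  mon.enter(seqno);
                  std::cout << "commit " << seqno << "\n";  // serialized by monitor
                  mon.leave(seqno);
              });
          }
          for (auto& t : appliers) t.join();
      }
      ```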

      2. The Problem is Internal to the Joining Node: Since the required historical writesets are already cached on the joining node, the failure is not due to a network issue or a fault in the donor's sending logic. The joining node
      has the data it needs but is logically incapable of processing it under these specific conditions.

      3. Async Replication Workload is the Trigger: The asynchronous slave workload is fundamentally different from a standard multi-client workload. It injects a relentless, high-throughput, and serialized stream of "live"
      transactions into the cluster. This specific type of workload is the necessary trigger for the deadlock, as a regular workload does not cause the same failure. The bug is not about load in general, but about this specific
      kind of incoming transaction stream.

      4. Slave Transactions Follow a Specific Code Path: Source code analysis confirms that transactions originating from a slave thread are correctly identified as non-local and are routed through the commit_order_enter_remote
      function path, which leads directly to the stalled monitor observed in the GDB traces.
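      Schematically, the routing looks like this (names and structure are a hypothetical sketch, not Galera's real code): a transaction applied by a slave thread is flagged non-local and takes the remote entry path into the commit-order monitor.

      ```cpp
      #include <iostream>

      struct Trx { bool is_local; long seqno; };

      // Hypothetical stand-ins for the two commit-order entry paths.
      void commit_order_enter_local(const Trx& t)  { std::cout << "local enter "  << t.seqno << "\n"; }
      void commit_order_enter_remote(const Trx& t) { std::cout << "remote enter " << t.seqno << "\n"; }

      void commit_order_enter(const Trx& t) {
          // Slave-thread transactions are non-local, so they take the remote path,
          // which leads to the monitor where the GDB traces show the stall.
          if (t.is_local) commit_order_enter_local(t);
          else            commit_order_enter_remote(t);
      }

      int main() {
          commit_order_enter({false, 42});  // async-slave originated transaction
          commit_order_enter({true,  43});  // local client transaction
      }
      ```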

      In summary, the bug is a resource contention deadlock within the joining node's applier-thread pool. It is triggered when the pool is saturated by "live" transactions from an async-slave donor, which block while waiting for their
      turn in the commit-order queue. This leaves no available threads to process the precedent-setting historical IST transactions that are already in the local cache, which are the very transactions that would clear the commit-order
      queue.
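      The starvation scenario above can be simulated directly. The sketch below assumes a FIFO work queue and a pool of two applier threads (invented for illustration, not Galera's real scheduler): both appliers pick up "live" slave transactions (seqnos 2 and 3) queued ahead of the cached IST writeset (seqno 1), block in commit-order enter waiting for seqno 1, and no applier remains to apply it. A bounded wait is used here so the demonstration reports the stall instead of hanging forever.

      ```cpp
      #include <chrono>
      #include <condition_variable>
      #include <deque>
      #include <iostream>
      #include <mutex>
      #include <thread>
      #include <vector>

      int main() {
          std::mutex mtx;
          std::condition_variable cv;
          long last_left = 0;                  // highest committed seqno
          bool stalled = false;
          // FIFO queue: live slave trxs (2, 3) arrived ahead of IST writeset (1).
          std::deque<long> queue = {2, 3, 1};

          auto applier = [&] {
              for (;;) {
                  long seqno;
                  {
                      std::lock_guard<std::mutex> lk(mtx);
                      if (queue.empty()) return;
                      seqno = queue.front();
                      queue.pop_front();
                  }
                  std::unique_lock<std::mutex> lk(mtx);
                  // Commit-order enter: wait for seqno - 1 (bounded here so the
                  // demo terminates; the real monitor would wait forever).
                  if (!cv.wait_for(lk, std::chrono::milliseconds(200),
                                   [&] { return last_left == seqno - 1; })) {
                      stalled = true;
                      std::cout << "applier stalled at seqno " << seqno << "\n";
                      return;
                  }
                  last_left = seqno;
                  cv.notify_all();
              }
          };

          std::vector<std::thread> pool;
          for (int i = 0; i < 2; ++i) pool.emplace_back(applier);  // pool of 2
          for (auto& t : pool) t.join();
          if (stalled) std::cout << "IST writeset 1 never applied\n";
      }
      ```

      With a pool of three appliers (or with the live stream stopped, as in the STOP SLAVE workaround), seqno 1 gets a thread, commits, and the queue drains normally.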

      GDB stack traces from two different instances of the problem are available on request.


          People

            Seppo Jaakola
            Claudio Nanni
            Votes: 1
            Watchers: 9
