  1. MariaDB Server
  2. MDEV-38257

Galera hangs in "Waiting for certification" when Async Replication is in use

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.11.15
    • Fix Version/s: None
    • Component/s: Galera
    • Labels: None
    • MariaDB 10.11.15 with Galera 26.4.24; two 3-node Galera clusters connected via async replication, running on Ubuntu 22.04

    Description

      We ran into the following bug when connecting two Galera clusters via async replication. Cluster A replicates to Cluster B via a dedicated node in each cluster. After a couple of successful transactions, the Slave_SQL thread gets stuck in state "Commit" while the other nodes in Cluster B are stuck in "Waiting for certification". There is no I/O or CPU load and no progress at all, so this looks like a deadlock rather than a bottleneck.

      No other transactions are running on any node in Cluster B; the async replica is the only writer to this cluster. There are no blocked transactions and no locks held other than those of the applier threads.

      Replica Node in Cluster B shows this processlist:

      +------+-------------+-----------+----+-----------+-------+----------------------------------+------------------+----------+
      | Id   | User        | Host      | db | Command   | Time  | State                            | Info             | Progress |
      +------+-------------+-----------+----+-----------+-------+----------------------------------+------------------+----------+
      | 1    | system user |           |    | Sleep     | 65819 | wsrep aborter idle               |                  | 0.000    |
      | 2    | system user |           |    | Sleep     | 65819 |                                  |                  | 0.000    |
      | 8    | system user |           |    | Sleep     | 65819 |                                  |                  | 0.000    |
      | 10   | system user |           |    | Sleep     | 65819 | wsrep applier idle               |                  | 0.000    |
      | 9    | system user |           |    | Sleep     | 65819 |                                  |                  | 0.000    |
      | 7772 | root        | localhost |    | Sleep     | 4148  |                                  |                  | 0.000    |
      | 7880 | system user |           |    | Slave_IO  | 1140  | Waiting for master to send event |                  | 0.000    |
      | 7881 | system user |           |    | Slave_SQL | 1138  | Commit                           |                  | 0.000    |
      | 7919 | root        | localhost |    | Query     | 0     | starting                         | show processlist | 0.000    |
      +------+-------------+-----------+----+-----------+-------+----------------------------------+------------------+----------+
      

      Processlist on another node in Cluster B:

      +-----+-------------+-----------+----+---------+------+---------------------------+------------------+----------+
      | Id  | User        | Host      | db | Command | Time | State                     | Info             | Progress |
      +-----+-------------+-----------+----+---------+------+---------------------------+------------------+----------+
      | 2   | system user |           |    | Sleep   | 6784 | wsrep aborter idle        |                  | 0.000    |
      | 1   | system user |           |    | Sleep   | 2261 | wsrep applier committed   |                  | 0.000    |
      | 6   | system user |           |    | Sleep   | 2261 | Waiting for certification |                  | 0.000    |
      | 7   | system user |           |    | Sleep   | 2261 | Waiting for certification |                  | 0.000    |
      | 9   | system user |           |    | Sleep   | 2261 | Waiting for certification |                  | 0.000    |
      | 124 | root        | localhost |    | Query   | 0    | starting                  | show processlist | 0.000    |
      +-----+-------------+-----------+----+---------+------+---------------------------+------------------+----------+
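
      To compare processlists across nodes more easily, the tables above can be scanned mechanically. The following is only a minimal sketch, not an official diagnostic tool: the helper names `parse_processlist` and `stuck_threads`, the set of "suspect" states, and the 600-second threshold are assumptions made for illustration. It parses the ASCII table as printed by the client and reports threads that have sat in "Commit" or "Waiting for certification" for a long time.

```python
from typing import Dict, List

# States treated as "possibly stuck". This set is an assumption, taken from
# the two processlists in this report, not from any official list.
SUSPECT_STATES = {"Commit", "Waiting for certification"}


def parse_processlist(text: str) -> List[Dict[str, str]]:
    """Parse an ASCII processlist table into one dict per row."""
    rows: List[Dict[str, str]] = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip +---+ border lines, blank lines, and anything that is not a row.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        # Drop the outer pipes, then split into cells.
        # Note: this simple split breaks if a cell itself contains "|".
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first row is the column header
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def stuck_threads(rows: List[Dict[str, str]], min_seconds: int = 600) -> List[Dict[str, str]]:
    """Return rows in a suspect state for at least min_seconds (hypothetical threshold)."""
    return [
        row
        for row in rows
        if row["State"] in SUSPECT_STATES and int(row["Time"]) >= min_seconds
    ]
```

      Fed the replica's processlist, this would flag thread 7881 (Slave_SQL, "Commit"); fed the other node's processlist, it would flag threads 6, 7 and 9. Running it periodically on every node of Cluster B could help pin down when the hang starts.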
      

      Shutting down the blocked nodes lets the replica resume without error; only kill -9 worked, no other shutdown method did. No errors are logged on any node.

      Any suggestions on how to debug this further, or workarounds to try, are welcome.

          People

            Assignee: Unassigned
            Reporter: Andreas Vogler (Andreas.Vogler@geneon.de)