MariaDB Server / MDEV-26488

thread deadlock with galera cluster freeze when sending writes to multiple cluster members


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: Galera
    • Environment: debian 10 (buster); MariaDB 10.5.12 running under Docker; Galera cluster of 8 nodes

    Description

      We're experiencing frequent lockups of MariaDB with no known instigating event. Symptoms of the lockup:

      • one MariaDB node locks up (always the one to which we're directing writes)
      • logging for that node ceases
      • after a while, that node starts logging "Too many connections" when new connections are made
      • the rest of the nodes in the galera cluster are not locked up
      • the rest of the nodes see the locked node as being a member of the cluster
      • the cluster is flow-control paused 100% of the time
      • ProxySQL tries to direct writes to a new node, but the writes don't complete because the cluster is flow-control paused
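      The flow-control pause can be confirmed from any surviving node; a minimal sketch (connection host and credentials are placeholders, adjust as needed):

      ```shell
      # Check Galera flow-control state on a live node. A value of
      # wsrep_flow_control_paused near 1.0 means replication has been paused
      # almost continuously since the counters were last reset.
      mysql -h 127.0.0.1 -u root -p -e \
        "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused%';"
      ```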

      We can usually resolve the problem by restarting the frozen node. The cluster unpauses and everything continues on as it should. The restarted node rejoins the cluster.

      Occasionally another node freezes up immediately thereafter, or the node in question freezes up shortly after rejoining the cluster. Subsequent restarts eventually resolve the problem.

      The characteristics of the freeze are especially interesting and lead me to the theory that a subset of MariaDB's threads are deadlocked (a deadlock at the thread level, not an SQL transaction deadlock). If I try to connect to a frozen node (e.g. with the mysql command-line client), the connection is accept()ed, but MariaDB remains silent. The connection stays open and takes up a slot in MariaDB's connection table. Once MariaDB hits its connection limit, it starts responding to new connections with the "Too many connections" error and closing them.

      Before MariaDB hits its connection limit, connections are in a rather strange state. Clients disconnect after a while because MariaDB never performs the usual MySQL protocol handshake. However, MariaDB seems not to clean up the closed connections. Tools like `ss` and `netstat` show many connections in the CLOSE_WAIT state, which leads me to think that MariaDB has not called `close()` on the fds. This seems to be a distinctive characteristic of this kind of node lockup.
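      The CLOSE_WAIT build-up can be quantified with `ss`; a sketch, assuming the node listens on the default port 3306:

      ```shell
      # Count sockets stuck in CLOSE_WAIT on the MariaDB listener (port 3306
      # assumed). A steadily growing count suggests the server never close()s
      # the file descriptors of clients that have already hung up.
      ss -tan state close-wait '( sport = :3306 )' | tail -n +2 | wc -l
      ```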

      This happens several times a week under our current light workload. We actually have two 8-node galera clusters and both exhibit this problem multiple times per week.

      MariaDB 10.3.24 definitely did not exhibit this problem. I believe 10.3.28 did not either, but we did not run it long enough to be certain. 10.5.9 and higher have all exhibited this problem. We have not yet tried a 10.6-series version.

      I used GDB to take a stack dump of MariaDB while it was in the deadlocked state, then ran it through pt-pmp (see attached).
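      A capture along these lines should reproduce the attached dump (a sketch; the process name `mariadbd` and the output filenames are assumptions):

      ```shell
      # Attach gdb in batch mode, dump every thread's backtrace, then collapse
      # identical stacks with pt-pmp from Percona Toolkit.
      pid=$(pidof mariadbd)    # the process may be named mysqld on older packages
      gdb -p "$pid" -batch -ex 'thread apply all bt' > stacks.txt
      pt-pmp stacks.txt > stacks-summary.txt
      ```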

      I tried installing the debug symbols package that exactly matched the installed MariaDB version. However, the debug symbols it contains do not match the binaries. I verified this by checking the build ID embedded in the MariaDB server binary; it definitely did not match any symbols file included in the debug symbols package. This happened with packages downloaded from http://ftp.osuosl.org/pub/mariadb/repo/. Hopefully the lack of debug symbols doesn't render the stack trace useless...
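      The build-ID mismatch can be checked as follows (a sketch; the binary path and the Debian-style `/usr/lib/debug/.build-id` layout are assumptions):

      ```shell
      # Print the GNU build ID embedded in the server binary, then the IDs of
      # the installed separate debug files; gdb only loads symbols whose build
      # ID matches the binary's.
      readelf -n /usr/sbin/mariadbd | awk '/Build ID/ {print $3}'
      readelf -n /usr/lib/debug/.build-id/*/*.debug | awk '/Build ID/ {print $3}'
      ```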

      As described in MDEV-25883, I tried setting `mysql-kill_backend_connection_when_disconnect: false` in ProxySQL to avoid MDEV-23328. That did not prevent the deadlock.

      This is a critical problem for us, leading us to question whether we can continue to use Galera as we head toward production in the coming months. Any help would be appreciated. I stand ready to help debug as needed. Thanks!

      Attachments

      People

        Assignee: Unassigned
        Reporter: Lex Neva (lneva)
        Votes: 3
        Watchers: 6