MariaDB Server / MDEV-26488

thread deadlock with galera cluster freeze when sending writes to multiple cluster members


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Component: Galera
    • Environment: debian 10 (buster); MariaDB 10.5.12 running under Docker; Galera cluster of 8 nodes

    Description

      We're experiencing frequent lockups of MariaDB with no known instigating event. Symptoms of the lockup:

      • one MariaDB node locks up (always the one to which we're directing writes)
      • logging for that node ceases
      • after a while, that node starts logging "Too many connections" when new connections are made
      • the rest of the nodes in the galera cluster are not locked up
      • the rest of the nodes see the locked node as being a member of the cluster
      • the cluster is flow-control paused 100% of the time
      • ProxySQL tries to direct writes to a new node, but the writes don't complete because the cluster is flow-control paused
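      The flow-control pause can be confirmed from any surviving node; a minimal sketch (connection host and credentials are placeholders, adjust as needed):

      ```shell
      # Check Galera flow-control state on a live node. A value of
      # wsrep_flow_control_paused near 1.0 means replication has been paused
      # almost continuously since the counters were last reset.
      mysql -h 127.0.0.1 -u root -p -e \
        "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused%';"
      ```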

      We can usually resolve the problem by restarting the frozen node. The cluster unpauses and everything continues on as it should. The restarted node rejoins the cluster.

      Occasionally another node freezes up immediately thereafter, or the node in question freezes up shortly after rejoining the cluster. Subsequent restarts eventually resolve the problem.

      The characteristics of the freeze are especially interesting and lead me to the theory that a subset of MariaDB's threads are deadlocked (a deadlock at the thread level, not an SQL transaction deadlock). If I try to connect to a frozen node (e.g. with the mysql command-line client), the connection is accept()ed, but MariaDB remains silent. The connection stays open and takes up a slot in MariaDB's connection table. Once MariaDB hits its connection limit, it starts responding to new connections with the "Too many connections" error and closing them.

      Before MariaDB hits its connection limit, connections are in a rather strange state. Clients disconnect after a while because MariaDB never performs the usual MySQL protocol handshake. However, MariaDB seems not to clean up the closed connections. Tools like `ss` and `netstat` show many connections in the CLOSE_WAIT state, which leads me to think that MariaDB has not called `close()` on the fds. This seems to be a distinctive characteristic of this kind of node lockup.
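      The CLOSE_WAIT build-up can be quantified with `ss`; a sketch, assuming the node listens on the default port 3306:

      ```shell
      # Count sockets stuck in CLOSE_WAIT on the MariaDB listener (port 3306
      # assumed). A steadily growing count suggests the server never close()s
      # the file descriptors of clients that have already hung up.
      ss -tan state close-wait '( sport = :3306 )' | tail -n +2 | wc -l
      ```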

      This happens several times a week under our current light workload. We actually have two 8-node galera clusters and both exhibit this problem multiple times per week.

      MariaDB 10.3.24 definitely did not exhibit this problem. I believe 10.3.28 did not either, but we did not run it long enough to be certain. 10.5.9 and higher have all exhibited this problem. We have not yet tried a 10.6-series version.

      I used GDB to take a stack dump of MariaDB while it was in the deadlocked state, then ran it through pt-pmp (see attached).
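      A capture along these lines should reproduce the attached dump (a sketch; the process name `mariadbd` and the output filenames are assumptions):

      ```shell
      # Attach gdb in batch mode, dump every thread's backtrace, then collapse
      # identical stacks with pt-pmp from Percona Toolkit.
      pid=$(pidof mariadbd)    # the process may be named mysqld on older packages
      gdb -p "$pid" -batch -ex 'thread apply all bt' > stacks.txt
      pt-pmp stacks.txt > stacks-summary.txt
      ```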

      I tried installing the debug symbols package that exactly matched the installed MariaDB version. However, the debug symbols it contains do not match the binaries. I verified this by checking the build ID embedded in the MariaDB server binary; it definitely did not match any symbols file included in the debug symbols package. This happened with packages downloaded from http://ftp.osuosl.org/pub/mariadb/repo/. Hopefully the lack of debug symbols doesn't render the stack trace useless...
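      The build-ID mismatch can be checked as follows (a sketch; the binary path and the Debian-style `/usr/lib/debug/.build-id` layout are assumptions):

      ```shell
      # Print the GNU build ID embedded in the server binary, then the IDs of
      # the installed separate debug files; gdb only loads symbols whose build
      # ID matches the binary's.
      readelf -n /usr/sbin/mariadbd | awk '/Build ID/ {print $3}'
      readelf -n /usr/lib/debug/.build-id/*/*.debug | awk '/Build ID/ {print $3}'
      ```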

      As described in MDEV-25883, I tried setting `mysql-kill_backend_connection_when_disconnect: false` in ProxySQL to avoid MDEV-23328. That did not prevent the deadlock.

      This is a critical problem for us, leading us to question whether we can continue to use Galera as we head toward production in the coming months. Any help would be appreciated. I stand ready to help debug as needed. Thanks!

      Attachments

      People

        Assignee: Unassigned
        Reporter: Lex Neva (lneva)
        Votes: 3
        Watchers: 6