[MDEV-26488] thread deadlock with galera cluster freeze when sending writes to multiple cluster members
Created: 2021-08-27  Updated: 2021-09-16

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major
Reporter: Lex Neva
Assignee: Unassigned
Resolution: Unresolved
Votes: 3
Labels: None
Environment:

debian 10 (buster)
MariaDB 10.5.12 running under Docker
Galera cluster of 8 nodes


Attachments: HTML File summarized-stack-trace    

 Description   

We're experiencing frequent lockups of MariaDB with no known instigating event. Symptoms of the lockup:

  • one MariaDB node locks up (always the one to which we're directing writes)
  • logging for that node ceases
  • after a while, that node starts logging "too many connections" when new connections are made
  • the rest of the nodes in the galera cluster are not locked up
  • the rest of the nodes see the locked node as being a member of the cluster
  • the cluster is flow-control paused 100% of the time
  • ProxySQL tries to direct writes to a new node, but the writes don't complete because the cluster is flow-control paused

We can usually resolve the problem by restarting the frozen node. The cluster unpauses and everything continues on as it should. The restarted node rejoins the cluster.

Occasionally another node freezes up immediately thereafter, or the node in question freezes up shortly after rejoining the cluster. Subsequent restarts eventually resolve the problem.

The characteristics of the freeze are especially interesting and lead me to suspect that a subset of MariaDB's threads are deadlocked (at the process level, not an SQL transaction deadlock). If I try to connect to a frozen node (e.g. with the mysql command-line client), the connection is accept()ed, but MariaDB never sends anything. The connection remains open and occupies a slot in MariaDB's connection table. Once MariaDB hits its connection limit, it responds to new connections with the "Too many connections" error and closes them.

Before MariaDB hits its connection limit, connections are in a rather strange state. Clients disconnect after a while because MariaDB never sends the usual MySQL protocol handshake. However, MariaDB does not seem to clean up the closed connection. Tools like `ss` and `netstat` show many connections in the CLOSE_WAIT state, which leads me to think that MariaDB has not called `close()` on the fd. This seems to be a distinctive characteristic of this kind of node lockup.
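As a rough way to watch for this symptom, the sketch below counts socket states on the MariaDB port from the text output of `ss -tan`. It is an illustration only: the port (3306), the sample addresses, and the function name `count_states` are assumptions and are not taken from the affected cluster. Note that `ss` prints the state as `CLOSE-WAIT`, whereas `netstat` prints `CLOSE_WAIT`.

```python
from collections import Counter

# Made-up sample in the column layout printed by `ss -tan`.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN     0      80     0.0.0.0:3306       0.0.0.0:*
CLOSE-WAIT 1      0      10.0.0.5:3306      10.0.0.9:51234
CLOSE-WAIT 1      0      10.0.0.5:3306      10.0.0.9:51240
ESTAB      0      0      10.0.0.5:3306      10.0.0.7:40022
ESTAB      0      0      10.0.0.5:22        10.0.0.2:60100
"""


def count_states(ss_output: str, port: int = 3306) -> Counter:
    """Count TCP states for sockets whose local port equals `port`."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Expected columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] == "State":
            continue
        local_port = fields[3].rsplit(":", 1)[-1]
        if local_port == str(port):
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    for state, n in count_states(SAMPLE).most_common():
        print(state, n)
```

On a live host the input could come from `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`. A steadily growing `CLOSE-WAIT` count on the server port would match the behaviour described above.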

This happens several times a week under our current light workload. We have two 8-node Galera clusters, and both exhibit the problem at that rate.

MariaDB 10.3.24 definitely did not exhibit this problem. I believe 10.3.28 did not either, but we did not run it long enough to be certain. 10.5.9 and higher have all exhibited this problem. We have not yet tried a 10.6-series version.

I used GDB to take a stack dump of MariaDB while it was in the deadlocked state, then ran the dump through pt-pmp (see the attached summarized-stack-trace).
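For readers unfamiliar with this kind of summary, the sketch below shows the general idea of collapsing a GDB `thread apply all bt` dump into per-stack thread counts. It is a simplified approximation, not pt-pmp's actual implementation, and the sample dump is invented for illustration. Without matching debug symbols, most server frames show up as `??`, as they would in such a trace.

```python
import re
from collections import Counter

# Matches "#0  0x00007f0a in func () ..." as well as "#1  func (args) at file:line".
FRAME = re.compile(r"^#\d+\s+(?:0x[0-9a-fA-F]+\s+in\s+)?(\S+)")

# Invented sample in the shape printed by `thread apply all bt`.
SAMPLE = """\
Thread 3 (Thread 0x7f01 (LWP 103)):
#0  0x00007f0a in pthread_cond_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055d1 in ?? ()

Thread 2 (Thread 0x7f02 (LWP 102)):
#0  0x00007f0a in pthread_cond_wait () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055d1 in ?? ()

Thread 1 (Thread 0x7f03 (LWP 101)):
#0  0x00007f0b in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000055d2 in ?? ()
"""


def summarize(gdb_output, max_frames=None):
    """Return a Counter mapping 'f0,f1,...' call stacks to thread counts."""
    threads = []
    for raw in gdb_output.splitlines():
        line = raw.strip()
        if line.startswith("Thread "):
            threads.append([])          # a new thread section begins
        elif threads:
            m = FRAME.match(line)
            if m:
                threads[-1].append(m.group(1))
    return Counter(",".join(t[:max_frames]) for t in threads if t)


if __name__ == "__main__":
    for stack, n in summarize(SAMPLE).most_common():
        print(n, stack)
```

Many threads sharing one stack that ends in a lock or condition wait is the pattern that would point to a process-level deadlock.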

I tried installing the debug symbols package that exactly matched the version of MariaDB installed. However, the debug symbols included in it do not match the binaries. I verified this by checking the build ID embedded in the mariadb server binary; it definitely did not match any symbols file included in the debug symbols package. This happened with packages downloaded from http://ftp.osuosl.org/pub/mariadb/repo/. Hopefully the lack of debug symbols doesn't render the stack trace useless...
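The build ID comparison can be sketched as follows. The code extracts the GNU build ID from the text printed by `readelf -n` and derives the path where a separate debug file is conventionally installed (`.build-id/<first two hex digits>/<remainder>.debug` under the debug root). The build ID shown is invented, and `/usr/lib/debug` as the debug root is an assumption about the package layout.

```python
import re
from pathlib import PurePosixPath
from typing import Optional

# Invented sample in the shape printed by `readelf -n <binary>`.
SAMPLE = """\
Displaying notes found in: .note.gnu.build-id
  Owner                Data size        Description
  GNU                  0x00000014       NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 0123456789abcdef0123456789abcdef01234567
"""


def extract_build_id(readelf_notes: str) -> Optional[str]:
    """Return the hex build ID found in `readelf -n` output, or None."""
    m = re.search(r"Build ID:\s*([0-9a-fA-F]+)", readelf_notes)
    return m.group(1).lower() if m else None


def expected_debug_path(build_id: str, root: str = "/usr/lib/debug") -> PurePosixPath:
    """Path where a debug file for `build_id` is conventionally installed."""
    return PurePosixPath(root) / ".build-id" / build_id[:2] / (build_id[2:] + ".debug")


if __name__ == "__main__":
    bid = extract_build_id(SAMPLE)
    print(bid)
    print(expected_debug_path(bid))
```

If no file exists at the derived path after installing the debug package, the symbols were built from a different binary, which is the mismatch described above.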

As described in MDEV-25883, I tried setting `mysql-kill_backend_connection_when_disconnect` to `false` in ProxySQL to avoid MDEV-23328. That did not prevent the deadlock.

This is a critical problem for us, leading us to question whether we can continue to use Galera as we head toward production in the coming months. Any help would be appreciated. I stand ready to help debug as needed. Thanks!



 Comments   
Comment by Lex Neva [ 2021-09-13 ]

This may be related to sending write transactions for the same (or nearby) rows to two different members of the Galera cluster.

I know this is not the recommended way of using Galera, and we weren't doing this intentionally. We had 4 clients sending fairly heavy write traffic for one table, with two sending to one Galera node and two sending to another. We were getting the expected "Deadlock, retry transaction" errors from MariaDB, and we intended to fix this at some point. When I realized that this might be related to the cluster freezes, I moved up the priority of the fix.
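For context, the expected deadlock errors are normally handled on the client by retrying the transaction. The sketch below illustrates that pattern only; it is not the fix we rolled out, which was to stop sending these writes to more than one node. `DbError` is a hypothetical stand-in for a driver's exception type, and the retry count and delays are arbitrary. Error number 1213 is MariaDB's `ER_LOCK_DEADLOCK`.

```python
import random
import time

ER_LOCK_DEADLOCK = 1213  # "Deadlock found when trying to get lock; try restarting transaction"


class DbError(Exception):
    """Hypothetical stand-in for a database driver's error type."""

    def __init__(self, errno, msg=""):
        super().__init__(msg)
        self.errno = errno


def run_with_retry(txn, attempts=5, base_delay=0.05, sleep=time.sleep):
    """Call txn(); on a deadlock error, back off with jitter and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DbError as exc:
            # Re-raise anything that is not a deadlock, or the final failure.
            if exc.errno != ER_LOCK_DEADLOCK or attempt == attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            sleep(delay)


if __name__ == "__main__":
    calls = []

    def flaky():
        calls.append(1)
        if len(calls) < 3:
            raise DbError(ER_LOCK_DEADLOCK, "deadlock")
        return "committed"

    print(run_with_retry(flaky), "after", len(calls), "attempts")
```

Retrying only masks the conflicts between nodes; it does not remove them.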

I rolled out the fix last Thursday and we haven't had a cluster freeze since. Based on previous data, we very likely would have experienced multiple freeze-ups since then, so this looks promising.

Obviously we're not supposed to use Galera this way, but I think a full cluster freeze is probably not the expected outcome?

Comment by Lex Neva [ 2021-09-16 ]

I am now fully confident that ceasing to send writes to multiple nodes fixed this problem.

I know for sure that we were sending this kind of traffic to 10.3.24 with the same cluster topology and we did not experience this problem. It seems like something between 10.3.24 and 10.5.9 introduced instability when writes are sent to multiple cluster nodes.
