[MDEV-21459] non-Galera deadlock causes leaked WSREP transaction Created: 2020-01-10 Updated: 2023-06-06 Resolved: 2023-06-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, wsrep |
| Affects Version/s: | 10.2.30 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Eric Hontz | Assignee: | Seppo Jaakola |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | Memory_leak, galera, leak, wsrep | ||
| Description |
|
As the correlation_between_mysqld_memory_and_wsrep_open_transactions.jpg attachment shows, memory usage of the mysqld process increases indefinitely. One of our 3 nodes (rtp-arkt-ds01) receives the majority of write operations, and that is the node that sees the most rapid memory growth. Only a restart of mysqld releases the memory, so we have to restart every few months. When restarting mysqld, I noticed a high number reported for wsdb trx map usage in the log of the high-memory-usage node:
The wsrep_open_transactions status variable reports the same number, and there is a strong correlation between the memory usage of mysqld and wsrep_open_transactions (see correlation_between_mysqld_memory_and_wsrep_open_transactions.jpg).
The other 2 nodes in the cluster receive write operations at a much lower rate than rtp-arkt-ds01. If I take one of them, plot the time derivative of wsrep_open_transactions, and compare it with the application log entries that record non-Galera deadlocks, there is a strong correlation: wsrep_open_transactions increases each time a non-Galera deadlock occurs. (By "non-Galera deadlock", I mean a deadlock resulting from competing writes to the same node.) This correlation is shown in the correlation_between_wsrep_open_transactions_and_non-galera_deadlocks.jpg attachment.
Given the correlation between mysqld memory usage and wsrep_open_transactions, along with the correlation between wsrep_open_transactions and non-Galera deadlocks, I suspected that the non-Galera deadlocks were causing the memory leak. After some searching, I found
Here is a modified version of the galera.MW-328E test that shows wsrep_open_transactions is nonzero after a non-Galera deadlock. Note that, just like galera.MW-328E, it's using two connections to the same node for the queries; it's not issuing any queries to the second node.
Unfortunately, we only retain metric data for a few months, so I'm not exactly sure when this issue started. It may have always been present, and we may have just started noticing it after changes to our application resulted in more non-Galera deadlocks. I can say that the memory consumption issue has been present since at least MariaDB 10.2.19. |
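The correlation check described in the report (the time derivative of wsrep_open_transactions compared against logged non-Galera deadlocks) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the sample values and deadlock timestamps are invented for the example and are not taken from the reporter's metrics, and a plain Pearson coefficient stands in for whatever comparison was actually used to produce the attached plots.

```python
from datetime import datetime, timedelta


def increments(samples):
    """Discrete time derivative: change in the counter between consecutive samples.

    `samples` is a time-ordered list of (timestamp, wsrep_open_transactions).
    """
    return [v1 - v0 for (_, v0), (_, v1) in zip(samples, samples[1:])]


def deadlocks_per_interval(samples, deadlock_times):
    """Count logged deadlocks falling inside each (t0, t1] sampling interval."""
    counts = []
    for (t0, _), (t1, _) in zip(samples, samples[1:]):
        counts.append(sum(1 for d in deadlock_times if t0 < d <= t1))
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient; NaN if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


# Invented illustrative data (NOT real measurements): one sample per minute.
start = datetime(2020, 1, 1, 0, 0, 0)
samples = [
    (start + timedelta(minutes=i), value)
    for i, value in enumerate([10, 10, 12, 12, 13])
]
# Invented deadlock timestamps, as they might be parsed from application logs.
deadlock_times = [
    start + timedelta(minutes=1, seconds=30),
    start + timedelta(minutes=1, seconds=48),
    start + timedelta(minutes=3, seconds=30),
]

deltas = increments(samples)
counts = deadlocks_per_interval(samples, deadlock_times)
r = pearson(deltas, counts)
print(deltas, counts, r)
```

If every deadlock leaks exactly one open transaction, the per-interval increase should equal the per-interval deadlock count and the coefficient should be close to 1; a weaker coefficient on real data would suggest that other events also change the counter.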
| Comments |
| Comment by Eric Hontz [ 2020-02-04 ] |
|
Hi @jplindst. Is there any further information you need from me? |
| Comment by Jan Lindström [ 2023-06-06 ] |
|
10.2 is EOL. |