[MDEV-34493] memory leak when using galera node with only replication thread - Jira

Details

Type: Bug
Status: Needs Feedback (View Workflow)
Priority: Critical
Resolution: Unresolved
Affects Version/s: 10.6.16
Fix Version/s: 10.6
Component/s: Galera, Server
Labels:
None
Environment:
redhat 8 on vmware

Description

we are getting memory leak in one DB server.

situation:

our architecture is 3 nodes galera cluster - 2 DB nodes + 1 arbitrator in different locations. Plus running haproxy as the front end
haproxy redirects traffic to one DB node in normal stage. Another DB node is configured as backup in haproxy
the backup DB node used up its operating system available memory in 1 or 2 days. The DB node is running on 64GB memory VM and 8GB buffer pool
after available memory used up in backup DB node, OS killed the DB process and available memory got released
the backup DB node got 128 galera replication threads

Since haproxy does not route traffic to backup DB node in normal situation, only replication and Percona PMM are using backup DB node.

Screenshot:
uploaded top command output, keeping decreasing available memory and output of memory_summary_global_by_event_name and memory_summary_by_thread_by_event_name

Question:
Any method to continue the troubleshooting? Kindly advise "memory/sql/thd::main_mem_root" is related to the memory leak or not.

Regards,
William Wong

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

available-memory-keeps-decreaing.png
713 kB
2024-06-28 17:59
Investigation-2025-04-03.txt
8 kB
2025-04-08 04:39
main_mem_root-at-session-level.png
831 kB
2024-06-28 17:59
node2 memory_usage_pattern.xlsx
347 kB
2025-03-11 07:07
top.png
360 kB
2024-06-28 17:59
top-memory_summary_global_by_event_name.png
557 kB
2024-06-28 17:59

Activity

Ascending order - Click to sort in descending order

View 6 older comments

Lawrence Man added a comment - 2025-03-17 06:56

Hi support team,

A high level summary of the problem:

At 1:xx a.m. daily, some transactions take place at node 1, with select statements involved and transaction isolation level are repeatable read.

Node 2 slave thread memory consumption bumped up by ~2GB per day during this batch job execution, and such memory is not released.

We are unable to capture what was running at node 2. The information_summary.processlist.info_binary is NULL throughout the observation period.

Thanks and best regards,

Lawrence

Lawrence Man added a comment - 2025-03-17 06:56 Hi support team, A high level summary of the problem: At 1:xx a.m. daily, some transactions take place at node 1, with select statements involved and transaction isolation level are repeatable read. Node 2 slave thread memory consumption bumped up by ~2GB per day during this batch job execution, and such memory is not released. We are unable to capture what was running at node 2. The information_summary.processlist.info_binary is NULL throughout the observation period. Thanks and best regards, Lawrence

Jan Lindström added a comment - 2025-03-17 13:14

frelist Is wsrep-slave-thread=128 in every node ?

Jan Lindström added a comment - 2025-03-17 13:14 frelist Is wsrep-slave-thread=128 in every node ?

Lawrence Man added a comment - 2025-03-26 05:53

Yes, but we recently we have reduced it match with the # of vCPUs and the situation remains the same.

Lawrence Man added a comment - 2025-03-26 05:53 Yes, but we recently we have reduced it match with the # of vCPUs and the situation remains the same.

Lawrence Man added a comment - 4 days ago

I got a small reproducible case as in the attached file:

1. A table called t2 is created.
2. An insert w/ 10mil records via the stored procedure InsertIntoLknmT2_bind_savept() caused increase in node 2's memory utilization, whenever the SP was executed for the first & second time.
3. An SP Update_LknmT2_bind_savept was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit.
4. An SP Update_LknmT2_bind was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit.

It appears that the committing after running large batch of INSERT or UPDATE can lead to such situation, regardless of the presence of savepoint / release savepoint.

Thanks and best regards,

Lawrence

Investigation-2025-04-03.txt

Lawrence Man added a comment - 4 days ago I got a small reproducible case as in the attached file: 1. A table called t2 is created. 2. An insert w/ 10mil records via the stored procedure InsertIntoLknmT2_bind_savept() caused increase in node 2's memory utilization, whenever the SP was executed for the first & second time. 3. An SP Update_LknmT2_bind_savept was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit. 4. An SP Update_LknmT2_bind was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit. It appears that the committing after running large batch of INSERT or UPDATE can lead to such situation, regardless of the presence of savepoint / release savepoint. Thanks and best regards, Lawrence Investigation-2025-04-03.txt

Lawrence Man added a comment - 6 hours ago

In the previously attached file, the same insert / update is executed 10mil times, in a single transaction. The SQLs are written using bind variable. In this situation, will the memory used to parse the SQL be allocated for the first SQL, and re-used for the rest of SQLs within that same transaction, or will the memory used to parse the SQL be multiplied by 10mil times, since they are in the same transaction?

I raise the above query because it can be observed that there is a sharp rise in memory utilization, even though only a single INSERT/UPDATE SQL skeleton is used.

Lawrence Man added a comment - 6 hours ago In the previously attached file, the same insert / update is executed 10mil times, in a single transaction. The SQLs are written using bind variable. In this situation, will the memory used to parse the SQL be allocated for the first SQL, and re-used for the rest of SQLs within that same transaction, or will the memory used to parse the SQL be multiplied by 10mil times, since they are in the same transaction? I raise the above query because it can be observed that there is a sharp rise in memory utilization, even though only a single INSERT/UPDATE SQL skeleton is used.

MariaDB Server

memory leak when using galera node with only replication thread

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration