[MDEV-28436] Memory leak on slave server
Created: 2022-04-28  Updated: 2022-10-30  Resolved: 2022-10-30

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.5.13, 10.6.7
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Xan Charbonnet
Assignee: Angelique Sklavounos (Inactive)
Resolution: Incomplete
Votes: 0
Labels: None
Environment: Debian Bullseye, MariaDB installed from the MariaDB repositories



 Description   

I have a backup server which is used for making snapshots of a MariaDB database. It slaves off of a MariaDB master. All servers in the system were running 10.3 until recently, when I upgraded the backup to 10.5.

At that time the backup server started to regularly run out of memory. MariaDB gets OOMed by the kernel and I have to restart it quite often.

mysqltuner reports that MariaDB's maximum memory usage should be about 8 GB. The machine has 16 GB of RAM. MariaDB reaches 95%+ memory usage according to top and is eventually killed.

MariaDB on this server doesn't do anything other than replication, so it's hard to say for sure whether this is a replication problem or a "doing anything" problem. But it certainly scales with the amount of replication being performed: if it's catching up on a replication backlog, then it'll fly through memory.

It may be related to replication parallelism (slave_domain_parallel_threads). On 10.3 I was running 8 threads in "aggressive" mode without trouble, and that's what I started with on 10.5 and 10.6. Yesterday I experimented with the settings:

slave_parallel_threads  slave_parallel_mode  slave_retried_transaction / minute  MB RAM leaked / minute
2                       aggressive           106                                 16
8                       none                 0                                   0
8                       minimal              0                                   22
8                       aggressive           221                                 11
20                      minimal              0                                   19
20                      aggressive           420                                 30

These were run over the course of ~10 minutes each while replication was catching up. So it isn't a huge sample, and the test may not have been "fair" in terms of exactly what commands were being executed for each test.
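
The report does not say how the MB/minute figures were obtained. As one possible way to repeat the measurement, the sketch below samples the server's resident set size from /proc on Linux and fits a straight line through the samples. The function names, the 10-minute window, and the 10-second interval are illustrative assumptions, not part of the original test.

```python
import time
from pathlib import Path


def rss_mb(pid):
    """Return the resident set size of `pid` in MB, read from /proc (Linux only)."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1]) / 1024  # the kernel reports this value in kB
    raise RuntimeError(f"VmRSS not found for pid {pid}")


def growth_mb_per_min(samples):
    """Least-squares slope of (elapsed_seconds, rss_mb) samples, in MB per minute."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        raise ValueError("samples must span a nonzero time interval")
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / denom
    return slope * 60  # MB per second -> MB per minute


def collect(pid, duration_s=600, interval_s=10):
    """Sample RSS every `interval_s` seconds for roughly `duration_s` seconds."""
    start = time.monotonic()
    samples = []
    while (elapsed := time.monotonic() - start) <= duration_s:
        samples.append((elapsed, rss_mb(pid)))
        time.sleep(interval_s)
    return samples
```

Calling growth_mb_per_min(collect(pid)) with the PID of the MariaDB server process gives one number per test run. A fitted slope is less sensitive to a single noisy reading than the difference between the first and last sample, but RSS growth alone does not distinguish a true leak from allocator fragmentation.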

The biggest thing to note is the 0 MB/min leaked in "none" mode. Indeed, when I have slave parallelization disabled completely (via slave_parallel_mode=none or slave_domain_parallel_threads=0) it /seems/ that the memory is no longer leaking, or is doing so much more slowly.

(This may or may not be the same as MDEV-27481. That one affects 10.3, which I did not seem to have a problem with.)



 Comments   
Comment by Xan Charbonnet [ 2022-06-14 ]

After much further testing, I believe this is a problem with the system's memory allocator. Switching to jemalloc for Maria seems to have solved it.
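
When switching allocators it can be worth confirming that the running server actually loaded the new library. A minimal sketch, assuming Linux: it scans the text of /proc/<pid>/maps for mapped libraries whose file name mentions a known allocator. The function name and the list of allocator names are illustrative assumptions.

```python
from pathlib import Path


def loaded_allocators(maps_text):
    """Return the known allocator names found in a /proc/<pid>/maps dump."""
    known = ("jemalloc", "tcmalloc")  # substrings to look for; extend as needed
    found = set()
    for line in maps_text.splitlines():
        # Columns: address, perms, offset, dev, inode, pathname (pathname optional)
        fields = line.split(maxsplit=5)
        if len(fields) == 6:
            name = Path(fields[5]).name
            for allocator in known:
                if allocator in name:
                    found.add(allocator)
    return sorted(found)
```

For a live process, pass Path(f"/proc/{pid}/maps").read_text() as the argument; reading another user's maps file usually requires root. An empty result suggests the process is still using the C library's default malloc. MariaDB also reports the allocator it was started with in the version_malloc_library system variable.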

Comment by Xan Charbonnet [ 2022-07-12 ]

Another update: I believe the different memory allocator helped, and in general MariaDB isn't running the system out of memory.

However, when doing a lot of catchup all at once from replication, it is still running out.

Comment by Andrei Elkin [ 2022-07-27 ]

xan@biblionix.com, it would be best if you could share with us your slave's initial data and the binlog to replay, so that we can reproduce the effects you observed. Can that be done?

Comment by Xan Charbonnet [ 2022-07-28 ]

I'm afraid I can't share the actual data, but I'll try to see if I can come up with a constructed example.
