[MDEV-17247] Memory leak on parallel replicated slave Created: 2018-09-19  Updated: 2023-04-27

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.3.8
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Luke Jolly Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Host - CoreOS
Docker Container - centos:7
Using RPM from MariaDB Repo


Attachments: HTML File Dockerfile     PNG File Screen Shot 2018-09-26 at 3.35.03 PM.png     HTML File compile_install     HTML File crash_log     HTML File global_status     HTML File global_variables_redacted     HTML File snapshot_1537811421     HTML File snapshot_1537963221     HTML File snapshot_1538417420     HTML File status     HTML File valgrind_error     HTML File variables_redacted    
Issue Links:
Blocks
is blocked by MDEV-17471 parallel replication mutex acquision ... Open
Relates
relates to MDEV-17471 parallel replication mutex acquision ... Open

 Description   

[ISSUE]:
I have been encountering what appears to be an unbounded memory leak on parallel replication slaves running 10.3.8. Memory usage will seem completely stable for weeks. It seems to be caused by a specific workload/type of query, but I have not been able to identify which. It always occurs during heavy load on the master, but not every time the master is loaded. When the problem does manifest, memory usage grows very rapidly. It does not occur on the master the slaves are replicating from. Restarting MariaDB is the only way I have found so far to free up the memory. I have attached the statuses and variables.

[BELOW IS NO LONGER RELEVANT, LEFT FOR GOOGLE SEARCH HITS]:
In an attempt to debug what is going on I have compiled MariaDB with the BUILD/compile-amd64-valgrind-max script in docker. I have attached the Dockerfile to show the build process. I start valgrind and mariadb with the following command:

valgrind --tool=massif --massif-out-file=/var/log/mysql/massif.out.%p /usr/local/mysql/bin/mysqld

However, when I shut down MariaDB with a signal or with the SHUTDOWN command, valgrind errors out and does not write a massif file. I have attached that error log as well. It seems that MariaDB may have been built incorrectly? I'm not sure what I'm doing wrong and am reaching the end of my debugging abilities. I have attached a log called compile_install containing the output from the build. I'm not sure what to look for in it, though.

Another thing that might factor in is that I built the docker image on a host system running "Linux 4.18.7-arch1-1-ARCH" and am attempting to run it on a host running "Linux 4.14.67-coreos".



 Comments   
Comment by Luke Jolly [ 2018-09-19 ]

I just tried running the image on the same system I built it on and got the same valgrind error on exit.

Comment by Luke Jolly [ 2018-09-19 ]

I was finally able to get it to write out the snapshots by running it with this command:

valgrind --tool=massif --vgdb=yes --massif-out-file=/var/log/mysql/massif.out.%p --time-unit=ms --max-snapshots=200 /usr/local/mysql/bin/mysqld

And then triggering using:

vgdb all_snapshots [<filename>]

I will try to get a snapshot of it happening and post it here.

Comment by Luke Jolly [ 2018-09-24 ]

I uploaded a valgrind snapshot (snapshot_1537811421) hopefully showing the memory issue I'm experiencing in production. I restored a backup of our slave database from less than an hour before a workload hit the master that caused memory to spike on the production slave. I started the restored slave and let it work through the backlog of master events over the weekend. The compiled version running under valgrind is so much less performant that it took all weekend to work through about 40 minutes of binlogs. This is enough, however, to be well into the event. I will also upload a snapshot with the slave stopped, showing that memory usage stays about the same, once I have it. snapshot_1537811421

I'm not sure what else to do to debug this. Any suggestions or input would be much appreciated.

Comment by Andrei Elkin [ 2018-09-25 ]

Luke, hi.

Thanks for trying to investigate it!

You wrote
> Restarting MariaDB is the only solution to free up memory I have found so far

I wonder if merely restarting the slave service would also release it? That would put the blame
squarely on the (parallel) slave applier.

Could you please check it out?

Andrei

Comment by Luke Jolly [ 2018-09-25 ]

Stopping the slave frees up some memory but doesn't fix what I perceive to be the leak. Memory will spike roughly 20-30 GB, and stopping the slave will free up around 8 GB (those aren't exact numbers, but I saw this behavior many times). I ran STOP SLAVE yesterday on the valgrind instance I'm running and it still hasn't stopped... once it finally does I will attach the post-stop snapshots.

Comment by Luke Jolly [ 2018-09-26 ]

Well, it seems you may be correct; in this test instance it looks like a lot more memory was cleaned up than I thought. Attached a post-STOP SLAVE memory snapshot. snapshot_1537963221

Comment by Luke Jolly [ 2018-09-26 ]

So this shows what I said yesterday: running 'stop slave; start slave;' does not fix the problem. The point where memory drops a few GB on the graph is when the slave threads exited. Upon starting the slave again, almost 20 GB of memory was consumed.

Comment by Luke Jolly [ 2018-10-01 ]

Here's another snapshot, having now, I believe, worked through the entire period of binlogs that initially caused the memory spike on our production slave. snapshot_1538417420

Another thing of note is that htop says MariaDB is using 54.0 GB of memory, while if I'm reading the valgrind dump correctly it has only detailed 32.7 GB of memory usage, which seems like quite a discrepancy.

Comment by Andrei Elkin [ 2018-10-02 ]

In the latest snapshot_1538417420, a graph of mem_heap_B as a function of snapshot number shows two steep curves. The first reaches about 350 GB at around snapshot 50 and drops
to about 6 GB (must be STOP SLAVE), then the 2nd one follows (must be START SLAVE),
repeating the pattern but reaching a higher maximum. No further STOP SLAVE followed, as I understand it, correct?
If there were a 2nd STOP SLAVE, could we expect a drop back to about 6 GB?

Comment by Luke Jolly [ 2018-10-02 ]

The one drop in memory is indeed from a STOP SLAVE. I can run STOP SLAVE again and see what happens. The thing that worries me is that this valgrind'ed instance does not seem to behave exactly like the production slaves. I believe the most memory I have ever been able to "recover" in production with STOP SLAVE has been 8 GB. This test instance recovered 2x that.

I have run STOP SLAVE on it; it will probably take at least until tomorrow for it to actually stop.

Comment by Luke Jolly [ 2018-10-02 ]

What did you mean by text_mem_heap_B?

Comment by Andrei Elkin [ 2018-10-03 ]

Sorry, that was a confusing leftover from my editing. I meant mem_heap_B (which alone should be sufficient
for our purpose).

Comment by Luke Jolly [ 2018-10-03 ]

Well, I don't think I'm going to be able to get any more useful information out of my test valgrind instance. It crashed before the slave thread was able to exit. Here's the crash log: crash_log

Comment by Andrei Elkin [ 2018-10-16 ]

Thanks for the crash report! I will try to take a closer look at what is actually a deadlock in mutex acquisition.

Comment by Kristian Nielsen [ 2018-11-25 ]

One way this could occur is if the slave gets blocked for some reason (e.g. a long-running ALTER TABLE) and @@slave_parallel_max_queued is set too high:

https://mariadb.com/kb/en/library/replication-and-binary-log-system-variables/#slave_parallel_max_queued

Comment by Luke Jolly [ 2018-11-28 ]

slave_parallel_threads = 16
slave_parallel_max_queued = 104857600

If I understand things correctly, this should only be able to lead to 1600 MB plus some overhead of memory used. This is nowhere near the memory usage spikes we were seeing (20-30 GB).
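The arithmetic above can be sketched as follows (a back-of-envelope check, assuming the limit is enforced per worker thread, as the variable's documentation describes):

```shell
# slave_parallel_max_queued is a per-worker limit, so the worst-case
# amount of queued relay-log data with these settings is:
threads=16                     # slave_parallel_threads
max_queued=104857600           # slave_parallel_max_queued, in bytes (100 MiB)
total_mib=$(( threads * max_queued / 1048576 ))
echo "up to ${total_mib} MiB of queued relay-log events"
# prints "up to 1600 MiB of queued relay-log events"
```

This only bounds the raw relay-log bytes queued, not the total allocation of the server, which is part of the discrepancy discussed below.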

Comment by Kristian Nielsen [ 2018-11-28 ]

The limit of 1600 MB is in terms of event size as stored in the binlog/relaylog. The actual memory usage is somewhat different (probably higher) once each event is converted to an in-memory C++ object. I agree that a factor of 10+ difference sounds like a lot, and it's confusing that slave_parallel_max_queued does not directly reflect actual memory usage. But still, the most likely explanation is that the worker threads got behind and there was enough load to fill the queues completely with 1.6 GB of relaylog data (and even larger RAM usage).

Queueing 100 MB of relay log events per thread sounds like huge overkill unless there is a very special reason. 100 kB would be a more usual number, perhaps.
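Kristian's suggestion could be applied with a configuration change along these lines (a sketch only, not a verified fix for this bug; 102400 bytes is the 100 kB he mentions):

```
[mysqld]
slave_parallel_threads    = 16
slave_parallel_max_queued = 102400   # 100 kB per worker, down from 100 MB
```

The variable is also dynamic, so it should be adjustable at runtime with SET GLOBAL slave_parallel_max_queued = 102400; without a server restart.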

Generated at Thu Feb 08 08:35:01 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.