[MXS-2433] infinite memory usage, possible leak Created: 2019-04-13  Updated: 2019-04-19  Resolved: 2019-04-17

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 2.3.5
Fix Version/s: 2.3.7

Type: Bug Priority: Major
Reporter: Tim Westervoorde Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MXS-2430 A possible memory leak in version Closed
Sprint: MXS-SPRINT-80

 Description   

Under 2.3.5 the memory usage keeps increasing over time, until the machine finally goes OOM.
I did manage to make a core dump of the process, but running it under valgrind for a long time isn't something I managed to do.

The config is as follows:

[Read-Write-Service]
type=service
router=readwritesplit
servers=db1,db2,db3
user=maxscale
password=XXX
connection_keepalive=30
#max_sescmd_history=50
disable_sescmd_history=true
master_accept_reads=false
master_failure_mode=fail_on_write
enable_root_user=true
#master_reconnection=true
#prune_sescmd_history=true
#session_track_trx_state=true
causal_reads=true
causal_reads_timeout=2

As you can see, I already disabled the sescmd history to rule this out. The core dump is quite large and I cannot upload it publicly, as I'm not sure it doesn't contain any sensitive data, but if there is anything I can do with it to help out, please let me know.



 Comments   
Comment by markus makela [ 2019-04-15 ]

Can you add the configurations for any monitors or filters you use?

Comment by Tim Westervoorde [ 2019-04-15 ]

The rest of the config is as follows:

[maxscale]
threads=auto
local_address=10.10.x.x
 
[db1]
type=server
address=10.10.x.x
port=3306
protocol=MariaDBBackend
proxy_protocol=on
persistpoolmax=50
persistmaxtime=900
 
[db2]
type=server
address=10.10.x.x
port=3306
protocol=MariaDBBackend
proxy_protocol=on
persistpoolmax=50
persistmaxtime=900
 
[db3]
type=server
address=10.10.x.x
port=3306
protocol=MariaDBBackend
proxy_protocol=on
persistpoolmax=50
persistmaxtime=900
 
[Galera-Monitor]
type=monitor
module=galeramon
servers=db1,db2,db3
user=maxscale
password=xxx
monitor_interval=2000
available_when_donor=true
disable_master_failback=true
 
[Read-Write-Listener]
type=listener
service=Read-Write-Service
protocol=MariaDBClient
port=3306

Comment by Tim Westervoorde [ 2019-04-15 ]

I just went through the core dump and it seems almost all memory is used by large prepared statements. So it looks like these aren't being freed somewhere, somehow.

Comment by markus makela [ 2019-04-15 ]

If you add query_classifier_cache_size=0 under the [maxscale] section, does it have any effect?
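
Spelled out against the [maxscale] section posted above, the suggested change would be:

[maxscale]
threads=auto
local_address=10.10.x.x
query_classifier_cache_size=0

Setting the size to 0 disables the query classifier cache entirely.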

Comment by Tim Westervoorde [ 2019-04-15 ]

Within a few minutes a large cron job with most of these prepared statements will run. I've directed them to a MaxScale instance running under valgrind and will see if that triggers something. Whether it does or doesn't, I'll try your suggestion afterwards.

Comment by Tim Westervoorde [ 2019-04-15 ]

Before and after my cron job:

sending command leak_check definiteleak any to pid 95136
==95136== LEAK SUMMARY:
==95136==    definitely lost: 0 bytes in 0 blocks
==95136==    indirectly lost: 0 bytes in 0 blocks
==95136==      possibly lost: 397,337 bytes in 535 blocks
==95136==    still reachable: 445,963 bytes in 5,378 blocks
==95136==         suppressed: 0 bytes in 0 blocks

sending command leak_check definiteleak any to pid 95136
==95136== LEAK SUMMARY:
==95136==    definitely lost: 0 bytes in 0 blocks
==95136==    indirectly lost: 0 bytes in 0 blocks
==95136==      possibly lost: 765,706 bytes in 570 blocks
==95136==    still reachable: 453,939 bytes in 5,435 blocks
==95136==         suppressed: 0 bytes in 0 blocks

I'll try them again at 15:10 with query_classifier_cache_size=0

Comment by markus makela [ 2019-04-15 ]

Looks like it's not leaking memory directly, but it does seem to retain a lot more than is needed. If it is the query classifier cache, that could explain the increase, but it will limit itself to 40% of available memory by default.

Comment by Tim Westervoorde [ 2019-04-15 ]

sending command leak_check definiteleak any to pid 96986
==96986== LEAK SUMMARY:
==96986==    definitely lost: 0 bytes in 0 blocks
==96986==    indirectly lost: 0 bytes in 0 blocks
==96986==      possibly lost: 397,337 bytes in 535 blocks
==96986==    still reachable: 443,970 bytes in 5,365 blocks
==96986==         suppressed: 0 bytes in 0 blocks

That seemed to fix the memory usage. I'll see if I can roll this out to production.

Comment by Tim Westervoorde [ 2019-04-15 ]

Hmm, it didn't stop at 40% but completely OOMed the server. I'll see what happens on the primary cluster if I run 2.3.5 with query_classifier_cache_size=0.

Comment by markus makela [ 2019-04-15 ]

The MaxScale log should state on startup how much memory it uses for the query classifier cache. If you can check this, it would help.

Comment by Tim Westervoorde [ 2019-04-15 ]

Query classification results are cached and reused. Memory used per thread: 396.58MiB
MaxScale started with 2 worker threads, each with a stack size of 8388608 bytes

So that could account for up to ~800M of extra memory on the machine's 2G of memory. I'll dig deeper into the core dump to see if there is more interesting data later in the file.
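
The arithmetic behind that estimate can be sketched as follows (the even per-thread split of the default 40% cap is an assumption based on this thread, not taken from the actual implementation):

```python
# Sketch of the cache-size arithmetic reported above; the even per-thread
# split is assumed, and it only roughly matches the logged 396.58MiB figure.
total_mem_mib = 2 * 1024      # the machine has 2G of memory
threads = 2                   # "MaxScale started with 2 worker threads"
cache_fraction = 0.40         # default: 40% of available memory

per_thread_mib = total_mem_mib * cache_fraction / threads
total_cache_mib = per_thread_mib * threads

print(per_thread_mib)   # 409.6 -- same ballpark as the logged 396.58MiB
print(total_cache_mib)  # 819.2 -- i.e. the "~800M of extra memory"
```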

Comment by markus makela [ 2019-04-15 ]

That looks like the correct amount of memory: 800M is roughly 40% of the available total memory. You can also limit it to a specific value, e.g. query_classifier_cache_size=200M.

Comment by Tim Westervoorde [ 2019-04-16 ]

Disabling the query classifier cache seems to fix my memory issue. It has been running stably for some hours now. So it seems the limit isn't honored somehow.

Comment by markus makela [ 2019-04-16 ]

Based on the valgrind output and your description of the system, it does appear to respect the default value of 40% of total system memory. Did you try reducing it by adding query_classifier_cache_size=200M?

Comment by Tim Westervoorde [ 2019-04-16 ]

The valgrind output was after 5 minutes, not after a long time (I cannot run MaxScale under valgrind for a long time; I get duplicate fd errors if I do).

Is there anything I can pull out of the core dump to see what the memory usage of the query_classifier_cache is?

Kind regards,
Tim

Comment by markus makela [ 2019-04-16 ]

You can check whether the limitation mechanism works by lowering the cache size to a very low value e.g. query_classifier_cache_size=50M. If you can try this without valgrind, we'd be able to test it more.

Comment by Tim Westervoorde [ 2019-04-17 ]

Will do. I've switched to another instance with a 50M max and will monitor it for the next few hours. The instance with cache_size=0 has been running with stable memory usage for ~1 day.

Comment by markus makela [ 2019-04-17 ]

Turns out this is not really a bug, but the query classifier cache size should definitely be reduced, as 40% of total system memory is a bit overkill in most cases. For starters, it could be reduced to 20% to prevent excessive memory use, and all super-large packets could be excluded from addition to the cache.

Comment by markus makela [ 2019-04-17 ]

Upon further review, a cache size of 15% as the default value seems reasonable for the majority of use cases.

Comment by Tim Westervoorde [ 2019-04-17 ]

There does seem to be something wrong. The instance with query_classifier_cache_size=0M was stable at ~250MB of memory used.
The instance with query_classifier_cache_size=50M is slowly growing over time; after six hours it is now at ~395MB. That doesn't seem right.

Comment by Tim Westervoorde [ 2019-04-18 ]

Currently at ~450MB. The lower query_classifier_cache_size=50M seems to slow down the memory growth a lot, but it is still growing. I'll see if it flattens out today. That would suggest the cap isn't global among the workers but per worker.

Comment by markus makela [ 2019-04-18 ]

To verify that, you could reduce and/or increase the number of threads. If the memory use remains the same even with a different thread count, the extra memory use would be caused by some other cache filling up.
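
As a concrete sketch of that check (the thread count here is hypothetical; the 50M limit matches the value already under test):

[maxscale]
threads=4
query_classifier_cache_size=50M

If the cap is applied per worker rather than globally, going from 2 to 4 threads should roughly double the memory the cache retains; if memory use stays flat, the growth comes from something else.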

Comment by Tim Westervoorde [ 2019-04-18 ]

Will do that. I doubled the number of threads to see if memory use increases even more. It is somehow related to query_classifier_cache_size, because setting query_classifier_cache_size=0M keeps the memory stable.

Comment by Tim Westervoorde [ 2019-04-19 ]

At first it seemed to remain stable at ~450MB, but then it suddenly took another step upwards to ~550MB.

Comment by Tim Westervoorde [ 2019-04-19 ]

It dropped back to 450MB, so it seems it remains stable. However, query_classifier_cache_size=0M and query_classifier_cache_size=50M differ by roughly 200M of memory.

Generated at Thu Feb 08 04:14:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.