[MXS-2433] infinite memory usage, possible leak Created: 2019-04-13 Updated: 2019-04-19 Resolved: 2019-04-17 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | Core |
| Affects Version/s: | 2.3.5 |
| Fix Version/s: | 2.3.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | Tim Westervoorde | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Sprint: | MXS-SPRINT-80 |
| Description |
|
Under 2.3.5 the memory usage keeps increasing over time until the machine finally goes OOM. The config is as follows;
as you can see, I have already disabled the sescmd history to rule that out. The core dump is quite large and I cannot upload it publicly, as I'm not sure it doesn't contain any sensitive data, but if there is anything I can do with it to help out, please let me know. |
| Comments |
| Comment by markus makela [ 2019-04-15 ] |
|
Can you add the configurations for any monitors or filters you use? |
| Comment by Tim Westervoorde [ 2019-04-15 ] |
|
The rest of the config is as follows |
| Comment by Tim Westervoorde [ 2019-04-15 ] |
|
I just went through the core dump and it seems almost all memory is used by large prepared statements. So it looks like these aren't being freed somewhere. |
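A rough way to confirm that suspicion (the core file path here is hypothetical) is to scan the dump for printable SQL text and see how much of it looks like prepared-statement payload:

```shell
# Sketch, assuming the core was written to /var/crash/core.maxscale:
# count how many printable strings in the dump look like SQL statements.
strings /var/crash/core.maxscale | grep -cE '^(PREPARE|SELECT|INSERT|UPDATE)'
```

A large count relative to the number of distinct statements the application actually prepares would suggest copies are being retained.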
| Comment by markus makela [ 2019-04-15 ] |
|
If you add query_classifier_cache_size=0 under the [maxscale] section, does it have any effect? |
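For reference, the suggested change would sit in the global section of the MaxScale configuration like this (the threads line is only a placeholder for whatever the section already contains):

```ini
[maxscale]
threads=auto
query_classifier_cache_size=0
```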
| Comment by Tim Westervoorde [ 2019-04-15 ] |
|
Within a few minutes a large cron job with most of these prepared statements will run; I've directed it to a MaxScale instance running under valgrind and will see if that triggers something. Either way, I'll try your suggestion afterwards. |
| Comment by Tim Westervoorde [ 2019-04-15 ] |
|
Before and after my cron:
I'll run them again at 15:10 with query_classifier_cache_size=0 |
| Comment by markus makela [ 2019-04-15 ] |
|
Looks like it's not leaking memory directly, but it does seem to retain a lot more than is needed. If it is the query classifier cache, that could explain the increase, but it will limit itself to 40% of available memory by default. |
| Comment by Tim Westervoorde [ 2019-04-15 ] |
|
That seemed to fix the memory usage. I'll see if I can roll this out to production. |
| Comment by Tim Westervoorde [ 2019-04-15 ] |
|
Hmm, it didn't stop at 40% but completely OOMed the server. I'll see what happens on the primary cluster if I run 2.3.5 with query_classifier_cache_size=0. |
| Comment by markus makela [ 2019-04-15 ] |
|
The MaxScale log should tell you on startup how much memory it uses for the query classifier cache. If you can check this, it would help. |
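A quick way to find that startup line (the log path and the exact message wording are assumptions; adjust to your install):

```shell
# Look for the startup message that reports the query classifier cache size
grep -i 'query classifier' /var/log/maxscale/maxscale.log
```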
| Comment by Tim Westervoorde [ 2019-04-15 ] |
|
So that could account for up to ~800M of extra memory on the machine's 2G. I'll dig deeper into the core dump to see if there is more interesting data later in the file. |
| Comment by markus makela [ 2019-04-15 ] |
|
That looks like the correct amount of memory: 800M is close to 40% of the total available memory. You can also limit it to a specific value, e.g. query_classifier_cache_size=200M. |
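To check the arithmetic on this 2 GiB machine: the default 40% cap works out to roughly the ~800M observed.

```shell
# 40% of 2048 MiB (integer arithmetic)
echo $(( 2048 * 40 / 100 ))   # prints 819
```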
| Comment by Tim Westervoorde [ 2019-04-16 ] |
|
Disabling the query classifier cache seems to fix my memory issue; it has been running stably for some hours now. So it seems the limit isn't honored somehow. |
| Comment by markus makela [ 2019-04-16 ] |
|
Based on the valgrind output and your description of the system, it does appear to respect the default value of 40% of total system memory. Did you try reducing it by adding query_classifier_cache_size=200M? |
| Comment by Tim Westervoorde [ 2019-04-16 ] |
|
The valgrind output was from after 5 minutes, not after a long run (I cannot run MaxScale under valgrind for long; I get duplicate fd errors if I do). Is there anything I can pull out of the core dump to see what the memory usage of the query_classifier_cache is? |
| Comment by markus makela [ 2019-04-16 ] |
|
You can check whether the limiting mechanism works by lowering the cache size to a very low value, e.g. query_classifier_cache_size=50M. If you can try this without valgrind, we'd be able to test it further. |
| Comment by Tim Westervoorde [ 2019-04-17 ] |
|
Will do; I've switched another instance to a 50M cap and will monitor it for the next few hours. The instance with query_classifier_cache_size=0 has been running with stable memory usage for ~1 day. |
| Comment by markus makela [ 2019-04-17 ] |
|
It turns out this is not really a bug, but the query classifier cache size should definitely be reduced, as 40% of total system memory is overkill in most cases. For starters, it could be reduced to 20% to prevent excessive memory use, and very large packets could be excluded from the cache. |
| Comment by markus makela [ 2019-04-17 ] |
|
Upon further review, a default cache size of 15% seems reasonable for the majority of use cases. |
| Comment by Tim Westervoorde [ 2019-04-17 ] |
|
There does seem to be something wrong: the instance with query_classifier_cache_size=0M was stable at ~250MB of memory used. |
| Comment by Tim Westervoorde [ 2019-04-18 ] |
|
Currently at ~450MB. The lower query_classifier_cache_size=50M seems to slow down the memory growth a lot, but it is still growing; I'll see if it flattens out today. That would suggest the cap isn't global across the workers but per worker. |
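If the cap really is per worker rather than global, the totals add up quickly; a sketch with an assumed thread count:

```shell
# Hypothetical: 4 worker threads, each honoring its own 50 MiB cap
threads=4
cache_mb=50
echo $(( threads * cache_mb ))   # prints 200 (MiB total, not 50)
```

A total of 200 MiB would also line up with the ~200M gap later observed between the 0M and 50M configurations.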
| Comment by markus makela [ 2019-04-18 ] |
|
To verify that, you could reduce and/or increase the number of threads. If the memory use remains the same even with a different thread count, the extra memory would be caused by some other cache filling up. |
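One simple way to run that experiment (the process name maxscale is assumed) is to sample the process's resident set size before and after changing the thread count:

```shell
# Print the resident set size (KiB) of a process by PID,
# e.g. rss_kb "$(pidof maxscale)"
rss_kb() { ps -o rss= -p "$1" | tr -d ' '; }
rss_kb "$$"   # sampling the current shell here as a stand-in
```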
| Comment by Tim Westervoorde [ 2019-04-18 ] |
|
Will do; I've doubled the number of threads to see if it increases further. It is somehow related to query_classifier_cache_size, because setting query_classifier_cache_size=0M keeps the memory stable. |
| Comment by Tim Westervoorde [ 2019-04-19 ] |
|
At first it seemed to remain stable at ~450MB, but then it suddenly took another step up to ~550MB. |
| Comment by Tim Westervoorde [ 2019-04-19 ] |
|
It dropped back to 450MB, so it seems it remains stable. However, query_classifier_cache_size=0M and query_classifier_cache_size=50M differ by roughly 200M of memory. |