[MDEV-30096] Memory leak 10.3.37 and 10.6.11 Created: 2022-11-25  Updated: 2023-11-28  Resolved: 2022-12-28

Status: Closed
Project: MariaDB Server
Component/s: Prepared Statements
Affects Version/s: 10.3.37, 10.6.11
Fix Version/s: 10.3.38, 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3

Type: Bug Priority: Major
Reporter: Erik Friesen Assignee: Sergei Golubchik
Resolution: Duplicate Votes: 0
Labels: regression
Environment:

VPS server on CentOS 7 x64 with 8GB RAM, cPanel/WHM


Attachments: Text File SELECT_BY_EVENT_NAME11.27.2022-16.19.txt     Text File SELECT_BY_EVENT_NAME11.28.2022-8.45.txt     Text File SHOW_ENGINE_INNODB_STATUS.txt     Text File SHOW_GLOBAL_STATUS.txt     Text File SHOW_GLOBAL_STATUS11.28.2022-8.45.txt     File my.cnf     File webaxess.sql     File webaxess_sessions.sql    
Issue Links:
Duplicate
duplicates MDEV-29988 Major performance regression with 10.... Closed
Relates
relates to MDEV-30848 Memory leak in MariaDB 10.6 Closed

 Description   

Since the server auto updated from 10.3.36 to 10.3.37, as well as after updating from 10.3.37 to 10.6.11, there is a memory leak.

On 10.3.37, top showed 1.3G in use:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND    
14406 mysql     20   0 3179620   1.3g   6132 S   1.2 17.6 152:58.38 mysqld

That didn't match the ~162MB shown by SHOW ENGINE INNODB STATUS:

----------------------
BUFFER POOL AND MEMORY
----------------------
Total large memory allocated 170590208
Dictionary memory allocated 340032
Buffer pool size   8192
Free buffers       1409
Database pages     6100
Old database pages 2255
Modified db pages  0
Percent of dirty pages(LRU & free pages): 0.000
Max dirty pages percent: 75.000
Pending reads 0
Pending writes: LRU 0, flush list 0, single page 0
Pages made young 6, not young 0
0.00 youngs/s, 0.00 non-youngs/s
Pages read 5379, created 721, written 4228656
0.00 reads/s, 0.00 creates/s, 0.00 writes/s
No buffer pool page gets since the last printout
Pages read ahead 0.00/s, evicted without access 0.00/s, Random read ahead 0.00/s
LRU len: 6100, unzip_LRU len: 0
I/O sum[0]:cur[0], unzip sum[0]:cur[0]

Moving to 10.6.11:

Some ps aux | grep mariadb log points:

USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mysql    29577  1.5  7.2 2715668 577652 ?      Ssl  Nov23  21:56 /usr/sbin/mariadbd
mysql    29577  1.5  7.2 2699668 580328 ?      Ssl  Nov23  22:09 /usr/sbin/mariadbd
mysql    29577  1.5  7.2 2708100 584600 ?      Ssl  Nov23  22:27 /usr/sbin/mariadbd
mysql    29577  1.5  7.3 2708232 585152 ?      Ssl  Nov23  22:34 /usr/sbin/mariadbd
mysql    29577  1.5  7.3 2708232 585416 ?      Ssl  Nov23  22:36 /usr/sbin/mariadbd
mysql    29577  2.1 11.8 2720628 945832 ?      Ssl  Nov23  59:49 /usr/sbin/mariadbd
mysql    29577  2.8 17.9 3318176 1437704 ?     Ssl  Nov23 136:12 /usr/sbin/mariadbd

I have done a couple of consecutive memory dumps using the script below, and while I may not understand what to look for, it seems like the growing dump files show data relating to these tables and queries:

SELECT * FROM aercon0_aerweb.webaxess_sessions
             WHERE id = ?

SELECT webfolder, webpassword
             FROM aercon0_aerweb.webaxess
             WHERE webusername = ?

UPDATE aercon0_aerweb.webaxess
                        SET aerware = ?,
                        last_comms = UNIX_TIMESTAMP()
                    WHERE webfolder = ?

Memory dump script from here https://serverfault.com/questions/173999/dump-a-linux-processs-memory-to-file

#!/bin/bash
# Dump every writable private memory region of process $1 with gdb
grep rw-p "/proc/$1/maps" \
| sed -n 's/^\([0-9a-f]*\)-\([0-9a-f]*\) .*$/\1 \2/p' \
| while read -r start stop; do
    gdb --batch --pid "$1" \
        -ex "dump memory $1-$start-$stop.dump 0x$start 0x$stop"
done

I'll try to migrate everything, but here is my original issue - https://dba.stackexchange.com/questions/320126/memory-leak-mariadb-10-3-37



 Comments   
Comment by Daniel Black [ 2022-11-27 ]

Thanks Erik for reporting the bug.

The TIME value "59:49" on the 10.6 measurement is ambiguous. If you've got further 10.6 measurements to confirm the trend, that would be useful.

10.5+ has the performance schema, which I'm not terribly well versed in; however, setting the following at startup will provide memory instrumentation:

performance_schema=on
performance-schema-instrument='memory/%=ON'

After that the following tables/views will be populated:

MariaDB [sys]> show tables from sys like 'mem%';
+-----------------------------------+
| Tables_in_sys (mem%)              |
+-----------------------------------+
| memory_by_host_by_current_bytes   |
| memory_by_thread_by_current_bytes |
| memory_by_user_by_current_bytes   |
| memory_global_by_current_bytes    |
| memory_global_total               |
+-----------------------------------+
5 rows in set (0.001 sec)
 
MariaDB [sys]> show tables from performance_schema like 'mem%';
+-----------------------------------------+
| Tables_in_performance_schema (mem%)     |
+-----------------------------------------+
| memory_summary_by_account_by_event_name |
| memory_summary_by_host_by_event_name    |
| memory_summary_by_thread_by_event_name  |
| memory_summary_by_user_by_event_name    |
| memory_summary_global_by_event_name     |
+-----------------------------------------+
5 rows in set (0.001 sec)

SELECT * FROM performance_schema.memory_summary_global_by_event_name might start to show a leak by area.
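
For example, a query along these lines (a sketch; the column names are those of the performance_schema memory summary tables, and it needs the instrument settings above enabled at startup) should surface the biggest allocators:

```sql
-- Top memory areas by current allocation
SELECT EVENT_NAME,
       CURRENT_NUMBER_OF_BYTES_USED,
       HIGH_NUMBER_OF_BYTES_USED
FROM performance_schema.memory_summary_global_by_event_name
ORDER BY CURRENT_NUMBER_OF_BYTES_USED DESC
LIMIT 10;
```

Comparing two snapshots taken a few hours apart should make a steadily growing area stand out.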

Much appreciate your assistance in narrowing down this problem.

Comment by Erik Friesen [ 2022-11-27 ]

Yes, the time is pretty nebulous. It's growing by about 320MB daily, it appears. My initial guess is that this may well be related to MDEV-29404.

Those prepared statements above are by far the bulk of what happens on this server.

I'll work on getting more data.

Comment by Erik Friesen [ 2022-11-28 ]

I've added some files. I restarted the server at 16:19 and took the second snapshots at 8:45.

ps aux output @ 11.28.2022 8:41 AM is:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
mysql 4822 1.3 6.8 2785864 548716 ? Ssl Nov27 13:19 /usr/sbin/mariadbd

Comment by Sergei Golubchik [ 2022-11-28 ]

It cannot be MDEV-29404, because a comment there says "Verified that this is not a new regression", while yours was not present in 10.3.36.

Can you try to close all connections for a moment and see if the memory would stop growing after they reconnect? That would mean the memory is not truly lost, but is freed on disconnect.

Does your processlist have unusually many connections in the "Starting cleanup" state?

It might be MDEV-29988 (see also duplicates, listed there).

As a workaround you can try to wrap your parameters (question marks) in CONVERT and see if that helps, like in

SELECT webfolder, webpassword
             FROM aercon0_aerweb.webaxess
             WHERE webusername = CONVERT(? USING _charset_of_webusername)

Comment by Erik Friesen [ 2022-11-28 ]

Can you try to close all connections for a moment and see if the memory would stop growing after they reconnect? That would mean the memory is not truly lost, but is freed on disconnect.

I am unable to stop all connections easily, but the one I did reconnect did appear to release about 300MB of memory.

Does your processlist have unusually many connections in the "Starting cleanup" state?

I have a number of items in Sleep, but none in "Starting cleanup".

This is a production server in active use, so I am somewhat limited in my testing options. Also note that I am using 10.6.11 for these present tests.

Comment by Sergei Golubchik [ 2022-11-28 ]

Yes, MDEV-29988 was in 10.6.11 too. But let's first check the preconditions:

  • you use prepared statements
  • you prepare once and then execute many thousands of times before deallocating the statement
  • your statements contain a string comparison between strings in different (but compatible) character sets (e.g. utf8 and latin1).

If all of that matches your use case, you could be affected by MDEV-29988. If you are, then

  • every execution of such a prepared statement will allocate more memory
  • all memory will be freed when the statement is deallocated (explicitly or when the client disconnects)
  • the time to deallocate a prepared statement is proportional to the number of times it was executed

and workarounds could be

  • deallocate prepared statements more often
  • rewrite statements to avoid comparing strings in different character sets

or wait for the next release, which will have it fixed.
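
At the SQL level, the first workaround amounts to something like this (a sketch using server-side prepared-statement syntax; client libraries such as PDO free the same memory when the statement handle is closed; the @u value is hypothetical):

```sql
PREPARE stmt FROM 'SELECT webfolder, webpassword
                   FROM aercon0_aerweb.webaxess
                   WHERE webusername = ?';
SET @u = 'someuser';        -- hypothetical parameter value
EXECUTE stmt USING @u;      -- re-executed many times in the real workload
DEALLOCATE PREPARE stmt;    -- releases the accumulated per-statement memory
```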

Comment by Miroslav Lachman [ 2022-12-28 ]

Is there any update or fix for this memory issue in 10.6.11?
We upgraded our production database servers from 10.3 to 10.6.10, and now we get crashes of mariadbd every day because it eats all memory (OOM).
Then we upgraded from 10.6.10 to 10.6.11, but it behaves the same wrong way. MariaDB is configured to use about 9GB of total memory, but when we run a project import (many queries from a PHP application with PDO and prepared statements from the Nette library), mariadbd uses over 40GB of memory + swap. Then the system kills it because there is no more memory to use.
The same application runs fine on older instances with 10.3 on VMs with about 8-10GB of total RAM.

Comment by Sergei Golubchik [ 2022-12-28 ]

Lachman, yes, it's a duplicate of MDEV-29988, and MDEV-29988 is marked as a blocker, so it will be in the next release.

Comment by MAtteo Valsasna [ 2023-01-12 ]

Memory leak on 10.5.18 (Debian 11) here.

It seems to match quite well the description by Sergei Golubchik above.

Also, information_schema.PROCESSLIST.MEMORY_USED points a finger at emqx 4.1.3, which is indeed always executing the same prepared statement with string comparisons each time an MQTT PUBLISH is received, which is thousands of times.

What doesn't match is that if I kill the emqx mysql connections, no memory is released to the OS (instead, VmSize in /proc/<pid>/status keeps growing at the same rate).

is it supposed to be released to the OS or to a mariadb memory pool?

also, in https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1027337 it seems that the bug was introduced between 10.5.15 and 10.5.18.

can you confirm?

we would like to downgrade to the latest non-affected debian package while we wait for 10.5.19

thanks

MAtteo

Comment by Sergei Golubchik [ 2023-01-12 ]

Introduced in 10.5.18 (there's a "Caused by" link in MDEV-29988).

Other possible workarounds, besides downgrading, could be

  • do not keep the client connection constantly open; reconnect periodically (not too often)
  • avoid the need for an implicit charset conversion: change collation_connection to match the table's collation, or add an explicit CAST
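
Both workarounds can be sketched in SQL (utf8mb4 here is an assumption; substitute the actual character set and collation of the column being compared):

```sql
-- Option 1: make the connection charset/collation match the column,
-- so no implicit conversion is needed
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;

-- Option 2: cast the parameter explicitly to the column's charset
SELECT webfolder, webpassword
FROM aercon0_aerweb.webaxess
WHERE webusername = CAST(? AS CHAR CHARACTER SET utf8mb4);
```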

10.5.19 is planned to be released (see https://jira.mariadb.org) in two weeks

Comment by MAtteo Valsasna [ 2023-01-12 ]

I tried forcing disconnections using mysql KILL (the application reconnects gracefully), but apparently it did not free the memory (at least from the OS perspective), and it did not stop the MariaDB memory growth (so it seems that MariaDB didn't recover the memory either).

Our workload includes at least 20 queries/s to stored procedures with string comparisons in different charsets, which triggered MDEV-29988.

If we downgrade to 10.5.15, could we be exposed to the bug whose fix introduced MDEV-29988 (https://jira.mariadb.org/browse/MDEV-19605)?

thanks

MAtteo

Comment by Mariusz Gronczewski [ 2023-11-28 ]

I'm using 10.11.4 (Debian Bookworm package) and PowerDNS is causing this leak pretty repeatably

echo 'kill user dns' | mysql helps temporarily

I don't think this should be closed; it is neither covered by the tickets it is linked to as a "duplicate" nor fixed.

Comment by Sergei Golubchik [ 2023-11-28 ]

xani, if you're using 10.11.4, it cannot be this leak; it must be something else.

Does MariaDB's internal memory monitoring show the leak? The MEMORY_USED status variable? The MEMORY_USED column in INFORMATION_SCHEMA.PROCESSLIST? The performance schema?
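
For reference, those checks could look like this (a sketch; MAX_MEMORY_USED is also available in recent MariaDB versions):

```sql
-- Server-wide allocation as tracked by MariaDB itself
SHOW GLOBAL STATUS LIKE 'memory_used';

-- Per-connection allocation; a leaking connection stands out here
SELECT ID, USER, COMMAND, MEMORY_USED, MAX_MEMORY_USED
FROM information_schema.PROCESSLIST
ORDER BY MEMORY_USED DESC
LIMIT 10;
```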

Generated at Thu Feb 08 10:13:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.