[MDEV-30889] Memory leak issues with MariaDB 10.6.12 and OOM Crashes Created: 2023-03-20  Updated: 2024-02-05

Status: Stalled
Project: MariaDB Server
Component/s: Stored routines
Affects Version/s: 10.6.12
Fix Version/s: 10.6

Type: Bug Priority: Critical
Reporter: Pravin Malali Assignee: Oleksandr Byelkin
Resolution: Unresolved Votes: 10
Labels: None
Environment:

OS - Debian Bullseye
MariaDB - 10.6.12
Virtual Machine
Total Memory - 96 GB
CPUs -
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 45 bits physical, 48 bits virtual
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 12
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz
Stepping: 7
CPU MHz: 2593.905
BogoMIPS: 5187.81
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 576 KiB
L1i cache: 384 KiB
L2 cache: 15 MiB
L3 cache: 48 MiB
NUMA node0 CPU(s): 0-11


Attachments: PNG File Screenshot 2023-09-29 at 10.43.10.png     PNG File Selection_999(259).png     PNG File Selection_999(260).png     PNG File Selection_999(261).png     PNG File Selection_999(262).png     PNG File Selection_999(263)-1.png     PNG File Selection_999(263).png     PNG File Selection_999(307).png     PNG File image-2023-06-19-08-55-02-967.png     PNG File memory_used.png     File query.sql     PNG File screenshot-1.png     PNG File screenshot-2.png     PNG File switch_to_tcmalloc.png    
Issue Links:
Relates
relates to MDEV-14602 Reduce malloc()/free() usage in InnoDB Confirmed
relates to MDEV-28820 MyISAM wrong server status flags Closed
relates to MDEV-14959 Control over memory allocated for SP/PS Stalled
relates to MDEV-29988 Major performance regression with 10.... Closed
relates to MDEV-31127 Possible memory leak on running inser... Open

 Description   

MariaDB consumes all of the memory available on the server and eventually crashes due to OOM.

Mar 17 18:35:35 <server_name> kernel: [2871555.121737] Swap cache stats: add 28523916, delete 28524014, find 47249895/50419652
Mar 17 18:35:35 <server_name> kernel: [2871555.121737] Free swap  = 0kB
Mar 17 18:35:35 <server_name> kernel: [2871555.121738] Total swap = 999420kB
Mar 17 18:35:35 <server_name> kernel: [2871555.121738] 18874238 pages RAM
Mar 17 18:35:35 <server_name> kernel: [2871555.121738] 0 pages HighMem/MovableOnly
Mar 17 18:35:35 <server_name> kernel: [2871555.121739] 319662 pages reserved
Mar 17 18:35:35 <server_name> kernel: [2871555.121739] 0 pages hwpoisoned
Mar 17 18:35:35 <server_name> kernel: [2871555.121740] Tasks state (memory values in pages):
Mar 17 18:35:35 <server_name> kernel: [2871555.121740] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Mar 17 18:35:35 <server_name> kernel: [2871555.121747] [    400]     0   400    41263      206   208896      604          -250 systemd-journal
Mar 17 18:35:35 <server_name> kernel: [2871555.121749] [    425]     0   425     6003       38    69632      383         -1000 systemd-udevd
Mar 17 18:35:35 <server_name> kernel: [2871555.121751] [    585]     0   585    12424        0    81920      403             0 VGAuthService
Mar 17 18:35:35 <server_name> kernel: [2871555.121752] [    586]     0   586    43443      242    94208      181             0 vmtoolsd
Mar 17 18:35:35 <server_name> kernel: [2871555.121754] [    603]   105   603     7654      181    73728       98          -900 dbus-daemon
Mar 17 18:35:35 <server_name> kernel: [2871555.121755] [    609]   112   609   549788     4955   372736     1872             0 prometheus-node
Mar 17 18:35:35 <server_name> kernel: [2871555.121756] [    611]     0   611    73164      530   102400      574             0 rsyslogd
Mar 17 18:35:35 <server_name> kernel: [2871555.121758] [    898]     0   898      807       37    32768       17             0 caagentd
Mar 17 18:35:35 <server_name> kernel: [2871555.121759] [   1720]     0  1720     6279      149    77824      153             0 systemd-logind
Mar 17 18:35:35 <server_name> kernel: [2871555.121760] [   3043]   110  3043     4705       70    77824      182             0 lldpd
Mar 17 18:35:35 <server_name> kernel: [2871555.121762] [   3050]     0  3050    27667       12   110592     2130             0 unattended-upgr
Mar 17 18:35:35 <server_name> kernel: [2871555.121763] [   3057]   108  3057    21943       51    69632      167             0 ntpd
Mar 17 18:35:35 <server_name> kernel: [2871555.121764] [   3058]     0  3058     3347       45    65536      201         -1000 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121765] [   3061]   110  3061     4728      104    77824      214             0 lldpd
Mar 17 18:35:35 <server_name> kernel: [2871555.121767] [   3294]     0  3294     4472       25    61440       86             0 cron
Mar 17 18:35:35 <server_name> kernel: [2871555.121768] [   3298]   109  3298     2111       31    53248      163          -500 nrpe
Mar 17 18:35:35 <server_name> kernel: [2871555.121769] [   3304]     0  3304     1416       12    49152       75             0 atd
Mar 17 18:35:35 <server_name> kernel: [2871555.121771] [   3318]     0  3318     1987        0    45056       76             0 agetty
Mar 17 18:35:35 <server_name> kernel: [2871555.121772] [ 828871]     0 828871    28669        9   176128     1035             0 sssd
Mar 17 18:35:35 <server_name> kernel: [2871555.121773] [ 828872]     0 828872    38730     2757   200704     1102             0 sssd_be
Mar 17 18:35:35 <server_name> kernel: [2871555.121774] [ 828873]     0 828873    33879      726   249856      787             0 sssd_nss
Mar 17 18:35:35 <server_name> kernel: [2871555.121776] [ 828874]     0 828874    27486      994   200704      843             0 sssd_pam
Mar 17 18:35:35 <server_name> kernel: [2871555.121777] [ 828875]     0 828875    25646      168   167936      824             0 sssd_pac
Mar 17 18:35:35 <server_name> kernel: [2871555.121778] [ 828876]     0 828876    27065      669   188416      845             0 sssd_sudo
Mar 17 18:35:35 <server_name> kernel: [2871555.121779] [3867706]     0 3867706   781525    11410  1392640     8840             0 pmm-agent
Mar 17 18:35:35 <server_name> kernel: [2871555.121781] [3867999]     0 3867999   182337     4007   180224      288             0 node_exporter
Mar 17 18:35:35 <server_name> kernel: [2871555.121782] [4185979]     0 4185979   181211     3798   176128      148             0 mysqld_exporter
Mar 17 18:35:35 <server_name> kernel: [2871555.121784] [4186027]     0 4186027   215617    39736   520192     2276             0 vmagent
Mar 17 18:35:35 <server_name> kernel: [2871555.121785] [ 786172]     0 786172    12905       72    90112      284             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121786] [ 786225] 85018827 786225     3855      277    69632       87             0 systemd
Mar 17 18:35:35 <server_name> kernel: [2871555.121787] [ 786226] 85018827 786226    48988      316   106496      609             0 (sd-pam)
Mar 17 18:35:35 <server_name> kernel: [2871555.121788] [ 786245] 85018827 786245    12905       99    86016      261             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121790] [ 786246] 85018827 786246     4840        1    69632      438             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121791] [ 786356] 85018827 786356     9670        2    77824      216             0 sudo
Mar 17 18:35:35 <server_name> kernel: [2871555.121792] [ 786357]     0 786357     7678        1    69632      171             0 su
Mar 17 18:35:35 <server_name> kernel: [2871555.121793] [ 786358]     0 786358     2335       93    53248      118             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121795] [1821490]     0 1821490    12906      154    90112      203             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121796] [1821502] 85018827 1821502    12906      171    86016      189             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121797] [1821503] 85018827 1821503     4840      238    65536      202             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121799] [1821609] 85018827 1821609     5423      115    69632      102             0 ssh
Mar 17 18:35:35 <server_name> kernel: [2871555.121800] [2488524]     0 2488524    12915      209    86016      159             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121802] [2488532] 85018904 2488532     3856      367    77824        0             0 systemd
Mar 17 18:35:35 <server_name> kernel: [2871555.121803] [2488533] 85018904 2488533    48988      576   106496      349             0 (sd-pam)
Mar 17 18:35:35 <server_name> kernel: [2871555.121804] [2488551] 85018904 2488551    12955      288    86016      140             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121805] [2488552] 85018904 2488552     4840      418    65536       24             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121806] [2488812] 85018904 2488812     9681      217    65536       13             0 sudo
Mar 17 18:35:35 <server_name> kernel: [2871555.121807] [2488813]     0 2488813     7678       40    69632      132             0 su
Mar 17 18:35:35 <server_name> kernel: [2871555.121808] [2488814]     0 2488814     2310      164    57344       26             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121810] [2508581]     0 2508581    10619      168    81920      171             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121811] [2508598] 85018916 2508598     3856      347    69632        4             0 systemd
Mar 17 18:35:35 <server_name> kernel: [2871555.121812] [2508599] 85018916 2508599    49038      643   106496      285             0 (sd-pam)
Mar 17 18:35:35 <server_name> kernel: [2871555.121813] [2508618] 85018916 2508618    10619      210    81920      131             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121815] [2508619] 85018916 2508619     4840      431    61440        7             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121817] [2508729] 85018916 2508729     9673      199    77824       22             0 sudo
Mar 17 18:35:35 <server_name> kernel: [2871555.121818] [2508730]     0 2508730     2568      419    57344       22             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121819] [2508834]     0 2508834     9657      205    73728        0             0 sudo
Mar 17 18:35:35 <server_name> kernel: [2871555.121820] [2508835]     0 2508835     2596      446    61440        0             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121821] [2522285]     0 2522285    12907      293    94208       66             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121822] [2522296] 85018827 2522296    12907      299    81920       63             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121823] [2522297] 85018827 2522297     4840      435    65536        3             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121825] [2522425] 85018827 2522425     9670      219    81920        0             0 sudo
Mar 17 18:35:35 <server_name> kernel: [2871555.121826] [2522426]     0 2522426     7678      173    73728        0             0 su
Mar 17 18:35:35 <server_name> kernel: [2871555.121827] [2522427]     0 2522427     2335      195    65536        0             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121828] [2525571]     0 2525571     1350       17    49152        0             0 tail
Mar 17 18:35:35 <server_name> kernel: [2871555.121829] [2529115]     0 2529115    12906      358    94208        0             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121830] [2529159]   113 2529159 19180640 18261859 149372928   221370             0 mariadbd
Mar 17 18:35:35 <server_name> kernel: [2871555.121832] [2529169] 85018827 2529169    12906      361    86016        1             0 sshd
Mar 17 18:35:35 <server_name> kernel: [2871555.121833] [2529170] 85018827 2529170     4840      370    61440       70             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121834] [2529349] 85018827 2529349     9670      215    73728        3             0 sudo
Mar 17 18:35:35 <server_name> kernel: [2871555.121835] [2529350]     0 2529350     7678      172    73728        0             0 su
Mar 17 18:35:35 <server_name> kernel: [2871555.121836] [2529351]     0 2529351     2310      191    53248        0             0 bash
Mar 17 18:35:35 <server_name> kernel: [2871555.121838] [2549755]     0 2549755     5428      325    69632        0             0 top
Mar 17 18:35:35 <server_name> kernel: [2871555.121840] [2555764]   109 2555764     2144      102    53248      107          -500 nrpe
Mar 17 18:35:35 <server_name> kernel: [2871555.121841] [2555765]   109 2555765     2144      109    49152      100          -500 nrpe
Mar 17 18:35:35 <server_name> kernel: [2871555.121843] [2555766]   109 2555766      955       60    45056        0          -500 check_ntp_in_mi
Mar 17 18:35:35 <server_name> kernel: [2871555.121844] [2555770]   109 2555770      955       63    45056        0          -500 check_ntp_in_mi
Mar 17 18:35:35 <server_name> kernel: [2871555.121845] [2555771]   109 2555771     1924       86    45056        0          -500 ntpq
Mar 17 18:35:35 <server_name> kernel: [2871555.121846] [2555772]   109 2555772      811       32    45056        0          -500 grep
Mar 17 18:35:35 <server_name> kernel: [2871555.121847] [2555773]   109 2555773     2057       80    61440        0          -500 awk
Mar 17 18:35:35 <server_name> kernel: [2871555.121848] [2555775]     0 2555775     4669      104    61440       44             0 cron
Mar 17 18:35:35 <server_name> kernel: [2871555.121849] [2555776]   109 2555776     2111       53    53248      141          -500 nrpe
Mar 17 18:35:35 <server_name> kernel: [2871555.121850] [2555777]   109 2555777     2111       42    49152      152          -500 nrpe
Mar 17 18:35:35 <server_name> kernel: [2871555.121851] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,global_oom,task_memcg=/system.slice/mariadb.service,task=mariadbd,pid=2529159,uid=113
Mar 17 19:25:37 <server_name> rsyslogd: action 'action-0-builtin:omfwd' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.2102.0 try https://www.rsyslog.com/e/2007 ]
Mar 17 19:25:37 <server_name> rsyslogd: action 'action-0-builtin:omfwd' resumed (module 'builtin:omfwd') [v8.2102.0 try https://www.rsyslog.com/e/2359 ]

*Memory Allocation*
 
  @@key_buffer_size / 1048576: 32.0000
       @@query_cache_size / 1048576: 50.0000
@@innodb_buffer_pool_size / 1048576: 78848.0000
 @@innodb_log_buffer_size / 1048576: 16.0000
                  @@max_connections: 500
       @@read_buffer_size / 1048576: 0.1250
   @@read_rnd_buffer_size / 1048576: 0.2500
       @@sort_buffer_size / 1048576: 2.0000
       @@join_buffer_size / 1048576: 0.2500
      @@binlog_cache_size / 1048576: 0.0313
           @@thread_stack / 1048576: 0.2852
         @@tmp_table_size / 1048576: 32.0000
                      MAX_MEMORY_GB: 94.1569
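
For reference, the MAX_MEMORY_GB figure above follows the usual worst-case formula: global buffers plus max_connections times the per-connection buffers. A sketch of that arithmetic as a single query (a reconstruction, not part of the original report):

-- Reconstruction of the MAX_MEMORY_GB arithmetic shown above:
-- global buffers + max_connections * per-connection buffers, in GiB.
SELECT
  ( @@key_buffer_size + @@query_cache_size
  + @@innodb_buffer_pool_size + @@innodb_log_buffer_size
  + @@max_connections
    * ( @@read_buffer_size + @@read_rnd_buffer_size
      + @@sort_buffer_size + @@join_buffer_size
      + @@binlog_cache_size + @@thread_stack + @@tmp_table_size )
  ) / POW(1024, 3) AS MAX_MEMORY_GB;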



 Comments   
Comment by Pravin Malali [ 2023-03-22 ]

Please see the attached pattern of memory usage. The timings marked with downward arrows are when the DB service was restarted or crashed due to OOM.

Also note how MariaDB does not release memory back to the OS even after crashes or restarts.

Please see below the .cnf configs we have:

### MANAGED BY PUPPET ###
 
[client]
port                                  = 3306
socket                                = /var/run/mysqld/mysqld.sock
 
[mysqld]
 
# GENERAL #
user                                  = mysql
default_storage_engine                = InnoDB
character_set_server                  = utf8mb4
collation_server                      = utf8mb4_unicode_ci
tmpdir                                = /opt/mysql/tmp
socket                                = /var/run/mysqld/mysqld.sock
bind_address                          = *
wait_timeout                          = 1800
interactive_timeout                   = 1800
 
# MyISAM #
key_buffer_size                       = 32M
myisam_recover_options                = BACKUP,QUICK
 
# SAFETY #
max_allowed_packet                    = 256M
max_connect_errors                    = 1000000
sysdate_is_now                        = 1
 
# DATA STORAGE #
datadir                               = /opt/mysql/data/
 
# BINARY LOGGING #
binlog_checksum                       = NONE
max_binlog_size                       = 1G
log_bin                               = /opt/mysql/log/<server name>-bin
log_bin_index                         = /opt/mysql/log/<server name>-bin-index
expire_logs_days                      = 7
sync_binlog                           = 1
log_bin_trust_function_creators       = 1
binlog_format                         = ROW
server_id                             = 42193
log_slave_updates                     = 1
 
# SLAVE #
slave_parallel_threads                = 0
slave_max_allowed_packet              = 1073741824
relay_log                             = /opt/mysql/log/mysqld-relay-bin
relay_log_index                       = /opt/mysql/log/mysqld-relay-bin.index
read_only                             = 0
 
# CACHES AND LIMITS #
tmp_table_size                        = 32M
max_heap_table_size                   = 32M
max_connections                       = 250
open_files_limit                      = 65535
table_definition_cache                = 1024
table_open_cache                      = 2048
 
# INNODB #
innodb_flush_method                   = O_DIRECT
innodb_log_file_size                  = 1G
innodb_log_buffer_size                = 16M
innodb_flush_log_at_trx_commit        = 1
innodb_file_per_table                 = 1
innodb_buffer_pool_size               = 78784M
innodb_flush_neighbors                = 0
innodb_buffer_pool_dump_at_shutdown   = 0
innodb_buffer_pool_load_at_startup    = 0
innodb_force_primary_key              = 1
innodb_autoinc_lock_mode              = 2
innodb_monitor_enable                 = all
 
# LOGGING #
log_output                            = FILE
log_error                             = /opt/mysql/log/<server name>.err
slow_query_log                        = 1
slow_query_log_file                   = /opt/mysql/log/mysql-slow.log
long_query_time                       = 10
general_log                           = 0
general_log_file                      = /opt/mysql/log/mysql-general.log
 
# PERFORMANCE_SCHEMA #
performance_schema                    = 1
performance_schema_consumer_statements_digest = 1
performance_schema_consumer_events_statements_history_long = 1
 
!includedir /etc/mysql/conf.d/
 
 
# MariaDB-specific config file.
# Read by /etc/mysql/my.cnf
 
[client]
# Default is Latin1, if you need UTF-8 set this (also in server section)
#default-character-set = utf8 
 
[mysqld]
 
#table_open_cache=8192
#table_definition_cache=8192
innodb_force_primary_key=0
innodb_strict_mode=0
transaction_isolation='READ-COMMITTED'
 
max_allowed_packet                    = 1G
expire_logs_days                      = 1
read_only                             = OFF
 
#innodb_buffer_pool_size               = 45021M
slave_compressed_protocol             = ON
#query_cache_limit=52428800
#query_cache_size=52428800
max_connections                 = 500
innodb_buffer_pool_size               = 65536M
sql_mode='NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION'
optimizer_switch = 'split_materialized=off'
#skip-slave-start
 
#
# * Character sets
# 
# Default is Latin1, if you need UTF-8 set all this (also in client section)
#
#character-set-server  = utf8 
#collation-server      = utf8_general_ci 
#character_set_server   = utf8 
#collation_server       = utf8_general_ci 

Comment by Marko Mäkelä [ 2023-03-29 ]

pravin_malali, thank you for the report. I see that your innodb_buffer_pool_size is 77GiB (the 78848M reported above) despite 64GiB in the configuration, and the total maximum memory usage is 94GiB.

What kind of workload are you running? Any prepared statements or stored procedures? I wonder if this could be related to MDEV-29988, which was a regression introduced in 10.6.11 and supposedly fixed in 10.6.12.

Comment by Pravin Malali [ 2023-03-29 ]

Yes, we run a lot of prepared statements and stored procedures. The load is very read-heavy.

The anomaly with the InnoDB buffer pool size was strange too; after a restart it reflects the configured value.

We are running MariaDB 10.6.12 and we still have these issues.

Comment by Oleksandr Byelkin [ 2023-04-17 ]

JFYI, 4 memory leaks were found in PS/SP; 3 are fixed (in review) and one is in the process of being fixed...

Comment by Oleksandr Byelkin [ 2023-04-27 ]

I have fixed 3 memory leaks (they will be in the next release). Feedback on whether this helps is needed. The work of finding memory leaks also continues...

Comment by Vincent Jancso [ 2023-05-09 ]

I built 10.6.12 with the 3 fixes.

Unfortunately, we still see the same memory leaks. Looking forward to testing with the 4th fix that is coming soon.
Let me know if you need any help in terms of testing.

Comment by Pravin Malali [ 2023-05-22 ]

MariaDB 10.6.13 came out on the 10th of May 2023. Does it address the OOM issues, or are we still waiting on a fix for this issue in general?

Comment by Oleksandr Byelkin [ 2023-06-06 ]

As you can see (comment above), we have fixed some issues, but the bug report does not point to anything concrete (no test case and so on), so we are just trying to find memory leaks and fix them. If you have something we can repeat, you are welcome to share it.

Right now we are also building tools to catch possible memory leaks in SP/PS, and we fix everything we find (3 so far).

Comment by Vincent Jancso [ 2023-06-06 ]

We can reproduce memory leaks with the attached query.

It's a TYPO3 instance which is doing a lot of SELECT queries with many nested subqueries.
Every time I run this query, memory usage jumps up (by circa 10 MB) and is never freed. I can run the query in an infinite loop and watch the usage increase constantly.

No prepared statements, stored procedures, or virtual columns are used.

Let me know if you need more information.

query.sql

Comment by Oleksandr Byelkin [ 2023-06-06 ]

I will try to reconstruct a database matching the query and see whether it behaves the same.

Comment by Michael Widenius [ 2023-06-06 ]

Vincent, can you attach 'show create table' for the tables involved in the query?
If possible, can you try to add some 'dummy data' that would show the problem?
You can use the sequence tables to insert a lot of data quickly:

insert into other_table (key1,key2,key3) select seq,seq,seq from seq_1_to_1000;

If we can repeat the issue, then we can fix it!

Please also add an EXPLAIN of the query so we can see what the optimizer is doing.

Comment by Vincent Jancso [ 2023-06-08 ]

As it's a customer's database, I have to clarify whether we can share the table definitions and data. In case I get permission, do you have a secure way to share that data? Otherwise I will have to anonymize part of it.

Comment by Tobias Schneider [ 2023-06-13 ]

Good morning,

we are also experiencing a memory leak and are just starting to analyze it. Is there any data I can provide to help you?

Version: 10.6.13
Running in: kubernetes 1.24.14
Application: based on spring boot 2.6.15 with spring-data-jpa 2.6.10

Comment by Tobias Schneider [ 2023-06-13 ]

We upgraded to 10.6.14. I will provide more information about the memory consumption on Thursday.

Edit: We still have a memory leak with 10.6.14.

Comment by Vincent Jancso [ 2023-06-28 ]

I just uploaded a full dump and a query through your FTP.

The best way to reproduce it is to run the query simultaneously (~10x) in an endless loop.
Then you should see the virtual memory jumping up pretty fast. RSS also rises, slowly but constantly.

Comment by Lena Startseva [ 2023-06-28 ]

A possible script for filling the tables with test data follows; note that the table descriptions, the valid ranges of values for some columns, and the statistical distribution of some of the data still need to be clarified:

--source include/have_innodb.inc
--source include/have_sequence.inc
 
CREATE TABLE `tx_news_domain_model_news`  (
  `uid` int(11) NOT NULL,
  `deleted`  smallint(6) DEFAULT 0,
  `pid`  smallint(6) NOT NULL,
  `sys_language_uid` smallint(6) DEFAULT 0,
  `t3ver_state` smallint(6) DEFAULT 0,
  `t3ver_wsid` smallint(6) DEFAULT 0,
  `t3ver_oid` smallint(6) DEFAULT 0,
  `hidden` smallint(6) DEFAULT 0,
  `starttime` int DEFAULT NULL,
  `endtime` int DEFAULT NULL,
  `fe_group`  varchar(30) DEFAULT NULL,
  `datetime` int DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
 
CREATE TABLE `sys_category_record_mm`  (
  `uid_foreign` int(11) NOT NULL,
  `uid_local` int(11) NOT NULL,
  `fieldname` varchar(30) NOT NULL,
  `tablenames` varchar(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
 
insert into `tx_news_domain_model_news`(
  `uid`,
  `deleted`,
  `pid`,
  `sys_language_uid`,
  `t3ver_state`,
  `t3ver_wsid`,
  `t3ver_oid`,
  `hidden`,
  `starttime`,
  `endtime`,
  `fe_group`,
  `datetime`
) select
  FLOOR(1 + (RAND() * 300)),
  ELT(FLOOR(1 + (RAND() * 2)), 0, 1),
  FLOOR(1 + (RAND() * 10)),
  ELT(FLOOR(1 + (RAND() * 3)), 0, 1, -1),
  ELT(FLOOR(1 + (RAND() * 3)), 0, 1, -1),
  ELT(FLOOR(1 + (RAND() * 3)), 0, 1, -1),
  ELT(FLOOR(1 + (RAND() * 3)), 0, 1, -1),
  ELT(FLOOR(1 + (RAND() * 2)), 0, 1),
  UNIX_TIMESTAMP(CURRENT_TIMESTAMP - INTERVAL FLOOR(RAND() * 90 * 24 * 60 *60) SECOND),
  0,
  ELT(FLOOR(1 + (RAND() * 5)), '', '1', '0', '-1', NULL),
  UNIX_TIMESTAMP(CURRENT_TIMESTAMP - INTERVAL FLOOR(RAND() * 100 * 24 * 60 *60) SECOND)
from  seq_1_to_1000000;
 
 
insert into `sys_category_record_mm`(
  `uid_foreign`,
  `uid_local`,
  `fieldname`,
  `tablenames`
) select
    FLOOR(1 + (RAND() * 300)),
    FLOOR(1 + (RAND() * 10)),
    ELT(FLOOR(1 + (RAND() * 2)), 'categories','fieldname1'),
    ELT(FLOOR(1 + (RAND() * 2)), 'tx_news_domain_model_news','tx_news_domain_model_old')
from  seq_1_to_100000;
 
drop tables  `tx_news_domain_model_news`, `sys_category_record_mm`;

Comment by Oleksandr Byelkin [ 2023-06-29 ]

vrj95 Thank you a lot for the query and the dump, but I cannot repeat the memory leak. I will try more, but it would speed up the process if you could tell me:

1) how the query is executed (binary/text protocol)
2) the settings of the server (especially sql_mode, optimizer_switch)
3) anything else which could help to repeat it

Comment by Oleksandr Byelkin [ 2023-06-30 ]

Actually, the output of:

select * from information_schema.processlist;

could be useful to detect which thread/commands lead to the memory leak (and whether the memory is registered at all).

Comment by Michael Widenius [ 2023-06-30 ]

Could you try to make a graph of
show global status like 'memory_used';
and see if this also increases over time?

If it increases, then also run
select id, info, memory_used, max_memory_used from information_schema.processlist;

and check which process increases max_memory_used the most.

If the above memory_used does not increase, it could be that the problem is memory fragmentation, and changing to jemalloc may help.

Comment by Michael Widenius [ 2023-09-06 ]

I was looking at the graphs attached to the issue.
They show the memory growing over time, but also going down at certain times.
This means that this issue is likely to be one of the following:

  • A long-running query that frees its memory at the end of the query.
  • A memory leak similar to MDEV-29988, where memory associated with a connection or prepared statement continues
    to grow until the connection ends.

Having the output of 'select * from information_schema.processlist' taken every half hour over some period of time would be very helpful in figuring out what is wrong.
The same goes for show global status like 'memory_used'.
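
For illustration only (not part of the original comment), the half-hourly sampling could be automated with an event along these lines; the table and event names are made up for this sketch:

-- Hypothetical sampling helper; requires SET GLOBAL event_scheduler = ON;
CREATE TABLE IF NOT EXISTS mem_samples (
  ts          TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  memory_used BIGINT UNSIGNED
);

CREATE EVENT IF NOT EXISTS sample_memory_used
ON SCHEDULE EVERY 30 MINUTE DO
  INSERT INTO mem_samples (memory_used)
  SELECT VARIABLE_VALUE
    FROM information_schema.GLOBAL_STATUS
   WHERE VARIABLE_NAME = 'MEMORY_USED';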

Comment by Vincent Jancso [ 2023-09-29 ]

Sorry for the long delay. In collaboration with our customer, we managed to set up a replica of their production server so we can experiment. I have disabled page caching of the TYPO3 instance and am running a script that constantly makes requests to the site.

Please have a look at the graph. I can now fully reproduce it.

RES size was about 800MB, Virtual Memory 5GB when I started yesterday.
Today, we're at RES 17.6GB and Virtual Memory 774GB.

I also made a graph from SHOW GLOBAL STATUS LIKE "memory_used" as requested. It does not increase over time.

I also logged the process list and uploaded it through your FTP (MDEV-30889-processlist.log)

Regarding the settings:

sql_mode = ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
in_predicate_conversion_threshold = 4294967295 (required because of the many subqueries)

Comment by Sergei Golubchik [ 2023-09-29 ]

vrj95, as the global "memory_used" value isn't growing, the memory that the server can account for isn't leaking. It could still be a memory leak inside InnoDB, or not a memory leak at all but, for example, an extreme case of memory fragmentation that causes the memory allocator to use more and more system memory. The latter is more likely; we've seen these cases before.

You're likely using ptmalloc2, the default glibc memory allocator. Can you try jemalloc or tcmalloc? A different allocator will have a different memory allocation strategy, and often that is enough to fix the fragmentation problem.

If you start MariaDB via mysqld_safe, run it with --malloc-lib=jemalloc or add this to your my.cnf.
If you start MariaDB with systemd, add Environment="LD_PRELOAD=..." to the mariadb.service file.
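
For illustration, the two options might look like this (a sketch; the jemalloc library path is an assumption and varies by distribution):

# my.cnf, when starting via mysqld_safe:
[mysqld_safe]
malloc-lib = jemalloc

# Or a systemd drop-in (e.g. created with `systemctl edit mariadb.service`);
# the exact .so path below is an assumption:
[Service]
Environment="LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"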

Comment by Vincent Jancso [ 2023-10-02 ]

Sorry, I just realized I never mentioned that I'm using FreeBSD; this may be important for you. I originally posted to MDEV-30848, which is FreeBSD-related.

Under FreeBSD, jemalloc is the default memory allocator.
I will try to build devel/google-perftools, which should provide tcmalloc, and see if that makes any difference.
As far as I can see, ptmalloc2 is not really available on FreeBSD. There is a port, but it is not maintained anymore.

Comment by Marko Mäkelä [ 2023-10-02 ]

For what it is worth, years ago InnoDB had a memory fragmentation issue that was fixed by reducing the number of heap memory allocations made while tables are being added to the InnoDB data dictionary cache. I remember that the function dict_add_col_name() was added to address that. That change was bundled with a bigger change for MySQL Bug #20877.

My understanding is that memory allocator libraries have an easier time if there are fewer memory allocation requests and they are of the same size. Hopefully there are some debugging tools that report internal memory fragmentation in the allocator and allow finding the malloc() or realloc() calls that contribute most to it.

Comment by Vincent Jancso [ 2023-10-04 ]

I was able to switch to the tcmalloc allocator, which seems to fix the problem. Both virtual and resident memory remain stable.

Comment by Michael Widenius [ 2023-10-22 ]

When the user moved to tcmalloc, the OOM issues disappeared.
As jemalloc did not report any memory leaks, this is clearly a fragmentation issue.
What needs to be done is to analyze the queries that cause the memory fragmentation and find out whether there is anything that can be done on the MariaDB side to minimize it.

One thing we can try is to allocate most memory blocks with the same size, with some special handling of big blobs, etc.
However, we need a test case that shows the fragmentation to be able to come up with proven solutions.

Comment by Michal Dobroczynski [ 2023-11-13 ]

We are currently being hit by this issue big time.

  • `mysql Ver 15.1 Distrib 10.6.14-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2`
  • Ubuntu 20.04 LTS,
  • DO, 32GB RAM, 16vCPU (optimized type)
  • 2 large dbs (ca. 210GB).

The first graph shows memory usage as reported by our monitoring. With `innodb_buffer_pool_size = 16G` it should not soar so high.

Now I have slashed it down to 12G (on a 32GB instance) - and after less than 24 hours it's at 76%.

The second graph: I sampled VSIZE and RSS from `ps` (every minute). Both basically only go up.

Can we safely jump on the jemalloc train, or what are our options? Will upgrading to 10.6.15 change anything?

I am not sure if there is any correlation, but InnoDB does preload some pages from disk to speed up warming up, if I remember correctly? Would disabling that feature - or discarding the current dump - help?

We take regular backups (mysqldump, single transaction) - obviously memory usage climbs during that time, but it only drops again after a fresh restart, which is also expected (it drops below the InnoDB buffer pool size).

When I ask mariadb about the configured size, to be 100% sure no monkey business is involved:
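
(The actual output was attached as a screenshot; the check was presumably something along these lines, sketched here for completeness:)

-- Hypothetical equivalent of the check mentioned above:
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gib;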

If there's anything I can do to help, let me know. This is our prod machine, so doing risky stuff is not really on the table, but I can spin up a copy where I can run experiments.

Thanks in advance for any advice on the subject.

Comment by Sergei Golubchik [ 2023-11-13 ]

There are many other allocations besides the InnoDB buffer pool. Check also

show global status like 'memory_used%';

and

select id, memory_used, max_memory_used from information_schema.processlist;

Comment by Michal Dobroczynski [ 2023-11-13 ]

Thanks for your insight. Yes, I am aware of that, but unless there are thousands of connections it should not cause much trouble?

I am puzzled because previously, when we allocated the InnoDB buffer pool to be 70% of total RAM (dedicated DB machine), it would hold. Now, allocating 37.5%, it goes south within 2-3 days.

Comment by Vincent Jancso [ 2023-11-21 ]

Same problem with 10.6.16.

@Michael Widenius Let me know if there's anything else I can do to help you find the cause. I still believe it may be caused by queries with a lot of nested subqueries, like the one I provided on 06/06/2023. Queries like that caused the virtual memory to jump up by 400-500 MB every time I executed them.

Comment by Michal Dobroczynski [ 2023-11-21 ]

@Vincent Jancso The issue got completely out of hand here, and I decided to follow the initial advice, i.e. `tcmalloc` or `jemalloc`. In my case it was the latter. It solved the issue - although I am surprised that the fragmentation gets so far out of hand. I have similar workloads that run on the same version and are super stable.

Comment by Marko Mäkelä [ 2023-11-22 ]

I think that it could be a useful exercise to evaluate and classify all malloc() calls that are executed for some rather short SQL input.

When it comes to the InnoDB storage engine, I think that the situation was improved in the past by MDEV-20950 and MDEV-15199, which replaced some heap memory allocation with stack-based allocation.

Comment by Roman Trapickin [ 2024-01-30 ]

We had a similar issue in our deployment with a containerized MariaDB 10.11.6. Memory usage went up with every execution of our daily mysqldump backup and never came down. After some time the DB process was OOM-killed. The memory usage curve was very similar to Michal's. We gave the container up to 20 GB of memory (overkill according to mysqltuner) to no avail.

In our case it was tcmalloc which solved the issue. We tried jemalloc too, which changed the course of the curve but not the tendency.

FYI: cat /proc/buddyinfo listed very low numbers on the right side of the table. Memory chunks larger than 512 KiB were almost absent, while there were plenty of very small chunks. It does indeed look like a memory fragmentation issue.

Comment by Bruno Bear [ 2024-02-01 ]

We still have this problem. We tried to solve it by disabling Transparent Huge Pages and using jemalloc; neither helped.
We are now trying tcmalloc and waiting for the result.
memory_used and information_schema.processlist only show a fraction of the memory in use (300 MB vs. a 6 GB overhead).
We see this on multiple servers running different software, so it's not a specific query.
I had a feeling it was caused by inserts rather than selects: if we import a database, which is really just CREATE TABLEs and INSERTs, the overhead grows - for example, 10 GB before the import and 16 GB after, with the buffer pool size limited to 10 GB. (We later checked inside the container: it was not MariaDB using the memory, so this observation was wrong.)
We use the 10.6.12 Docker image on Debian 11.

Comment by Carl-Philip Hänsch [ 2024-02-05 ]

We also experience severe memory leaks, requiring a server restart every 15 minutes to avoid running into OOM issues.

Here's the setup:

RAM limit for MariaDB: 16 GB. Usage: 26.8 GB (84% of the system's memory)

The issue occurs after write queries like UPDATE or INSERT, especially when TRIGGERs are involved.

Server version: 10.11.6-MariaDB-1:10.11.6+maria~deb11

At the moment, we are porting a lot of projects to PostgreSQL to reduce the number of MariaDB crashes per hour.
