[MDEV-22998] memory leak in Galera-4 Created: 2020-06-24  Updated: 2021-09-22  Resolved: 2020-08-05

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.13
Fix Version/s: 10.4.14, 10.5.5

Type: Bug Priority: Critical
Reporter: Rick Pizzi Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 3
Labels: None

Attachments: File code.lua     File my.cnf    
Issue Links:
Relates
relates to MDEV-22908 After OOM event on one node, entire c... Closed

 Description   

The Galera wsrep layer leaks memory on nodes receiving writesets.
For some reason, any DDL issued on the master makes these nodes release some of the leaked memory, but it does not stop the leak.

This is affecting a customer of ours.
The leak is clearly in the wsrep layer: if a node gets evicted (the wsrep layer is shut down but mysqld stays up), all of the memory is released instantly.

Besides, the nodes where the leak appears are receiving NO queries at all.

The leak only affects nodes receiving writesets via the wsrep layer; the master is NOT affected.

How to reproduce: 3 CentOS 7 nodes with 1G of memory each; run the supplied sysbench script and leave it running for a few hours.

sysbench --db-driver=mysql --mysql-host=localhost --mysql-user=root code.lua --tables=16 prepare
sysbench --db-driver=mysql --mysql-host=localhost --mysql-user=root code.lua --tables=16 --threads=64 --time=0 run
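
To watch the leak while the benchmark runs, one option is to sample the resident set size of the server process on a node that only receives writesets. The following is a minimal sketch, not part of the original report. It assumes Linux with /proc mounted and a server process named mysqld; the function names (parse_vmrss_kb, find_pids, monitor) and the sampling interval are hypothetical.

```python
import os
import re
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def find_pids(process_name):
    """Return PIDs whose /proc/<pid>/comm equals process_name (Linux only)."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                if fh.read().strip() == process_name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return pids


def monitor(process_name="mysqld", interval_s=60, samples=10):
    """Print a timestamped RSS reading every interval_s seconds, samples times."""
    pids = find_pids(process_name)
    if not pids:
        raise RuntimeError(f"no process named {process_name!r} found")
    pid = pids[0]  # assumption: a single server instance per node
    for _ in range(samples):
        with open(f"/proc/{pid}/status") as fh:
            rss_kb = parse_vmrss_kb(fh.read())
        print(time.strftime("%Y-%m-%d %H:%M:%S"), rss_kb, "kB", flush=True)
        time.sleep(interval_s)
```

Calling monitor("mysqld") on a receiving node and on the master at the same time should show whether RSS grows steadily only on the receiving side, which is the pattern described above.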



 Comments   
Comment by Jan Lindström (Inactive) [ 2020-07-20 ]

As this bug is most likely in the wsrep-lib interface or in the Galera library, assigning it to teemu.ollakka

Comment by Julius Goryavsky [ 2020-07-21 ]

If a memory leak is confirmed in this test, it may explain another problem, MDEV-22908. As far as I understand, the primary component is lost in MDEV-22908 because memory is exhausted on one of the nodes. It is difficult to say for sure in advance, but perhaps these problems have a common cause.

Comment by Teemu Ollakka [ 2020-07-24 ]

Valgrind/Massif revealed a leak in

   n2: 2757856 0x1501019: alloc_root (my_alloc.c:250)
    n2: 2097940 0x7EDFCF: Query_arena::alloc(unsigned long) (sql_class.h:1049)
     n2: 2095660 0x884D89: lock_tables(THD*, TABLE_LIST*, unsigned int, unsigned int) (sql_base.cc:5522)
      n2: 2095660 0x88441E: open_and_lock_tables(THD*, DDL_options_st const&, TABLE_LIST*, bool, unsigned int, Prelocking_strategy*) (sql_base.cc:5278)
       n2: 2095660 0x83D8F9: open_and_lock_tables(THD*, TABLE_LIST*, bool, unsigned int) (sql_base.h:505)
        n1: 2095512 0xDD9596: Rows_log_event::do_apply_event(rpl_group_info*) (log_event.cc:11399)
         n1: 2095512 0x826AFD: Log_event::apply_event(rpl_group_info*) (log_event.h:1482)
          n1: 2095512 0xB95BE7: wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long) (wsrep_applier.cc:200)
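
For readers unfamiliar with this format: each line of a Massif heap tree gives the number of child entries (the digit after "n"), the bytes attributed to that frame, the code address, and the function with its source location. The sketch below is an illustration only, not part of the analysis in this comment. It parses three of the frames quoted above (wrapped lines rejoined) and computes how much of the alloc_root total is reached through wsrep_apply_events, which comes to roughly three quarters. The names FRAME_RE, parse_frames and share_of_root are hypothetical.

```python
import re

# Three frames copied from the Massif output above, with wrapped lines rejoined.
SAMPLE = """\
   n2: 2757856 0x1501019: alloc_root (my_alloc.c:250)
    n2: 2097940 0x7EDFCF: Query_arena::alloc(unsigned long) (sql_class.h:1049)
          n1: 2095512 0xB95BE7: wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long) (wsrep_applier.cc:200)
"""

# n<children>: <bytes> <address>: <function> (<file>:<line>)
FRAME_RE = re.compile(
    r"^\s*n(?P<children>\d+):\s+(?P<bytes>\d+)\s+(?P<addr>0x[0-9A-Fa-f]+):\s+"
    r"(?P<function>.*?)\s+\((?P<file>[^()]+):(?P<line>\d+)\)\s*$"
)


def parse_frames(text):
    """Parse Massif heap-tree lines into dicts; lines that do not match are skipped."""
    frames = []
    for raw in text.splitlines():
        match = FRAME_RE.match(raw)
        if match is None:
            continue
        frames.append({
            "children": int(match.group("children")),
            "bytes": int(match.group("bytes")),
            "addr": match.group("addr"),
            "function": match.group("function"),
            "file": match.group("file"),
            "line": int(match.group("line")),
        })
    return frames


def share_of_root(frames):
    """Fraction of the first frame's bytes that is attributed to the last frame."""
    return frames[-1]["bytes"] / frames[0]["bytes"]
```

Running parse_frames(SAMPLE) followed by share_of_root on the result gives the fraction of alloc_root memory that was allocated while applying replicated row events.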

With the following PR the leak disappeared: https://github.com/MariaDB/server/pull/1639

Could someone verify that the fix plugs the leak?

Comment by Doug Whitfield [ 2021-09-22 ]

Has anyone seen this behavior in 10.3?

Generated at Thu Feb 08 09:19:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.