[MDEV-20699] mysqldump of routines causes MariaDB to get killed by oom-killer Created: 2019-09-30  Updated: 2021-10-04  Resolved: 2021-09-28

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.1.25, 10.3.17, 10.4.8
Fix Version/s: 10.3.32, 10.4.22, 10.5.13, 10.6.4

Type: Bug Priority: Critical
Reporter: Matthias Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 0
Labels: crash, mysqldump, stored_procedures
Environment:
  • CentOS 7 (3.10.0-957.27.2.el7.x86_64)
  • CentOS 7 (3.10.0-957.5.1.el7.x86_64)

Attachments: Zip Archive Memory measurement.zip     HTML File massif-20190928     PNG File screenshot-1.png     File server.cnf     Zip Archive umdh.zip    
Issue Links:
Problem/Incident
causes MDEV-26754 main.sp test fails for embedded server Open

 Description   

Hi guys,

we have a problem creating a full dump of our database using mysqldump.

Since we upgraded from MariaDB 10.1.25 to MariaDB 10.3.17 two weeks ago, creating a full dump including all schemas (total of 3), triggers, events and routines causes the server get killed by oom-killer.

TL;DR:
It's the stored procedures that cause the server to go crazy and consume vast amounts of memory till it gets killed. Specifically the SHOW CREATE PROCEDURE command seems responsible for this behavior (tested also by running in MySQL Workbench with the same result).

Details:
After some stupid trial-and-error, I found out that it's dumping the stored procedures (422 in total, about 30 to 40 KB).
Dumping all the tables including events and triggers without the routines is fine and the output file is about 3 GB in size. But as soon as I try to dump the routines only, the server gets kill after only a few seconds.
For test purposes, I ran mysqldump on one of our slaves (MariaDB 10.3.17), dumping routines only and it crashed.
Then I tried to run the same command on our test system with an older version of MariaDB (10.1.25), but with the same data, events, triggers and routines as the slave. It crashed.
Out of pure desperation, I even dropped all routines and re-imported them manually. But as soon as I try to dump the routines only, the server crashes.

Before the upgrade, everything worked fine on the slave and the (older) test system. Is it possible that some internal structure or table of MariaDB got somehow corrupted? And if so, how can I fix this?

I also wanted to know what's going on inside MariaDB on our test system and so I started it using valgrind. I viewed the output file using massif-visualizer and it looks to me like MariaDB allocates huge amounts of memory for every stored procedure and never frees it again. But I'm completely new to valgrind, so I have no idea how to interpret the results correctly. If it helps, I can attach the massif's output file here (447 KB).

Here's the mysqldump command I'm using:
mysqldump --user=foobar --password --triggers --events --routines --result-file=/path/to/file --databases one two three

And even with this command it crashes:
mysqldump --user=foobar --password --routines --no-data --no-create-info --result-file=/path/to/file --databases one two three

Any help or idea is welcome!

Regards,
matze



 Comments   
Comment by Elena Stepanova [ 2019-10-06 ]

matze,

First of all, could you please specify how much memory your machine has, and how much (approximately) is "vast amounts of memory"?

Further, I'm not sure I quite understand, does it or does it not work on 10.1? First you said
> Then I tried to run the same command on our test system with an older version of MariaDB (10.1.25), but with the same data, events, triggers and routines as the slave. It crashed.
but then
> Before the upgrade, everything worked fine on the slave and the (older) test system.

So which is it, does it work fine or does it crash?

And yes, sure, please attach your massif output. Please also paste or attach your config file(s). If you can provide the actual dump on which the problem is reproducible, it would be even better. You can upload it to ftp.askmonty.org/private if you don't want to make it public.

Comment by Matthias [ 2019-10-10 ]

Hi @Elena,

thank you for your answer!

All of our servers had 2 GB of memory in total. I know, this sounds few, but it worked for more than 9 months. Now, our test server still has 2 GB, while the other (productive) servers have 4 to 10 GB.

I'm not sure if I interpret the results from valgrind and massif correctly, but it looks like MariaDB allocates about 80 MB per stored procedure (and never releases the memory again). As we have 422 stored procedures, that would make about 33 GB in total.

I'm sorry, I must have expressed myself somewhat inaccurately.
We have 3 productive nodes: one master and two slaves. Node 1 (master) is version 10.3.17, node 2 (first slave) is 10.4.8 and node 3 (second slave) is 10.3.17 too. Before the update, all servers had version 10.1.25 (like the test system right now). Every night at 10 pm, an automatic full backup using mysqldump was created from one of the slaves, including events, triggers and routines. Also, to keep the test system up-to-date, a dump was created at 9 pm from node 2 and imported into the test system. So creating a full dump or backup from one of these servers (all 10.1.25) was no problem. But since the update, when dumping the routines on the slave (no matter if locally or remotely) the server runs out of memory within a few seconds and gets killed. And it's since this update that even the test system runs out of memory, even though it's still 10.1.25. That's what's so confusing to us.

I'm afraid I can't provide the actual dump because it has 3 GB in size, but maybe I can provide one single stored procedure, so you can test it and see if it's the stored procedure that causes this issue. But actually it looks like that all stored procedures are affected, not only one single or maybe a couple...

I will provide the massif output and the config file.

By the way: I'm on vacation next week, so please don't be mad at me if I don't answer. But I'll be back on October 21st.

Thank you and regards,
Matthias

Comment by Oleksandr Byelkin [ 2020-03-25 ]

What about test suite?

Comment by Vladislav Vaintroub [ 2021-09-25 ]

I can't reproduce with current 10.3 (I did not test any earlier versions, since I understand, it is not used be the reporter).
Here is the full session with valgrind and mysqldump of the database netmon (routines in the attachment use that database). Valgrind report 288 bytes leaked in allocating tls variable somewhere, but I'm sure this is something on libc account, not on ours. In other words, there is no leak. It reports however 1GB memory use, which I believe is fine.

wlad@workpc:~/10.4/xxx/sql$ valgrind --leak-check=full ./mysqld --datadir=/home/wlad/10.4/xxx/sql/data &
[1] 10737
wlad@workpc:~/10.4/xxx/sql$ ==10737== Memcheck, a memory error detector
==10737== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==10737== Using Valgrind-3.14.0 and LibVEX; rerun with -h for copyright info
==10737== Command: ./mysqld --datadir=/home/wlad/10.4/xxx/sql/data
==10737==
2021-09-26  1:09:14 0 [Note] ./mysqld (mysqld 10.3.32-MariaDB) starting as process 10737 ...
2021-09-26  1:09:14 0 [Warning] Could not increase number of max_open_files to more than 1024 (request: 32190)
2021-09-26  1:09:14 0 [Warning] Changed limits: max_open_files: 1024  max_connections: 151 (was 151)  table_cache: 421 (was 2000)
2021-09-26  1:09:15 0 [Note] InnoDB: Using Linux native AIO
2021-09-26  1:09:15 0 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2021-09-26  1:09:15 0 [Note] InnoDB: Uses event mutexes
2021-09-26  1:09:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2021-09-26  1:09:15 0 [Note] InnoDB: Number of pools: 1
2021-09-26  1:09:15 0 [Note] InnoDB: Using SSE2 crc32 instructions
2021-09-26  1:09:15 0 [Note] InnoDB: Initializing buffer pool, total size = 128M, instances = 1, chunk size = 128M
2021-09-26  1:09:15 0 [Note] InnoDB: Completed initialization of buffer pool
2021-09-26  1:09:16 0 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2021-09-26  1:09:16 0 [Note] InnoDB: 128 out of 128 rollback segments are active.
2021-09-26  1:09:16 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2021-09-26  1:09:16 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2021-09-26  1:09:16 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2021-09-26  1:09:16 0 [Note] InnoDB: Waiting for purge to start
2021-09-26  1:09:17 0 [Note] InnoDB: 10.3.32 started; log sequence number 1625479; transaction id 20
2021-09-26  1:09:17 0 [Note] InnoDB: Loading buffer pool(s) from /home/wlad/10.4/xxx/sql/data/ib_buffer_pool
2021-09-26  1:09:17 0 [Note] InnoDB: Buffer pool(s) load completed at 210926  1:09:17
2021-09-26  1:09:17 0 [Note] Plugin 'FEEDBACK' is disabled.
2021-09-26  1:09:17 0 [Note] Server socket created on IP: '::'.
2021-09-26  1:09:17 0 [Note] Reading of all Master_info entries succeeded
2021-09-26  1:09:17 0 [Note] Added new Master_info '' to hash table
2021-09-26  1:09:17 0 [Note] ./mysqld: ready for connections.
Version: '10.3.32-MariaDB'  socket: '/tmp/mysql.sock'  port: 3306  Source distribution
 
wlad@workpc:~/10.4/xxx/sql$ ../client/mysqldump -uroot --routines netmon --protocol=tcp > dump.out
wlad@workpc:~/10.4/xxx/sql$ ls -l dump.out
-rw-r--r-- 1 wlad wlad 3966721 Sep 26 01:10 dump.out
wlad@workpc:~/10.4/xxx/sql$ ../client/mysql -uroot -e shutdown
wlad@workpc:~/10.4/xxx/sql$ 2021-09-26  1:11:20 0 [Note] ./mysqld (initiated by: root[root] @ localhost []): Normal shutdown
2021-09-26  1:11:20 0 [Note] Event Scheduler: Purging the queue. 0 events
2021-09-26  1:11:20 0 [Note] InnoDB: FTS optimize thread exiting.
2021-09-26  1:13:36 0 [Note] InnoDB: Starting shutdown...
2021-09-26  1:13:36 0 [Note] InnoDB: Dumping buffer pool(s) to /home/wlad/10.4/xxx/sql/data/ib_buffer_pool
2021-09-26  1:13:36 0 [Note] InnoDB: Buffer pool(s) dump completed at 210926  1:13:36
2021-09-26  1:13:39 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2021-09-26  1:13:39 0 [Note] InnoDB: Shutdown completed; log sequence number 1625488; transaction id 21
2021-09-26  1:13:39 0 [Note] ./mysqld: Shutdown complete
 
==10737==
==10737== HEAP SUMMARY:
==10737==     in use at exit: 288 bytes in 1 blocks
==10737==   total heap usage: 57,614 allocs, 57,613 frees, 1,066,096,426 bytes allocated
==10737==
==10737== 288 bytes in 1 blocks are possibly lost in loss record 1 of 1
==10737==    at 0x4837B65: calloc (vg_replace_malloc.c:752)
==10737==    by 0x40116D1: allocate_dtv (dl-tls.c:286)
==10737==    by 0x401203D: _dl_allocate_tls (dl-tls.c:532)
==10737==    by 0x5157B95: allocate_stack (allocatestack.c:621)
==10737==    by 0x5157B95: pthread_create@@GLIBC_2.2.5 (pthread_create.c:669)
==10737==    by 0x87A066: inline_mysql_thread_create (mysql_thread.h:1275)
==10737==    by 0x87A066: innodb_init(void*) (ha_innodb.cc:4316)
==10737==    by 0x71B322: ha_initialize_handlerton(st_plugin_int*) (handler.cc:549)
==10737==    by 0x5612DA: plugin_initialize(st_mem_root*, st_plugin_int*, int*, char**, bool) (sql_plugin.cc:1433)
==10737==    by 0x562409: plugin_init(int*, char**, int) (sql_plugin.cc:1715)
==10737==    by 0x4A5CB1: init_server_components() (mysqld.cc:5467)
==10737==    by 0x4AC6A4: mysqld_main(int, char**) (mysqld.cc:6087)
==10737==    by 0x519409A: (below main) (libc-start.c:308)
==10737==
==10737== LEAK SUMMARY:
==10737==    definitely lost: 0 bytes in 0 blocks
==10737==    indirectly lost: 0 bytes in 0 blocks
==10737==      possibly lost: 288 bytes in 1 blocks
==10737==    still reachable: 0 bytes in 0 blocks
==10737==         suppressed: 0 bytes in 0 blocks
==10737==
==10737== For counts of detected and suppressed errors, rerun with: -v
==10737== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
 
[1]+  Done                    valgrind --leak-check=full ./mysqld --datadir=/home/wlad/10.4/xxx/sql/data

However, in terms of virtual memory, pmap shows about 500MB more usage, which does not grow if dump is repeated

wlad@workpc:~/10.4/xxx/sql$ pmap `pidof mysqld` |grep total
 total          1760496K
wlad@workpc:~/10.4/xxx/sql$ ../client/mysqldump -uroot --routines netmon --protocol=tcp > dump.out
wlad@workpc:~/10.4/xxx/sql$ pmap `pidof mysqld` |grep total
 total          2350320K
wlad@workpc:~/10.4/xxx/sql$ ../client/mysqldump -uroot --routines netmon --protocol=tcp > dump.out
wlad@workpc:~/10.4/xxx/sql$ pmap `pidof mysqld` |grep total
 total          2350320K

That thing is the supposedly libc heap growth, and whether it is possible to ever get this memory back, and whether it is possible to force it to allocate less, or maybe there is a more compact allocator, I do not really know.

So, matze, is there still a problem we need to investigate? maybe you can try some recent version of mariadb, and be happy with results.

Comment by Vladislav Vaintroub [ 2021-09-27 ]

Alright, of course the routines to be dumped are parsed for SHOW CREATE PROCEDURE, and stored in a per-connection cache, which is a problem for mysqldump, if there is a lot of routines, and they are large. For the 80 stored procedures, found in Memory measurement.zip , I measured the overhead of SHOW CREATE, using umdh tool, on Windows.

The raw data is in umdh.zip , also containing callstacks of the allocation. The measurement is taken for the loop in mysqldump, that does "SHOW CREATE PROCEDURE".

Note, that some of those procedures do not work in 10.4 (server would complain about undeclared in_User variable).

I think that a workaround, for matze would be to SET GLOBAL stored_program_cache=1, at least for duration of mysqldump.

And a fix for this case, would be to bypass caching stored procedures in SHOW CREATE - the very reason for cache is to speedup SP execution, so caching during the execution only should be fine.

And a general remark, since this per-connection cache can grow almost uncontrollably, and SP can easily take several MB, maybe the decision on whether to cache, should not be based on number of stored procedures, but on size, in bytes, of parsed SPs.

Comment by Vladislav Vaintroub [ 2021-09-27 ]

sanja, please review bb-10.3-wlad-MDEV-20699

Comment by Oleksandr Byelkin [ 2021-09-28 ]

OK to push

Generated at Thu Feb 08 09:01:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.