[MDEV-33362] stop slave in 10.6.12 with tcmalloc added causes crash Created: 2024-02-02  Updated: 2024-02-07

Status: Needs Feedback
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.6.12
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Bruno Bear Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None
Environment:

Docker on Debian 11
10.6.12-MariaDB-1:10.6.12+maria~ubu2004-log source revision: 4c79e15cc3716f69c044d4287ad2160da8101cdc


Attachments: error-log-more.txt, errorlog.txt

 Description   

After having memory issues, we added tcmalloc to the official MariaDB 10.6.12 Docker image and restarted a slave server with it. This is how we added tcmalloc:

RUN apt-get update && apt-get -y install google-perftools
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4

The slave was running fine until we used "stop slave" to switch to a new master. This resulted in an immediate crash: "[ERROR] mysqld got signal 6 ;"
We repeated this, and every time we ran "stop slave", MariaDB crashed.
After that we used the image without tcmalloc, and "stop slave" worked again.
The error log is attached.



 Comments   
Comment by Sergei Golubchik [ 2024-02-02 ]

1. Could you show a few more lines from the log, before the "mysqld got signal 6"? If it's an assert, they'll tell exactly which condition was false.

2. Could you also install binutils? It includes addr2line, so you'll get rid of "Printing to addr2line failed" and instead get a resolved stack trace with file names and line numbers.
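The two requests above can be sketched as a pair of small shell helpers. This is a minimal sketch, not part of the original report: the function names `lines_before_signal` and `have_addr2line` are hypothetical, the default of 30 context lines is arbitrary, and GNU grep is assumed.

```shell
#!/bin/sh
# Hypothetical helper: print the lines that precede the first
# "got signal" crash marker in an error log, plus the marker line itself.
# Usage: lines_before_signal LOGFILE [COUNT]
lines_before_signal() {
    logfile=$1
    count=${2:-30}   # number of context lines; 30 is an arbitrary default
    # -m 1 stops at the first match; -B prints COUNT lines of leading context.
    grep -m 1 -B "$count" 'got signal' "$logfile"
}

# Hypothetical helper: succeed if addr2line (shipped in binutils) is on PATH.
have_addr2line() {
    command -v addr2line >/dev/null 2>&1
}
```

For example, `lines_before_signal errorlog.txt 50` would print up to 50 lines before the crash marker. On a Debian or Ubuntu based image, binutils can be installed with `apt-get update && apt-get -y install binutils`, after which `have_addr2line` should succeed inside the container.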

Comment by Kristian Nielsen [ 2024-02-02 ]

This crash/stacktrace looks very similar to those in MDEV-25633 and MDEV-32251.
It's a crash inside the system libraries (libpthread/libstdc++/libgcc) during final stack unwinding in pthread_exit().

In those other MDEVs, the problem was the mixing of different versions of the same library, because libraries were linked both statically and dynamically. That does not appear to be the case here: the entire stack trace seems to be within the standard system .so dynamic libraries. So while the symptoms seem obviously related, I cannot immediately think of a common root cause.

There might be some conflict between libtcmalloc and the system libraries, but again I cannot immediately think what it could be. It also seems that only standard dynamic .so system libraries are used, which should hopefully be compatible (unless google-perftools is from a 3rd-party repo?).

I wonder if there is something wrong in the way the MariaDB code uses threads (or <something>) around this that makes it particularly fragile to assertions inside libunwind... but I have no concrete ideas, unfortunately. Maybe Sergei's suggestion can give a few more hints as to what could be wrong.

Comment by Bruno Bear [ 2024-02-02 ]

I will try to deliver what Sergei asked for as soon as our master/slave setup is up again.
I don't think we are using a 3rd-party repo; we just use the official MariaDB Docker image and install tcmalloc with "apt-get update && apt-get -y install google-perftools".
The result of "SHOW VARIABLES LIKE '%malloc%'" is: "tcmalloc gperftools 2.7"
I don't know if it helps, but this also looks very similar:
https://bugs.launchpad.net/ubuntu/+source/mariadb-10.6/+bug/1979695

Comment by Bruno Bear [ 2024-02-06 ]

I started the slave again with the tcmalloc image, used "stop slave", and it crashed.
I uploaded everything that was in the logs.
I also installed binutils and checked that it is installed, but I still get "Printing to addr2line failed".

Comment by Bruno Bear [ 2024-02-06 ]

I found a solution.
First I checked whether we had messed something up in our custom image, so I used the original image and installed google-perftools after the container was started. That didn't change anything: same problem.
I saw that the google-perftools package does more than just add that library, so I looked for a smaller installation.
I started a new container, but this time I installed "libtcmalloc-minimal4" instead of "google-perftools".
I changed ENV to "LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc_minimal.so.4", and now we don't get a crash after "stop slave".
Is "libtcmalloc_minimal" enough for MariaDB?
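When switching between the two packages, it can help to confirm which library the running server actually loaded. The following is a minimal sketch, not something from the original report: the function name `tcmalloc_variant` is hypothetical, and it only inspects a file in the format of Linux's `/proc/PID/maps` for the two library names mentioned in this issue.

```shell
#!/bin/sh
# Hypothetical helper: report which tcmalloc variant appears in a
# /proc/PID/maps-style file. Prints "minimal", "full", or "none".
# Usage: tcmalloc_variant MAPS_FILE
tcmalloc_variant() {
    maps_file=$1
    if grep -q 'libtcmalloc_minimal\.so' "$maps_file"; then
        echo minimal   # libtcmalloc_minimal.so.4 (package libtcmalloc-minimal4)
    elif grep -q 'libtcmalloc\.so' "$maps_file"; then
        echo full      # libtcmalloc.so.4 (installed here via google-perftools)
    else
        echo none      # neither library is mapped into the process
    fi
}
```

Inside the container this could be called as `tcmalloc_variant /proc/PID/maps`, where PID is the process ID of the MariaDB server. Reading another process's maps file generally requires running as the same user or as root.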

Comment by Sergei Golubchik [ 2024-02-06 ]

Well, the idea was to use any other memory allocator, not the system one, because it'll use a different memory allocation pattern. So yes, I suppose "libtcmalloc_minimal" is different enough; let's see if it helps.

Comment by Kristian Nielsen [ 2024-02-06 ]

Thanks for the additional stacktrace. From this, we can see that it crashes in __cxa_get_globals, which means it is different from the problem seen in MDEV-25633 and MDEV-32251.

A web search turns up some discussions that suggest a problem with thread-local storage, but I wasn't really able to get any idea of what the root cause could be, unfortunately.

But good that you found a solution with a different libtcmalloc.

Comment by Bruno Bear [ 2024-02-07 ]

I only found a solution for "stop slave", because that was the only error I got after one day. So far I don't know if MariaDB is stable in general with "libtcmalloc_minimal" or if it just fixed that one command.

Generated at Thu Feb 08 10:38:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.