[MDEV-25633] MariaDB crashes when compiled with link time optimizations Created: 2021-05-10  Updated: 2023-11-15

Status: In Review
Project: MariaDB Server
Component/s: Compiling, Replication
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6
Fix Version/s: 10.6, 10.11

Type: Bug Priority: Major
Reporter: Vicențiu Ciorbaru Assignee: Vicențiu Ciorbaru
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Duplicate
is duplicated by MDEV-28946 STOP SLAVE or slave errors (ex 1062, ... Closed
is duplicated by MDEV-29229 Server crashes every time the slave S... Closed
Relates
relates to MDEV-32251 Crash in stack unwinding during pthre... Closed

 Description   

Following the release of MariaDB 10.5.10 on Ubuntu Hirsute (21.04), we discovered that adding -flto and -ffat-lto-objects to the compile flags causes MariaDB to crash with SIGABRT in pthread_exit when closing the replication slave thread.

This happens regardless of whether MariaDB is compiled with PERFSCHEMA.

Steps to reproduce:

Set up an Ubuntu 21.04 docker container or VM.

Run

debian/autobake-deb.sh

or

dpkg-buildpackage --build=binary

from the base server directory.

Notice the extra -flto and -ffat-lto-objects flags being passed during compilation (and linking).

Run any replication test, such as rpl_sp. The server aborts with the following backtrace:

#1  0x00007f5831212864 in __GI_abort () at abort.c:79
#2  0x000055ffc77d5bc1 in _Unwind_SetGR.cold ()
#3  0x000055ffc7fa575d in __gcc_personality_v0 ()
#4  0x00007f5830fb1604 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#5  0x00007f5830fb1cf2 in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#6  0x00007f5831c76d46 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:131
#7  0x00007f5831c6e732 in __do_cancel () at pthreadP.h:307
#8  __pthread_exit (value=value@entry=0x0) at pthread_exit.c:28
#9  0x000055ffc784b1b3 in handle_slave_sql (arg=0x55ffcb161d00) at ./sql/slave.cc:5298
#10 0x00007f5831c6d450 in start_thread (arg=0x7f58240dd640) at pthread_create.c:473
#11 0x00007f5831303d53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95



 Comments   
Comment by Sergei Golubchik [ 2021-05-10 ]

just FYI (want to write it down somewhere before I forget):

  • flags are added by dpkg-buildflags
  • it's a Perl script; -flto comes from the Dpkg::Vendor::Debian module (/usr/share/perl5/Dpkg/Vendor/Debian.pm), where it is hard-coded
  • but it checks the exception list, /usr/share/lto-disabled-list/lto-disabled-list
Comment by Marko Mäkelä [ 2022-04-26 ]

If link-time optimization generates invalid code because of undefined behavior, then MDEV-26272 (affecting replication) would be a good candidate. While GCC’s -fsanitize=undefined does not complain much nowadays, clang’s causes the server to crash on startup due to MDEV-26272.

Comment by Daniel Black [ 2022-04-29 ]

LTO removes the exception-catching code (https://bugs.launchpad.net/ubuntu/+source/mariadb-10.6/+bug/1970634), resulting in an assertion failure (MDEV-28441).

Comment by Daniel Black [ 2022-07-22 ]

I ran mtr tests on the quay.io/mariadb-foundation/mariadb-devel:10.8 image, which is based on our Ubuntu jammy builds (like MDBF-453), and got consistent crashes when stopping the slave thread, as shown here.

If we can't find a solution, maybe we could pull in https://git.launchpad.net/ubuntu/+source/mariadb-10.6/commit/?id=ae532f091e888f9302d2a5f3aad4c0b74521d158 before the release (10.6+, as Ubuntu packages 10.6 on 22.04).

Comment by Daniel Black [ 2022-07-29 ]

Unable to reproduce with clang-14.0.0 / gcc-12.1.1 (fc36) with CMAKE_C{,XX}_FLAGS and CMAKE_LINKER_FLAGS containing -flto.

Only gcc supports -ffat-lto-objects.

Comment by Sergei Golubchik [ 2022-10-22 ]

There are many issues related to lto. The fix for the replication crash could be

MDEV-25633 MariaDB crashes when compiled with link time optimizations
 
when compiled with gcc -flto -ffat-lto-objects replication was crashing
when stopping the slave sql thread, during the stack unwinding on the
pthread_exit(0) call.
 
While the actual reason for the crash is unclear, pthread_exit() is
a complex function that throws an exception to properly unwind the
stack and call all necessary destructors.
 
There is no need to do that at the end of the thread start_routine
where a simple return(0) will suffice (man pthread_create).
 
diff --git a/sql/slave.cc b/sql/slave.cc
--- a/sql/slave.cc
+++ b/sql/slave.cc
@@ -5645,7 +5645,6 @@ pthread_handler_t handle_slave_sql(void *arg)
   DBUG_LEAVE;                                   // Must match DBUG_ENTER()
   my_thread_end();
   ERR_remove_state(0);
-  pthread_exit(0);
   return 0;                                     // Avoid compiler warnings
 }
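
The point made in the commit message, that returning from the thread start routine delivers the same result as pthread_exit() without going through the unwinder, can be illustrated with a minimal sketch. The workers, the cleanup handler, and the value 7 below are hypothetical and are not taken from sql/slave.cc; the sketch only assumes the POSIX behaviour described in man pthread_create and man pthread_exit.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical cleanup handler: records that the unwinder reached it. */
static void mark_unwound(void *flag)
{
    *(int *)flag = 1;
}

/* Ends the thread by returning from the start routine, as in the patch. */
static void *returning_worker(void *arg)
{
    (void)arg;
    return (void *)(intptr_t)7;
}

/* Ends the thread with pthread_exit(), which unwinds the stack and runs
   the registered cleanup handler on the way out. */
static void *exiting_worker(void *flag)
{
    pthread_cleanup_push(mark_unwound, flag);
    pthread_exit((void *)(intptr_t)7);
    pthread_cleanup_pop(0);   /* never reached; needed to pair with the push */
    return NULL;
}

/* Runs one worker to completion and returns the value seen by pthread_join(). */
static int run_worker(void *(*start_routine)(void *), void *arg)
{
    pthread_t tid;
    void *result = NULL;
    int rc = pthread_create(&tid, NULL, start_routine, arg);
    assert(rc == 0);
    rc = pthread_join(tid, &result);
    assert(rc == 0);
    return (int)(intptr_t)result;
}
```

Calling run_worker(returning_worker, NULL) and run_worker(exiting_worker, &flag) from a main() should hand the same value to pthread_join() in both cases; only the second call exercises the unwinding path that appears in the backtrace.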

Unfortunately, there are many places that throw exceptions (in oqgraph and columnstore) and crash with LTO, and they cannot be fixed like the above.

Comment by Otto Kekäläinen [ 2023-10-05 ]

This issue still exists. Filed https://bugs.launchpad.net/ubuntu/+source/mariadb/+bug/2038500 to track this and to remember to remove the workaround eventually.

Comment by Kristian Nielsen [ 2023-10-28 ]

From the stacktrace in the description, this looks similar to MDEV-32251:

#1  0x00007f5831212864 in __GI_abort () at abort.c:79
#2  0x000055ffc77d5bc1 in _Unwind_SetGR.cold ()
#3  0x000055ffc7fa575d in __gcc_personality_v0 ()
#4  0x00007f5830fb1604 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#5  0x00007f5830fb1cf2 in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#6  0x00007f5831c76d46 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:131
#7  0x00007f5831c6e732 in __do_cancel () at pthreadP.h:307
#8  __pthread_exit (value=value@entry=0x0) at pthread_exit.c:28
#9  0x000055ffc784b1b3 in handle_slave_sql (arg=0x55ffcb161d00) at ./sql/slave.cc:5298

We see that __pthread_exit() goes through dynamic libgcc_s.so, but the
functions __gcc_personality_v0 and _Unwind_SetGR.cold are inside the
mariadbd binary. We can see the mariadbd symbols live in 0x000055ff... while
the libgcc_s.so symbols live in 0x00007f58...

And in cmake/build_configurations/mysql_release.cmake I see that it uses
-static-libgcc:

    SET(COMMON_CXX_FLAGS               "-g -static-libgcc -fno-omit-frame-pointer -fno-strict-aliasing -Wno-uninitialized")

So the code crashes exactly at the place where the dynamic libgcc code calls
into what seems to be statically linked libgcc. So this seems to be the
likely problem in this case also.
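
The address-range reasoning above can be sketched as a small helper that scans gdb backtrace lines for the point where a shared-library frame calls into the executable. This is only an illustration: the function names are hypothetical, and the 0x7f... test is a heuristic that assumes the usual x86_64 Linux layout, where shared libraries are mapped near 0x7f... and a PIE executable near 0x55...

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* First five frames of the backtrace from the description. */
static const char *const sample_frames[] = {
    "#1  0x00007f5831212864 in __GI_abort () at abort.c:79",
    "#2  0x000055ffc77d5bc1 in _Unwind_SetGR.cold ()",
    "#3  0x000055ffc7fa575d in __gcc_personality_v0 ()",
    "#4  0x00007f5830fb1604 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1",
    "#5  0x00007f5830fb1cf2 in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1",
};

/* Extracts the address from a line like "#2  0x000055ff... in f ()".
   Returns 1 on success, 0 if the line carries no address. */
static int frame_address(const char *line, unsigned long long *addr)
{
    unsigned frame_no;
    return sscanf(line, "#%u 0x%llx", &frame_no, addr) == 2;
}

/* Heuristic (assumption): addresses starting with 0x7f belong to
   shared libraries, everything else to the main executable. */
static int in_shared_library_range(unsigned long long addr)
{
    return (addr >> 40) == 0x7f;
}

/* Returns the index of the first frame that lies in the executable while
   its caller (the next line) lies in a shared library, or -1 if none. */
static int first_library_to_executable_call(const char *const *frames, size_t count)
{
    assert(frames != NULL);
    for (size_t i = 0; i + 1 < count; i++) {
        unsigned long long callee, caller;
        if (!frame_address(frames[i], &callee) ||
            !frame_address(frames[i + 1], &caller))
            continue;
        if (!in_shared_library_range(callee) && in_shared_library_range(caller))
            return (int)i;
    }
    return -1;
}
```

For the sample frames, first_library_to_executable_call(sample_frames, 5) returns index 2, which is frame #3 (__gcc_personality_v0 inside mariadbd) being called from frame #4 in libgcc_s.so.1.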

Is there still a way to reproduce this? If so, try removing the 4 occurrences
of -static-libgcc from cmake/build_configurations/mysql_release.cmake and
see if it solves the problem. The dependency on LTO might be only
accidental.

It doesn't seem correct to use -static-libgcc; static linking has been
problematic for many years. I think it should be removed in any case.

Comment by Sergei Golubchik [ 2023-11-06 ]

Thanks, knielsen. Without -static-libgcc it doesn't crash for me. There were a few compilation failures, though, which were easy to fix. Otherwise it appears to work now.

Comment by Sergei Golubchik [ 2023-11-06 ]

cvicentiu, please see the following commits:

259233e2e94 don't disable lto in DEB builds
f1644d8d17a MDEV-25633 MariaDB crashes when compiled with link time optimizations
24a276256ce better disable lto for libmysqld_exports.cc
475c39cdbfc C/C compilation failures under -flto

(including commits inside 475c39cdbfc)

Generated at Thu Feb 08 09:39:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.