  1. MariaDB Server
  2. MDEV-25633

MariaDB crashes when compiled with link time optimizations

Details

    Description

      Following the release of MariaDB 10.5.10 on Hirsute, we have discovered that appending -flto and -ffat-lto-objects as compile flags will cause MariaDB to crash with SIGABRT in pthread_exit when closing the replication slave thread.

      This happens regardless of whether MariaDB is compiled with PERFSCHEMA.

      Steps to reproduce:

      Set up an Ubuntu 21.04 docker container or VM.

      Run

      debian/autobake-deb.sh
      

      or

      dpkg-buildpackage --build=binary
      

      from the base server directory.

      Notice the extra -flto and -ffat-lto-objects flags being passed during compilation (and linking).

      Run any replication test such as rpl_sp

      #1  0x00007f5831212864 in __GI_abort () at abort.c:79
      #2  0x000055ffc77d5bc1 in _Unwind_SetGR.cold ()
      #3  0x000055ffc7fa575d in __gcc_personality_v0 ()
      #4  0x00007f5830fb1604 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
      #5  0x00007f5830fb1cf2 in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1
      #6  0x00007f5831c76d46 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:131
      #7  0x00007f5831c6e732 in __do_cancel () at pthreadP.h:307
      #8  __pthread_exit (value=value@entry=0x0) at pthread_exit.c:28
      #9  0x000055ffc784b1b3 in handle_slave_sql (arg=0x55ffcb161d00) at ./sql/slave.cc:5298
      #10 0x00007f5831c6d450 in start_thread (arg=0x7f58240dd640) at pthread_create.c:473
      #11 0x00007f5831303d53 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
      

          Activity

            cvicentiu Vicențiu Ciorbaru created issue -
            cvicentiu Vicențiu Ciorbaru made changes -
            Field Original Value New Value
            Priority Major [ 3 ] Critical [ 2 ]
            cvicentiu Vicențiu Ciorbaru made changes -
            Assignee Vicențiu Ciorbaru [ cvicentiu ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]

            serg Sergei Golubchik added a comment -

            Just FYI (want to write it down somewhere before I forget):

            • flags are added by dpkg-buildflags
            • it's a Perl script; -flto comes from the Dpkg::Vendor::Debian module (/usr/share/perl5/Dpkg/Vendor/Debian.pm), where it's hard-coded
            • but it checks the exception list, /usr/share/lto-disabled-list/lto-disabled-list
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 121711 ] MariaDB v4 [ 142840 ]
            serg Sergei Golubchik made changes -
            Assignee Vicențiu Ciorbaru [ cvicentiu ] Sergei Golubchik [ serg ]

            marko Marko Mäkelä added a comment -

            If link-time optimization generates invalid code due to undefined behavior, then MDEV-26272 (affecting replication) would be a good candidate. While GCC’s -fsanitize=undefined does not complain much nowadays, clang’s would cause the server to crash on startup due to MDEV-26272.
            marko Marko Mäkelä made changes -
            danblack Daniel Black added a comment -

            LTO removing the exception catching code: https://bugs.launchpad.net/ubuntu/+source/mariadb-10.6/+bug/1970634 resulting in assertion (MDEV-28441)

            danblack Daniel Black made changes -
            danblack Daniel Black made changes -
            danblack Daniel Black added a comment -

            I ran mtr tests on quay.io/mariadb-foundation/mariadb-devel:10.8, which is based on our Ubuntu jammy builds (like MDBF-453), and got consistent crashes when stopping the slave thread, as shown here.

            If we can't find a solution, maybe we could pull in https://git.launchpad.net/ubuntu/+source/mariadb-10.6/commit/?id=ae532f091e888f9302d2a5f3aad4c0b74521d158 before the release (10.6+, as Ubuntu packages 10.6 on 22.04).
            danblack Daniel Black added a comment -

            Unable to reproduce with clang-14.0.0 / gcc-12.1.1 (fc36) with CMAKE_C{,XX}_FLAGS and CMAKE_LINKER_FLAGS containing -flto.

            Only gcc supported -ffat-lto-objects.
            danblack Daniel Black made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.2 [ 14601 ]
            danblack Daniel Black made changes -
            serg Sergei Golubchik made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            serg Sergei Golubchik made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.3 [ 22126 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]

            serg Sergei Golubchik added a comment -

            There are many issues related to LTO. The fix for the replication crash could be

            MDEV-25633 MariaDB crashes when compiled with link time optimizations
             
            when compiled with gcc -flto -ffat-lto-objects replication was crashing
            when stopping the slave sql thread, during the stack unwinding on the
            pthread_exit(0) call.
             
            While the actual reason for the crash is unclear, pthread_exit() is
            a complex function that throws an exception to properly unwind the
            stack and call all necessary destructors.
             
            There is no need to do that at the end of the thread start_routine
            where a simple return(0) will suffice (man pthread_create).
             
            diff --git a/sql/slave.cc b/sql/slave.cc
            --- a/sql/slave.cc
            +++ b/sql/slave.cc
            @@ -5645,7 +5645,6 @@ pthread_handler_t handle_slave_sql(void *arg)
               DBUG_LEAVE;                                   // Must match DBUG_ENTER()
               my_thread_end();
               ERR_remove_state(0);
            -  pthread_exit(0);
               return 0;                                     // Avoid compiler warnings
             }
            

            Unfortunately, there are lots of places that throw exceptions (in oqgraph and columnstore) and crash with LTO, and they cannot be fixed like the above.
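The commit message above relies on the documented equivalence between returning from a thread start routine and calling pthread_exit() with the same value. The following is a minimal sketch of that equivalence, not MariaDB code: worker_returning, worker_pthread_exit and run_and_join are hypothetical names introduced only for illustration. In an ordinary build both workers hand the same value to pthread_join(); only the second one goes through the forced stack unwinding that shows up in the backtrace.

```cpp
#include <pthread.h>
#include <cassert>

// Hypothetical start routine that simply returns. Per pthread_create(3),
// this has the same effect as calling pthread_exit() with the returned
// value, but it does not go through pthread_exit()'s forced unwinding.
static void *worker_returning(void *arg)
{
  return arg;
}

// Hypothetical start routine that ends with an explicit pthread_exit(),
// the pattern removed from handle_slave_sql() in the diff above.
static void *worker_pthread_exit(void *arg)
{
  pthread_exit(arg);   // never returns; unwinds the thread's stack
}

// Start `routine` in a new thread, wait for it, and return the value
// that the thread handed back.
static void *run_and_join(void *(*routine)(void *), void *arg)
{
  pthread_t tid;
  void *result = nullptr;
  int rc = pthread_create(&tid, nullptr, routine, arg);
  assert(rc == 0);
  rc = pthread_join(tid, &result);
  assert(rc == 0);
  (void) rc;           // silence unused warning when NDEBUG is defined
  return result;
}
```

Calling run_and_join() with either worker and the address of a local variable should give back that same address, which is why dropping the explicit pthread_exit(0) does not change what a joiner observes.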
            serg Sergei Golubchik made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.7 [ 24805 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.8 [ 26121 ]
            otto Otto Kekäläinen made changes -
            Fix Version/s 10.11 [ 27614 ]

            otto Otto Kekäläinen added a comment -

            This issue still exists. Filed https://bugs.launchpad.net/ubuntu/+source/mariadb/+bug/2038500 to track this and to remember to remove the workaround eventually.
            knielsen Kristian Nielsen added a comment - - edited

            From the stacktrace in the description, this looks similar to MDEV-32251:

            #1  0x00007f5831212864 in __GI_abort () at abort.c:79
            #2  0x000055ffc77d5bc1 in _Unwind_SetGR.cold ()
            #3  0x000055ffc7fa575d in __gcc_personality_v0 ()
            #4  0x00007f5830fb1604 in ?? () from /lib/x86_64-linux-gnu/libgcc_s.so.1
            #5  0x00007f5830fb1cf2 in _Unwind_ForcedUnwind () from /lib/x86_64-linux-gnu/libgcc_s.so.1
            #6  0x00007f5831c76d46 in __GI___pthread_unwind (buf=<optimized out>) at unwind.c:131
            #7  0x00007f5831c6e732 in __do_cancel () at pthreadP.h:307
            #8  __pthread_exit (value=value@entry=0x0) at pthread_exit.c:28
            #9  0x000055ffc784b1b3 in handle_slave_sql (arg=0x55ffcb161d00) at ./sql/slave.cc:5298
            

            We see that __pthread_exit() goes through dynamic libgcc_s.so, but the
            functions __gcc_personality_v0 and _Unwind_SetGR.cold are inside the
            mariadbd binary. We can see the mariadbd symbols live in 0x000055ff... while
            the libgcc_s.so symbols live in 0x00007f58...

            And in cmake/build_configurations/mysql_release.cmake I see that it uses
            -static-libgcc:

                SET(COMMON_CXX_FLAGS               "-g -static-libgcc -fno-omit-frame-pointer -fno-strict-aliasing -Wno-uninitialized")
            

            So the code crashes exactly at the place where the dynamic libgcc code calls
            into what seems to be statically linked libgcc. So this seems to be the
            likely problem in this case also.

            Is there still a way to reproduce this? If so, try removing the 4 occurrences
            of -static-libgcc from cmake/build_configurations/mysql_release.cmake and
            see if it solves the problem. The dependency on LTO might be only
            accidental.

            It doesn't seem correct to use -static-libgcc; static linking has been
            problematic for many years. I think it should be removed in any case.

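The address-range reasoning in the comment above can be sketched as a lookup of each frame address against the process's module mappings. This is only an illustration: Mapping, owner_of and example_mappings are hypothetical names, and the two ranges are assumptions chosen to bracket the addresses in the backtrace. The real ranges would have to be read from /proc/<pid>/maps or from gdb's "info proc mappings".

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// One loaded module and the half-open address range [start, end) it occupies.
struct Mapping {
  std::uint64_t start;
  std::uint64_t end;
  std::string name;
};

// Return the name of the module whose range contains addr, or "?" if no
// listed module contains it.
std::string owner_of(std::uint64_t addr, const std::vector<Mapping> &maps)
{
  for (const Mapping &m : maps) {
    assert(m.start < m.end);
    if (addr >= m.start && addr < m.end)
      return m.name;
  }
  return "?";
}

// HYPOTHETICAL ranges, not taken from the issue: they merely bracket the
// 0x000055ff... and 0x00007f5830fb... addresses seen in the backtrace.
std::vector<Mapping> example_mappings()
{
  return {
    {0x000055ffc7000000ULL, 0x000055ffc9000000ULL, "mariadbd"},
    {0x00007f5830fa0000ULL, 0x00007f5830fc0000ULL, "libgcc_s.so.1"},
  };
}
```

With these assumed ranges, frame #3 (__gcc_personality_v0 at 0x000055ffc7fa575d) resolves to the server binary, while frame #5 (_Unwind_ForcedUnwind at 0x00007f5830fb1cf2) resolves to the shared libgcc. That is the mix of static and dynamic unwinder code the comment points out.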
            danblack Daniel Black made changes -

            serg Sergei Golubchik added a comment -

            Thanks, knielsen. Without -static-libgcc it doesn't crash for me. There were a few compilation failures, though, which were easy to fix. Otherwise it appears to work now.

            serg Sergei Golubchik added a comment -

            cvicentiu, please see commits

            259233e2e94 don't disable lto in DEB builds
            f1644d8d17a MDEV-25633 MariaDB crashes when compiled with link time optimizations
            24a276256ce better disable lto for libmysqld_exports.cc
            475c39cdbfc C/C compilation failures under -flto

            (including commits inside 475c39cdbfc)
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Vicențiu Ciorbaru [ cvicentiu ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            otto Otto Kekäläinen added a comment -

            https://bugs.launchpad.net/ubuntu/+source/mariadb/+bug/2038500 I tested again, building on Ubuntu 24.04 Noble with the latest dependencies at https://launchpadlibrarian.net/728345044/buildlog_ubuntu-noble-amd64.mariadb_1%3A10.11.7-5~bpo24.04.1~1715052025.97428f7b341+debian.latest_BUILDING.txt.gz, and it is still failing on this bug.
            danblack Daniel Black added a comment -

            So which do you think is incorrect: MariaDB, or the compiler/linker generating incorrect code?

            knielsen Kristian Nielsen added a comment -

            The build log referenced by Otto two comments back still uses -static-libgcc in the compile, so that is presumably why it is still failing.
            serg Sergei Golubchik made changes -
            Assignee Vicențiu Ciorbaru [ cvicentiu ] Daniel Black [ danblack ]

            serg Sergei Golubchik added a comment -

            Cherry-picked into 10.6, new commit hashes:

            397762d4e2e don't disable lto in DEB builds
            ea07244d444 MDEV-25633 MariaDB crashes when compiled with link time optimizatio>
            b6cd03d7b04 better disable lto for libmysqld_exports.cc
            2ea6c73a4d2 C/C compilation failures under -flto
            

            branch bb-10.6-MDEV-25633
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 112319
            danblack Daniel Black added a comment -

            Looks good to me. Go ahead.

            danblack Daniel Black made changes -
            Assignee Daniel Black [ danblack ] Sergei Golubchik [ serg ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            serg Sergei Golubchik made changes -
            Status Stalled [ 10000 ] In Testing [ 10301 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.6.20 [ 29903 ]
            Fix Version/s 10.11.10 [ 29904 ]
            Fix Version/s 11.2.6 [ 29906 ]
            Fix Version/s 11.4.4 [ 29907 ]
            Fix Version/s 11.6.2 [ 29908 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.11 [ 27614 ]
            Resolution Fixed [ 1 ]
            Status In Testing [ 10301 ] Closed [ 6 ]
            danblack Daniel Black made changes -

            marko Marko Mäkelä added a comment -

            Was this really fixed in 11.2.6, 11.4.4 and 11.6.2? What is the impact of this revert due to MCOL-5819? Does it only affect Ubuntu packages?

            serg Sergei Golubchik added a comment -

            It was fixed, in the sense that it doesn't crash (except for MCOL-5819).

            It looks like only Ubuntu and derivatives are affected, but I think it'd be safer to disable LTO for ColumnStore unconditionally; I'll try that.

            People

              serg Sergei Golubchik
              cvicentiu Vicențiu Ciorbaru