Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15607

mysqld crashed few after node is being joined with sst

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.3.4, 10.1(EOL), 10.2(EOL)
    • 10.1.35, 10.2.17, 10.3.8
    • Galera SST
    • None
    • CentOS 7.4

    Description

      mysqld crashed few after node is being joined with sst

      join Node to galera cluster loaded with ~12G data and set wsrep_sst_method=mariabackup

      1. service mysql start --wsrep_cluster_address=gcomm://192.168.104.195
        Starting mysql (via systemctl): Job for mariadb.service failed because a timeout was exceeded. See "systemctl status mariadb.service" and "journalctl -xe" for details.
        [FAILED]

       systemctl status mariadb.service
      ● mariadb.service - MariaDB 10.3.4 database server
         Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/mariadb.service.d
                 └─migrated-from-my.cnf-settings.conf
         Active: failed (Result: timeout) since Tue 2018-03-20 16:34:15 EET; 1h 7min ago
           Docs: man:mysqld(8)
                 https://mariadb.com/kb/en/library/systemd/
        Process: 16512 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)
        Process: 16509 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
       Main PID: 16601
         CGroup: /system.slice/mariadb.service
      
      

       joiner: => Rate:[42.6MiB/s] Avg:[37.3MiB/s] Elapsed:0:03:10
       
      2018-03-20 16:36:09 0 [Note] WSREP: 1.0 (t4w5): State transfer to 0.0 (t4w3) complete.
      2018-03-20 16:36:09 0 [Note] WSREP: Member 1.0 (t4w5) synced with group.
         joiner: => Rate:[37.8MiB/s] Avg:[37.8MiB/s] Elapsed:0:04:55
      WSREP_SST: [INFO] NOTE: Joiner-Recv-SST took 297 seconds (20180320 16:36:14.043)
      WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql//.sst (20180320 16:36:14.053)
      WSREP_SST: [INFO] Evaluating mariabackup --innobackupex   --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20180320 16:36:14.058            )
       
      ~~
       
      2018-03-20 16:36:16 0 [ERROR] WSREP: Trying to launch slave threads before creating connection at 'gcomm://192.168.104.191,192.168.104.193,192.16            8.104.195,192.168.104.196'
      mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.4/sql/wsrep_thd.cc:446: void wsrep_create_appliers(lo            ng int): Assertion `0' failed.
      180320 16:36:16 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.
       
      Server version: 10.3.4-MariaDB
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=0
      max_threads=153
      thread_count=6
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467345 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x555fbe5edede]
      /usr/sbin/mysqld(handle_fatal_signal+0x357)[0x555fbe069027]
      /lib64/libpthread.so.0(+0xf5e0)[0x7f9a97af55e0]
      /lib64/libc.so.6(gsignal+0x37)[0x7f9a960021f7]
      /lib64/libc.so.6(abort+0x148)[0x7f9a960038e8]
      /lib64/libc.so.6(+0x2e266)[0x7f9a95ffb266]
      /lib64/libc.so.6(+0x2e312)[0x7f9a95ffb312]
      /usr/sbin/mysqld(+0x6cd43f)[0x555fbdfe843f]
      /usr/sbin/mysqld(_Z11mysqld_mainiPPc+0xd50)[0x555fbddff140]
      /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9a95feec05]
      /usr/sbin/mysqld(+0x4d736d)[0x555fbddf236d]
      The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
      information that should help you find out what is causing the crash.
      
      

      Attachments

        Issue Links

          Activity

            Yurchenko Alexey added a comment -

            This is a typical systemd response - it tries to shutdown the joiner (due to "timeout") before the joiner manages to complete SST:

            2018-03-20 16:36:15 0 [Note] /usr/sbin/mysqld (unknown): Normal shutdown

            The assert happens due to a race between the ongoing startup and systemd-initialted shutdown. So it is a matter of fixing the systemd script.

            Yurchenko Alexey added a comment - This is a typical systemd response - it tries to shutdown the joiner (due to "timeout") before the joiner manages to complete SST: 2018-03-20 16:36:15 0 [Note] /usr/sbin/mysqld (unknown): Normal shutdown The assert happens due to a race between the ongoing startup and systemd-initialted shutdown. So it is a matter of fixing the systemd script.

            It seems that wsrep_view_handler_cb() could add calls to sd_notifyf("EXTEND_TIMEOUT_USEC=…") to extend the startup timeout. Alternatively, wsrep_SE_init_wait() could call mysql_cond_timedwait() instead of mysql_cond_wait() in the loop and keep extending the timeout.

            marko Marko Mäkelä added a comment - It seems that wsrep_view_handler_cb() could add calls to sd_notifyf("EXTEND_TIMEOUT_USEC=…") to extend the startup timeout. Alternatively, wsrep_SE_init_wait() could call mysql_cond_timedwait() instead of mysql_cond_wait() in the loop and keep extending the timeout.

            on 10.2 joiner: => Rate:[ 39MiB/s] Avg:[32.9MiB/s] Elapsed:0:01:20 ---- SST failed to complete :Interrupted system call
            on 10.3 joiner: => Rate:[37.8MiB/s] Avg:[37.8MiB/s] Elapsed:0:04:55 ---- SST completed but few after mysqld crashed

            winstone Zdravelina Sokolovska (Inactive) added a comment - on 10.2 joiner: => Rate:[ 39MiB/s] Avg: [32.9MiB/s] Elapsed:0:01:20 ---- SST failed to complete :Interrupted system call on 10.3 joiner: => Rate: [37.8MiB/s] Avg: [37.8MiB/s] Elapsed:0:04:55 ---- SST completed but few after mysqld crashed
            winstone Zdravelina Sokolovska (Inactive) added a comment - - edited note : on 10.2 with the same setup and data it's come to the problem MDEV-15606 https://jira.mariadb.org/browse/MDEV-15606?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=109584#comment-109584

            Why 10.3 needs 4:55 when 10.2 completes SST in 1:20? Why 10.3 is 3.5 times slower?

            serg Sergei Golubchik added a comment - Why 10.3 needs 4:55 when 10.2 completes SST in 1:20? Why 10.3 is 3.5 times slower?
            jplindst Jan Lindström (Inactive) added a comment - https://github.com/MariaDB/server/commit/48e3b4ca5dd6a6cffbee64381dc301d43c66e036

            Looks as if timeouts still happen on server versions that should have this fixed according to version info above. SystemD timeouts during startup still happened on 10.2.18 and 10.3.9 servers. See MDEV-17571

            hholzgra Hartmut Holzgraefe added a comment - Looks as if timeouts still happen on server versions that should have this fixed according to version info above. SystemD timeouts during startup still happened on 10.2.18 and 10.3.9 servers. See MDEV-17571

            People

              jplindst Jan Lindström (Inactive)
              winstone Zdravelina Sokolovska (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.