[MDEV-15607] mysqld crashed few after node is being joined with sst Created: 2018-03-20  Updated: 2022-07-06  Resolved: 2018-06-28

Status: Closed
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.1, 10.3.4, 10.2
Fix Version/s: 10.1.35, 10.2.17, 10.3.8

Type: Bug Priority: Critical
Reporter: Zdravelina Sokolovska (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.4


Attachments: Text File logs_joiner.txt    
Issue Links:
Duplicate
duplicates MDEV-15761 mysqld crashed after WSREP ERROR Try... Closed
Relates
relates to MDEV-14705 systemd: EXTEND_TIMEOUT_USEC= to avoi... Closed
relates to MDEV-17934 Make systemd timeout behavior more co... Closed
relates to MDEV-21231 notify systemd of long running SST to... Stalled
relates to MDEV-15466 SST not working after upgrade from 10... Closed
relates to MDEV-17571 Make systemd timeout behavior more co... Closed
relates to MDEV-17934 Make systemd timeout behavior more co... Closed
relates to MDEV-25887 "Got notification message from PID xx... Closed

 Description   

mysqld crashed few after node is being joined with sst

join Node to galera cluster loaded with ~12G data and set wsrep_sst_method=mariabackup

  1. service mysql start --wsrep_cluster_address=gcomm://192.168.104.195
    Starting mysql (via systemctl): Job for mariadb.service failed because a timeout was exceeded. See "systemctl status mariadb.service" and "journalctl -xe" for details.
    [FAILED]

 systemctl status mariadb.service
● mariadb.service - MariaDB 10.3.4 database server
   Loaded: loaded (/usr/lib/systemd/system/mariadb.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: failed (Result: timeout) since Tue 2018-03-20 16:34:15 EET; 1h 7min ago
     Docs: man:mysqld(8)
           https://mariadb.com/kb/en/library/systemd/
  Process: 16512 ExecStartPre=/bin/sh -c [ ! -e /usr/bin/galera_recovery ] && VAR= ||   VAR=`/usr/bin/galera_recovery`; [ $? -eq 0 ]   && systemctl set-environment _WSREP_START_POSITION=$VAR || exit 1 (code=exited, status=0/SUCCESS)
  Process: 16509 ExecStartPre=/bin/sh -c systemctl unset-environment _WSREP_START_POSITION (code=exited, status=0/SUCCESS)
 Main PID: 16601
   CGroup: /system.slice/mariadb.service

 joiner: => Rate:[42.6MiB/s] Avg:[37.3MiB/s] Elapsed:0:03:10
 
2018-03-20 16:36:09 0 [Note] WSREP: 1.0 (t4w5): State transfer to 0.0 (t4w3) complete.
2018-03-20 16:36:09 0 [Note] WSREP: Member 1.0 (t4w5) synced with group.
   joiner: => Rate:[37.8MiB/s] Avg:[37.8MiB/s] Elapsed:0:04:55
WSREP_SST: [INFO] NOTE: Joiner-Recv-SST took 297 seconds (20180320 16:36:14.043)
WSREP_SST: [INFO] Preparing the backup at /var/lib/mysql//.sst (20180320 16:36:14.053)
WSREP_SST: [INFO] Evaluating mariabackup --innobackupex   --apply-log $rebuildcmd ${DATA} &>${DATA}/innobackup.prepare.log (20180320 16:36:14.058            )
 
~~
 
2018-03-20 16:36:16 0 [ERROR] WSREP: Trying to launch slave threads before creating connection at 'gcomm://192.168.104.191,192.168.104.193,192.16            8.104.195,192.168.104.196'
mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.3.4/sql/wsrep_thd.cc:446: void wsrep_create_appliers(lo            ng int): Assertion `0' failed.
180320 16:36:16 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.3.4-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467345 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x555fbe5edede]
/usr/sbin/mysqld(handle_fatal_signal+0x357)[0x555fbe069027]
/lib64/libpthread.so.0(+0xf5e0)[0x7f9a97af55e0]
/lib64/libc.so.6(gsignal+0x37)[0x7f9a960021f7]
/lib64/libc.so.6(abort+0x148)[0x7f9a960038e8]
/lib64/libc.so.6(+0x2e266)[0x7f9a95ffb266]
/lib64/libc.so.6(+0x2e312)[0x7f9a95ffb312]
/usr/sbin/mysqld(+0x6cd43f)[0x555fbdfe843f]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0xd50)[0x555fbddff140]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f9a95feec05]
/usr/sbin/mysqld(+0x4d736d)[0x555fbddf236d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.



 Comments   
Comment by Alexey [ 2018-03-20 ]

This is a typical systemd response - it tries to shutdown the joiner (due to "timeout") before the joiner manages to complete SST:

2018-03-20 16:36:15 0 [Note] /usr/sbin/mysqld (unknown): Normal shutdown

The assert happens due to a race between the ongoing startup and systemd-initialted shutdown. So it is a matter of fixing the systemd script.

Comment by Marko Mäkelä [ 2018-03-20 ]

It seems that wsrep_view_handler_cb() could add calls to sd_notifyf("EXTEND_TIMEOUT_USEC=…") to extend the startup timeout. Alternatively, wsrep_SE_init_wait() could call mysql_cond_timedwait() instead of mysql_cond_wait() in the loop and keep extending the timeout.

Comment by Zdravelina Sokolovska (Inactive) [ 2018-03-26 ]

on 10.2 joiner: => Rate:[ 39MiB/s] Avg:[32.9MiB/s] Elapsed:0:01:20 ---- SST failed to complete :Interrupted system call
on 10.3 joiner: => Rate:[37.8MiB/s] Avg:[37.8MiB/s] Elapsed:0:04:55 ---- SST completed but few after mysqld crashed

Comment by Zdravelina Sokolovska (Inactive) [ 2018-04-11 ]

note : on 10.2 with the same setup and data it's come to the problem MDEV-15606
https://jira.mariadb.org/browse/MDEV-15606?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=109584#comment-109584

Comment by Sergei Golubchik [ 2018-04-17 ]

Why 10.3 needs 4:55 when 10.2 completes SST in 1:20? Why 10.3 is 3.5 times slower?

Comment by Jan Lindström (Inactive) [ 2018-04-24 ]

https://github.com/MariaDB/server/commit/48e3b4ca5dd6a6cffbee64381dc301d43c66e036

Comment by Hartmut Holzgraefe [ 2018-10-31 ]

Looks as if timeouts still happen on server versions that should have this fixed according to version info above. SystemD timeouts during startup still happened on 10.2.18 and 10.3.9 servers. See MDEV-17571

Generated at Thu Feb 08 08:22:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.