[MDEV-9772] mysqld crash [Failed to JOIN the cluster after SST] Created: 2016-03-21  Updated: 2016-11-10  Resolved: 2016-06-03

Status: Closed
Project: MariaDB Server
Component/s: Galera SST, Replication
Affects Version/s: 10.1.12, 10.1.13
Fix Version/s: 10.1.15

Type: Bug Priority: Major
Reporter: Matyashov Andrey Assignee: Nirbhay Choubey (Inactive)
Resolution: Won't Fix Votes: 1
Labels: None
Environment:

Debian Jessie (with systemd)
Linux mariadb-02 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.5-1~bpo8+1 (2016-02-23) x86_64 GNU/Linux


Issue Links:
Duplicate
is duplicated by MDEV-10021 Joiner node crashes badly after succe... Closed

 Description   

Sometimes with restart any node after replication daemon mysqld is crashed.

2016-03-21 23:48:37 140077095184128 [Warning] WSREP: Member 1.2 (mariadb-02) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2016-03-21 23:48:38 140077095184128 [Warning] WSREP: Member 1.2 (mariadb-02) requested state transfer from '*any*', but it is impossible to select State Transfer donor: Resource temporarily unavailable
2016-03-21 23:48:38 140077095184128 [Note] WSREP: 2.2 (mariadb-01): State transfer to 0.2 (mariadb-03) complete.
2016-03-21 23:48:38 140077095184128 [Note] WSREP: Member 2.2 (mariadb-01) synced with group.
2016-03-21 23:48:39 140077095184128 [Note] WSREP: Member 1.2 (mariadb-02) requested state transfer from '*any*'. Selected 2.2 (mariadb-01)(SYNCED) as donor.
2016-03-21 23:48:39 140077095184128 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 45386)
2016-03-21 23:48:39 140082886372096 [Note] WSREP: Requesting state transfer: success after 80 tries, donor: 2
2016-03-21 23:49:02 140077103576832 [Note] WSREP: (18e4fe19, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.100.22.53:4567 
2016-03-21 23:49:02 140077103576832 [Note] WSREP: declaring b0cafdc9 at tcp://10.100.22.51:4567 stable
2016-03-21 23:49:02 140077103576832 [Note] WSREP: declaring d4ba184b at tcp://10.100.22.54:4567 stable
2016-03-21 23:49:02 140077103576832 [Note] WSREP: forgetting 01c16e15 (tcp://10.100.22.53:4567)
2016-03-21 23:49:02 140077103576832 [Note] WSREP: (18e4fe19, 'tcp://0.0.0.0:4567') turning message relay requesting off
2016-03-21 23:49:03 140077103576832 [Note] WSREP: Node 18e4fe19 state prim
2016-03-21 23:49:03 140077103576832 [Note] WSREP: view(view_id(PRIM,18e4fe19,123) memb {
        18e4fe19,2
        b0cafdc9,2
        d4ba184b,0
} joined {
} left {
} partitioned {
        01c16e15,2
})
2016-03-21 23:49:03 140077103576832 [Note] WSREP: save pc into disk
2016-03-21 23:49:03 140077103576832 [Note] WSREP: forgetting 01c16e15 (tcp://10.100.22.53:4567)
2016-03-21 23:49:03 140077095184128 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 3
2016-03-21 23:49:03 140077095184128 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 58c1a508-efa6-11e5-af34-e3ae0409c024
2016-03-21 23:49:03 140077095184128 [Note] WSREP: STATE EXCHANGE: sent state msg: 58c1a508-efa6-11e5-af34-e3ae0409c024
2016-03-21 23:49:03 140077095184128 [Note] WSREP: STATE EXCHANGE: got state msg: 58c1a508-efa6-11e5-af34-e3ae0409c024 from 0 (mariadb-02)
2016-03-21 23:49:03 140077095184128 [Note] WSREP: STATE EXCHANGE: got state msg: 58c1a508-efa6-11e5-af34-e3ae0409c024 from 2 (garb)
2016-03-21 23:49:03 140077095184128 [Note] WSREP: STATE EXCHANGE: got state msg: 58c1a508-efa6-11e5-af34-e3ae0409c024 from 1 (mariadb-01)
2016-03-21 23:49:03 140077095184128 [Note] WSREP: Quorum results:
        version    = 3,
        component  = PRIMARY,
        conf_id    = 71,
        members    = 2/3 (joined/total),
        act_id     = 45815,
        last_appl. = 45683,
        protocols  = 0/7/3 (gcs/repl/appl),
        group UUID = 3f92b3c2-ef9c-11e5-ad47-93c5357ece23
2016-03-21 23:49:03 140077095184128 [Note] WSREP: Flow-control interval: [28, 28]
2016-03-21 23:49:08 140077103576832 [Note] WSREP:  cleaning up 01c16e15 (tcp://10.100.22.53:4567)
2016-03-21 23:49:56 140077095184128 [Note] WSREP: 1.2 (mariadb-01): State transfer to 0.2 (mariadb-02) complete.
2016-03-21 23:49:56 140077095184128 [Note] WSREP: Member 1.2 (mariadb-01) synced with group.
2016-03-21 23:50:15 140082886686656 [Note] WSREP: SST complete, seqno: 45838
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: The InnoDB memory heap is disabled
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Memory barrier is not used
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Compressed tables use zlib 1.2.8
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Using Linux native AIO
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Using SSE crc32 instructions
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Initializing buffer pool, size = 512.0M
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Completed initialization of buffer pool
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Highest supported file format is Barracuda.
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: 128 rollback segment(s) are active.
2016-03-21 23:50:16 140082886686656 [Note] InnoDB: Waiting for purge to start
2016-03-21 23:50:17 140082886686656 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.28-76.1 started; log sequence number 74594461331
2016-03-21 23:50:17 140076138858240 [Note] InnoDB: Dumping buffer pool(s) not yet started
2016-03-21 23:50:17 140082886686656 [Note] Plugin 'FEEDBACK' is disabled.
2016-03-21 23:50:17 140082886686656 [Note] Recovering after a crash using /var/log/mysql/mariadb-bin
2016-03-21 23:50:17 140082886686656 [Note] Starting crash recovery...
2016-03-21 23:50:17 140082886686656 [Note] Crash recovery finished.
2016-03-21 23:50:17 140082886686656 [Note] Server socket created on IP: '0.0.0.0'.
2016-03-21 23:50:17 140082885253888 [Note] /usr/sbin/mysqld: Normal shutdown
 
2016-03-21 23:50:17 140082885253888 [Note] WSREP: Stop replication
2016-03-21 23:50:17 140082885253888 [Note] WSREP: Closing send monitor...
2016-03-21 23:50:17 140082885253888 [Note] WSREP: Closed send monitor.
2016-03-21 23:50:17 140082885253888 [Note] WSREP: gcomm: terminating thread
2016-03-21 23:50:17 140082885253888 [Note] WSREP: gcomm: joining thread
2016-03-21 23:50:17 140082885253888 [Note] WSREP: gcomm: closing backend
2016-03-21 23:50:17 140082886686656 [Warning] 'user' entry 'root@mariadb-01' ignored in --skip-name-resolve mode.
2016-03-21 23:50:17 140082886686656 [Warning] 'proxies_priv' entry '@% root@mariadb-01' ignored in --skip-name-resolve mode.
2016-03-21 23:50:17 140082885253888 [Note] WSREP: view(view_id(NON_PRIM,18e4fe19,123) memb {
        18e4fe19,2
} joined {
} left {
} partitioned {
        b0cafdc9,2
        d4ba184b,0
})
2016-03-21 23:50:17 140082885253888 [Note] WSREP: view((empty))
2016-03-21 23:50:17 140077095184128 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2016-03-21 23:50:17 140082885253888 [Note] WSREP: gcomm: closed
2016-03-21 23:50:17 140077095184128 [Note] WSREP: Flow-control interval: [16, 16]
2016-03-21 23:50:17 140077095184128 [Note] WSREP: Received NON-PRIMARY.
2016-03-21 23:50:17 140077095184128 [Note] WSREP: Shifting JOINER -> OPEN (TO: 45889)
2016-03-21 23:50:17 140077095184128 [Note] WSREP: Received self-leave message.
2016-03-21 23:50:17 140077095184128 [Note] WSREP: Flow-control interval: [0, 0]
2016-03-21 23:50:17 140077095184128 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2016-03-21 23:50:17 140077095184128 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 45889)
2016-03-21 23:50:17 140077095184128 [Note] WSREP: RECV thread exiting 0: Success
2016-03-21 23:50:17 140082885253888 [Note] WSREP: recv_thread() joined.
2016-03-21 23:50:17 140082885253888 [Note] WSREP: Closing replication queue.
2016-03-21 23:50:17 140082885253888 [Note] WSREP: Closing slave action queue.
2016-03-21 23:50:17 140082886686656 [Note] WSREP: Signalling provider to continue.
2016-03-21 23:50:17 140082833533696 [Note] Event Scheduler: scheduler thread started with id 4
2016-03-21 23:50:17 140082886686656 [Note] WSREP: SST received: 3f92b3c2-ef9c-11e5-ad47-93c5357ece23:45838
2016-03-21 23:50:17 140082886686656 [ERROR] WSREP: Trying to launch slave threads before creating connection at 'gcomm://10.100.22.54,10.100.22.53,10.100.22.51'
mysqld: /home/buildbot/buildbot/build/mariadb-10.1.12/sql/wsrep_thd.cc:401: void wsrep_create_appliers(long int): Assertion `0' failed.
160321 23:50:17 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see http://kb.askmonty.org/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.12-MariaDB-1~jessie
key_buffer_size=67108864
read_buffer_size=33554432
max_used_connections=0
max_threads=1026
thread_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 67326468 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
2016-03-21 23:50:17 140082886372096 [ERROR] WSREP: gcs/src/gcs.cpp:_join():810: Sending JOIN failed: -103 (Software caused connection abort).
stack_bottom = 0x0 thread_stack 0x48400
2016-03-21 23:50:17 140082886372096 [ERROR] WSREP: Failed to JOIN the cluster after SST
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 43139, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 43266, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 43395, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 43521, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 43650, -77 (File descriptor in bad state)
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5630f4313a3e]
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 43778, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 43904, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44036, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44160, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44290, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44418, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44547, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44674, -77 (File descriptor in bad state)
/usr/sbin/mysqld(handle_fatal_signal+0x34d)[0x5630f3e55cbd]
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44801, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 44931, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 45058, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 45185, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 45312, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 45440, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 45522, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 45650, -77 (File descriptor in bad state)
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8d0)[0x7f679671c8d0]
2016-03-21 23:50:17 140077157435136 [Warning] WSREP: Failed to report last committed 45779, -77 (File descriptor in bad state)
2016-03-21 23:50:17 140082886372096 [Note] WSREP: New cluster view: global state: 3f92b3c2-ef9c-11e5-ad47-93c5357ece23:45815, view# 72: Primary, number of nodes: 3, my index: 0, protocol version 3
2016-03-21 23:50:17 140082886372096 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-03-21 23:50:17 140082886372096 [Note] WSREP: REPL Protocols: 7 (3, 2)
2016-03-21 23:50:17 140077157435136 [Note] WSREP: Service thread queue flushed.
2016-03-21 23:50:17 140082886372096 [Note] WSREP: Assign initial position for certification: 45815, protocol version: 3
2016-03-21 23:50:17 140077157435136 [Note] WSREP: Service thread queue flushed.
2016-03-21 23:50:17 140082886372096 [ERROR] WSREP: gcs/src/gcs.cpp:_join():810: Sending JOIN failed: -103 (Software caused connection abort).
2016-03-21 23:50:17 140082886372096 [ERROR] WSREP: Failed to JOIN the cluster after SST
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f679480d067]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f679480e448]
/lib/x86_64-linux-gnu/libc.so.6(+0x2e266)[0x7f6794806266]
/lib/x86_64-linux-gnu/libc.so.6(+0x2e312)[0x7f6794806312]
/usr/sbin/mysqld(+0x552592)[0x5630f3e06592]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0x1484)[0x5630f3c41044]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0x7f67947f9b45]
/usr/sbin/mysqld(+0x38007d)[0x5630f3c3407d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.



 Comments   
Comment by Nirbhay Choubey (Inactive) [ 2016-04-04 ]

sys.admin Does it happen in a scenario where multiple joiner are contenting to join at the same time? Could you share the configurations?

Comment by Matyashov Andrey [ 2016-04-04 ]

My config:

[client]
max_allowed_packet      = 128M
port                    = 3306
socket                  = /var/run/mysqld/mysqld.sock
 
[mysqld_safe]
socket                  = /var/run/mysqld/mysqld.sock
nice                    = 0
 
[mysqld]
user                    = mysql
pid-file                = /var/run/mysqld/mysqld.pid
socket                  = /var/run/mysqld/mysqld.sock
port                    = 3306
basedir                 = /usr
datadir                 = /var/lib/mysql
tmpdir                  = /tmp
lc_messages_dir         = /usr/share/mysql
lc_messages             = en_US
skip-external-locking   = 1
collation-server        = utf8_general_ci
init-connect            = 'SET NAMES utf8'
character-set-server    = utf8
bind-address            = 0.0.0.0
net_write_timeout       = 300
skip-name-resolve       = yes
max_connections         = 128
connect_timeout         = 5
wait_timeout            = 600
max_allowed_packet      = 128M
thread_cache_size       = 128
sort_buffer_size        = 16M
bulk_insert_buffer_size = 64M
tmp_table_size          = 128M
max_heap_table_size     = 128M
myisam_recover_options  = BACKUP
key_buffer_size         = 32M
table_open_cache        = 400
myisam_sort_buffer_size = 512M
concurrent_insert       = 2
read_buffer_size        = 32M
read_rnd_buffer_size    = 16M
query_cache_limit       = 32M
query_cache_size        = 256M
general_log_file        = /var/log/mysql/mysql.log
general_log             = 0
log-error               = /var/log/mysql/mysql_err.log
log_warnings            = 2
slow_query_log=0
slow_query_log_file     = /var/log/mysql/mariadb-slow.log
long_query_time         = 10
log_slow_verbosity      = query_plan
server-id               = 1
report_host             = mariadb-01
log_bin                 = /var/log/mysql/mariadb-bin
log_bin_index           = /var/log/mysql/mariadb-bin.index
event_scheduler         = on
sync_binlog             = 0
expire_logs_days        = 32
max_binlog_size         = 1024M
relay_log              = /var/log/mysql/relay-bin
relay_log_index        = /var/log/mysql/relay-bin.index
relay_log_info_file    = /var/log/mysql/relay-bin.info
log_slave_updates       = 1
slave_type_conversions  = ALL_NON_LOSSY
default_storage_engine  = InnoDB
innodb_log_file_size    = 512M
innodb_buffer_pool_size = 512M
innodb_log_buffer_size  = 64M
innodb_file_per_table   = 1
innodb_doublewrite      = 1
innodb_open_files       = 2400
innodb_io_capacity      = 2400
innodb_flush_method     = O_DIRECT
innodb_file_format      = barracuda
innodb-data-file-path   = ibdata1:10M:autoextend
 
[galera]
wsrep_on                = ON
wsrep_provider          = /usr/lib/libgalera_smm.so
wsrep_node_address      = 10.100.22.51
wsrep_cluster_address   = gcomm://10.100.22.54,10.100.22.53,10.100.22.52
wsrep_sst_auth          = root:MY_SECRET
wsrep_cluster_name      = mariadb-cluster
wsrep_node_name         = mariadb-01
binlog_format           = row
default_storage_engine  = InnoDB
innodb_autoinc_lock_mode= 2
wsrep_sst_method        = xtrabackup-v2
wsrep_retry_autocommit  = 3
wsrep_restart_slave     = On
wsrep_debug             = Off
wsrep_auto_increment_control = Off
wsrep_provider_options  = "gcache.size=5G;repl.commit_order=1;gmcast.segment=2;pc.recovery=TRUE;debug=no;gcs.fc_debug=0;pc.wait_prim=true"
wsrep_replicate_myisam  = On
max_allowed_packet      = 128M
bind-address            = 0.0.0.0
wsrep_slave_threads     = 32
innodb_flush_log_at_trx_commit = 1
 
[mysqldump]
quick
quote-names
max_allowed_packet      = 128M
 
[mysql]
 
[isamchk]
key_buffer              = 128M
!includedir /etc/mysql/conf.d/

Comment by Andrew Garner [ 2016-04-22 ]

I just ran into this by accident last night in a test environment. This seemed to be caused by the service timing out during startup and systemd sends a SIGTERM. SST finished up but mysqld immediately starts the shutdown process afterwards. This seems to shutdown the wsrep connections (I suppose in the shutdown thread) and when the startup bits finish trickling through to wsrep_create_appliers we hit this assertion. This seems to line up with the logs in this issue where we see an immediate shutdown starting after the initial startup.

Avoiding was fairly easy by setting a sufficiently high TimeoutStartSec for the mariadb service. I was able to reproduce this multiple times this morning by forcing SST and setting TimeoutStartSec rather low. If it is useful I can provide logs or other debugging info from my environment.

Comment by Maiavria [ 2016-05-04 ]

Thanks Andrew Garner, this fixed my problem too.

I think this should be fixed in the next release, because when starting a node of a cluster it may always need to synchronize with other nodes and thus take from 1 to 20 or even 40 minutes depending of the amount of data.

Comment by Nirbhay Choubey (Inactive) [ 2016-06-03 ]

This could happen when systemd service timeout for mariadb is not sufficiently large.

Comment by Alexander Chernov [ 2016-11-10 ]

Just a heads up I am still seeing this behavior on 10.1.18-MariaDB-1~jessie (ticket is reporting 10.1.15 as fixed)

Comment by Nirbhay Choubey (Inactive) [ 2016-11-10 ]

alekc This was closed as Won't Fix. You'll need to increase the TimeoutStartSec.

Generated at Thu Feb 08 07:37:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.