[MDEV-14282] SIGHUP after rsync SST causes crash in my_hash_element - Jira

Geoff Montee (Inactive) created issue - 2017-11-03 23:29

Elena Stepanova made changes - 2017-11-04 00:08

Field	Original Value	New Value
Assignee		Andrii Nikitin [ anikitin ]

Geoff Montee (Inactive) made changes - 2017-11-07 21:01

Description

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

{noformat}
2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)$
db-g3-bin.000042$
ls: cannot access db-devg1-binlog.*: No such file or directory$
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)$
WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)$
2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used or memory barrier$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.$
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN= 244892544453$
2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) sy ced with group.$
2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages f om the doublewrite buffer...$
2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log$
2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages f om redo log$
InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.0000 1$
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.$
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start$
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6 36-82.2 started; log sequence number 1244894457235$
2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.$
2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started$
2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.$
$
Status information:$
$
Current dir: /app/mysql/galera/$
Running threads: 2 Stack size: 295936$
Current locks:$
lock: 0x7fc98da48788:$
$
lock: 0x7fc9cdc6a508:$
$
lock: 0x7fcc598b5520:$
$
lock: 0x7fcc598b5440:$
$
lock: 0x7fcc598b4480:$
$
lock: 0x7fcc598b43a0:$
$
lock: 0x7fcc598b42c0:$
$
lock: 0x7fcc598b4640:$
$
lock: 0x7fcc598b2880:$
$
lock: 0x7fcc598b5360:$
$
lock: 0x7fcc598b2f80:$
$
lock: 0x7fcc598b3220:$
$
lock: 0x7fcc598b2ea0:$
$
lock: 0x7fcc598b3140:$
$
lock: 0x7fcc598b2dc0:$
$
lock: 0x7fcc598b3060:$
$
lock: 0x7fcc598b35a0:$
$
lock: 0x7fcc598b3680:$
$
lock: 0x7fcc598b3760:$
$
lock: 0x7fcc598b2ce0:$
$
lock: 0x7fcc598b2a40:$
$
lock: 0x7fcc598b2c00:$
$
lock: 0x7fcc598b2960:$
$
lock: 0x7fcc598b2b20:$
$
lock: 0x7fcc598b3300:$
$
lock: 0x7fcc598b33e0:$
$
lock: 0x7fcc598b34c0:$
$
lock: 0x7fcc598b4fe0:$
$
lock: 0x7fcc598b5280:$
$
lock: 0x7fcc598b51a0:$
$
lock: 0x7fcc598b50c0:$
$
lock: 0x7fcc598b4c60:$
$
lock: 0x7fcc598b4b80:$
$
lock: 0x7fcc598b4aa0:$
$
lock: 0x7fcc598b49c0:$
$
lock: 0x7fcc598b48e0:$
$
lock: 0x7fcc598b4e20:$
$
lock: 0x7fcc598b4800:$
$
lock: 0x7fcc598b4720:$
$
lock: 0x7fcc598b4f00:$
$
lock: 0x7fcc598b4560:$
$
lock: 0x7fcc598b4100:$
$
lock: 0x7fcc598b41e0:$
$
lock: 0x7fcc598b4020:$
$
lock: 0x7fcc598b3f40:$
$
lock: 0x7fcc598b3bc0:$
$
lock: 0x7fcc598b3e60:$
$
lock: 0x7fcc598b3d80:$
$
lock: 0x7fcc598b3ae0:$
$
lock: 0x7fcc598b3ca0:$
$
lock: 0x7fcc598b3840:$
$
lock: 0x7fcc598b3920:$
$
lock: 0x7fcc598b3a00:$
$
lock: 0x7fcc598b4d40:$
$
$
Key caches:$
default$
Buffer_size: 134217728$
Block_size: 1024$
Division_limit: 100$
Age_threshold: 300$
Partitions: 0$
blocks used: 0$
not flushed: 0$
w_requests: 0$
writes: 0$
r_requests: 0$
reads: 0$
$
$
handler status:$
read_key: 0$
read_next: 0$
read_rnd 0$
read_first: 0$
write: 0$
delete 0$
update: 0$
$
Table status:$
Opened tables: 2$
Open tables: 2$
Open files: 8$
Open streams: 0$
$
Alarm status:$
Active alarms: 0$
Max used alarms: 0$
Next alarm time: 0$
$
Memory status:$
Non-mmapped space allocated from system: 0$
Number of free chunks: 1$
Number of fastbin blocks: 0$
Number of mmapped regions: 0$
Space in mmapped regions: 0$
Maximum total allocated space: 0$
Space available in freed fastbin blocks: 0$
Total allocated space: 0$
Total free space: 0$
Top-most, releasable space: 0$
Estimated memory (with thread stack): 591872$
$
$
$
Events status:$
LLA = Last Locked At LUA = Last Unlocked At$
WOC = Waiting On Condition DL = Data Locked$
The Event Scheduler is disabled$
$
2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.$
2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748 6170:230930977$
2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4$
171103 17:07:23 [ERROR] mysqld got signal 11 ;$
This could be because you hit a bug. It is also possible that this binary$
or one of the libraries it was linked against is corrupt, improperly built,$
or misconfigured. This error can also be caused by malfunctioning hardware.$
$
To report this bug, see https://mariadb.com/kb/en/reporting-bugs$
$
We will try our best to scrape up some info that will hopefully help$
diagnose the problem, but since we have already crashed, $
something is definitely wrong and this may fail.$
$
Server version: 10.1.28-MariaDB$
key_buffer_size=134217728$
read_buffer_size=131072$
max_used_connections=0$
max_threads=1002$
thread_count=10$
It is possible that mysqld could use up to $
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K bytes of memor $
Hope that's ok; if not, decrease some variables in the equation.$
$
Thread pointer: 0x0$
Attempting backtrace. You can use the following information to find out$
where mysqld died. If you see no messages after this, something went$
terribly wrong...$
stack_bottom = 0x0 thread_stack 0x48400$
mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]$
2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded$
2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table$
2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.$
Version: '10.1.28-MariaDB' socket: '/app/mysql/temporaries/galera/mysqld.sock' port: 3306 MariaDB Server$
2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.$
2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)$
mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]$
/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]$
addr2line: 'mysqld': No such file$
mysqld(my_hash_element+0x13)[0x7fcc5900c023]$
mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]$
mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]$
mysqld(signal_hand+0x382)[0x7fcc5892c812]$
/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]$
/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]$
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains$
information that should help you find out what is causing the crash.$
{noformat}

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

{noformat}
2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required:
Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
Local state: 00000000-0000-0000-0000-000000000000:-1
2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/' --defaults-file '/app/mysql/config/galera.cnf' --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst
2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3
2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0
2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)
db-g3-bin.000042
ls: cannot access db-node1-binlog.*: No such file or directory
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)
WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)
2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453
2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.
2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log
2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log
InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235
2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.
2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.

Status information:

Current dir: /app/mysql/galera/
Running threads: 2 Stack size: 295936
Current locks:
lock: 0x7fc98da48788:

lock: 0x7fc9cdc6a508:

lock: 0x7fcc598b5520:

lock: 0x7fcc598b5440:

lock: 0x7fcc598b4480:

lock: 0x7fcc598b43a0:

lock: 0x7fcc598b42c0:

lock: 0x7fcc598b4640:

lock: 0x7fcc598b2880:

lock: 0x7fcc598b5360:

lock: 0x7fcc598b2f80:

lock: 0x7fcc598b3220:

lock: 0x7fcc598b2ea0:

lock: 0x7fcc598b3140:

lock: 0x7fcc598b2dc0:

lock: 0x7fcc598b3060:

lock: 0x7fcc598b35a0:

lock: 0x7fcc598b3680:

lock: 0x7fcc598b3760:

lock: 0x7fcc598b2ce0:

lock: 0x7fcc598b2a40:

lock: 0x7fcc598b2c00:

lock: 0x7fcc598b2960:

lock: 0x7fcc598b2b20:

lock: 0x7fcc598b3300:

lock: 0x7fcc598b33e0:

lock: 0x7fcc598b34c0:

lock: 0x7fcc598b4fe0:

lock: 0x7fcc598b5280:

lock: 0x7fcc598b51a0:

lock: 0x7fcc598b50c0:

lock: 0x7fcc598b4c60:

lock: 0x7fcc598b4b80:

lock: 0x7fcc598b4aa0:

lock: 0x7fcc598b49c0:

lock: 0x7fcc598b48e0:

lock: 0x7fcc598b4e20:

lock: 0x7fcc598b4800:

lock: 0x7fcc598b4720:

lock: 0x7fcc598b4f00:

lock: 0x7fcc598b4560:

lock: 0x7fcc598b4100:

lock: 0x7fcc598b41e0:

lock: 0x7fcc598b4020:

lock: 0x7fcc598b3f40:

lock: 0x7fcc598b3bc0:

lock: 0x7fcc598b3e60:

lock: 0x7fcc598b3d80:

lock: 0x7fcc598b3ae0:

lock: 0x7fcc598b3ca0:

lock: 0x7fcc598b3840:

lock: 0x7fcc598b3920:

lock: 0x7fcc598b3a00:

lock: 0x7fcc598b4d40:

Key caches:
default
Buffer_size: 134217728
Block_size: 1024
Division_limit: 100
Age_threshold: 300
Partitions: 0
blocks used: 0
not flushed: 0
w_requests: 0
writes: 0
r_requests: 0
reads: 0

handler status:
read_key: 0
read_next: 0
read_rnd 0
read_first: 0
write: 0
delete 0
update: 0

Table status:
Opened tables: 2
Open tables: 2
Open files: 8
Open streams: 0

Alarm status:
Active alarms: 0
Max used alarms: 0
Next alarm time: 0

Memory status:
Non-mmapped space allocated from system: 0
Number of free chunks: 1
Number of fastbin blocks: 0
Number of mmapped regions: 0
Space in mmapped regions: 0
Maximum total allocated space: 0
Space available in freed fastbin blocks: 0
Total allocated space: 0
Total free space: 0
Top-most, releasable space: 0
Estimated memory (with thread stack): 591872

Events status:
LLA = Last Locked At LUA = Last Unlocked At
WOC = Waiting On Condition DL = Data Locked
The Event Scheduler is disabled

2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.
2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4
171103 17:07:23 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.28-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=1002
thread_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48400
mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]
2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded
2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table
2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.
Version: '10.1.28-MariaDB' socket: '/app/mysql/temporaries/galera/mysqld.sock' port: 3306 MariaDB Server
2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.
2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)
mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]
/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]
addr2line: 'mysqld': No such file
mysqld(my_hash_element+0x13)[0x7fcc5900c023]
mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]
mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]
mysqld(signal_hand+0x382)[0x7fcc5892c812]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
{noformat}

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

Geoff Montee (Inactive) made changes - 2017-11-07 21:05

Description

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

{noformat}
2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required:
Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
Local state: 00000000-0000-0000-0000-000000000000:-1
2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/' --defaults-file '/app/mysql/config/galera.cnf' --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst
2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3
2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0
2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)
db-g3-bin.000042
ls: cannot access db-node1-binlog.*: No such file or directory
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)
WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)
2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453
2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.
2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log
2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log
InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235
2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.
2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.

Status information:

Current dir: /app/mysql/galera/
Running threads: 2 Stack size: 295936
Current locks:
lock: 0x7fc98da48788:

lock: 0x7fc9cdc6a508:

lock: 0x7fcc598b5520:

lock: 0x7fcc598b5440:

lock: 0x7fcc598b4480:

lock: 0x7fcc598b43a0:

lock: 0x7fcc598b42c0:

lock: 0x7fcc598b4640:

lock: 0x7fcc598b2880:

lock: 0x7fcc598b5360:

lock: 0x7fcc598b2f80:

lock: 0x7fcc598b3220:

lock: 0x7fcc598b2ea0:

lock: 0x7fcc598b3140:

lock: 0x7fcc598b2dc0:

lock: 0x7fcc598b3060:

lock: 0x7fcc598b35a0:

lock: 0x7fcc598b3680:

lock: 0x7fcc598b3760:

lock: 0x7fcc598b2ce0:

lock: 0x7fcc598b2a40:

lock: 0x7fcc598b2c00:

lock: 0x7fcc598b2960:

lock: 0x7fcc598b2b20:

lock: 0x7fcc598b3300:

lock: 0x7fcc598b33e0:

lock: 0x7fcc598b34c0:

lock: 0x7fcc598b4fe0:

lock: 0x7fcc598b5280:

lock: 0x7fcc598b51a0:

lock: 0x7fcc598b50c0:

lock: 0x7fcc598b4c60:

lock: 0x7fcc598b4b80:

lock: 0x7fcc598b4aa0:

lock: 0x7fcc598b49c0:

lock: 0x7fcc598b48e0:

lock: 0x7fcc598b4e20:

lock: 0x7fcc598b4800:

lock: 0x7fcc598b4720:

lock: 0x7fcc598b4f00:

lock: 0x7fcc598b4560:

lock: 0x7fcc598b4100:

lock: 0x7fcc598b41e0:

lock: 0x7fcc598b4020:

lock: 0x7fcc598b3f40:

lock: 0x7fcc598b3bc0:

lock: 0x7fcc598b3e60:

lock: 0x7fcc598b3d80:

lock: 0x7fcc598b3ae0:

lock: 0x7fcc598b3ca0:

lock: 0x7fcc598b3840:

lock: 0x7fcc598b3920:

lock: 0x7fcc598b3a00:

lock: 0x7fcc598b4d40:

Key caches:
default
Buffer_size: 134217728
Block_size: 1024
Division_limit: 100
Age_threshold: 300
Partitions: 0
blocks used: 0
not flushed: 0
w_requests: 0
writes: 0
r_requests: 0
reads: 0

handler status:
read_key: 0
read_next: 0
read_rnd 0
read_first: 0
write: 0
delete 0
update: 0

Table status:
Opened tables: 2
Open tables: 2
Open files: 8
Open streams: 0

Alarm status:
Active alarms: 0
Max used alarms: 0
Next alarm time: 0

Memory status:
Non-mmapped space allocated from system: 0
Number of free chunks: 1
Number of fastbin blocks: 0
Number of mmapped regions: 0
Space in mmapped regions: 0
Maximum total allocated space: 0
Space available in freed fastbin blocks: 0
Total allocated space: 0
Total free space: 0
Top-most, releasable space: 0
Estimated memory (with thread stack): 591872

Events status:
LLA = Last Locked At LUA = Last Unlocked At
WOC = Waiting On Condition DL = Data Locked
The Event Scheduler is disabled

2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.
2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4
171103 17:07:23 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.28-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=1002
thread_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48400
mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]
2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded
2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table
2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.
Version: '10.1.28-MariaDB' socket: '/app/mysql/temporaries/galera/mysqld.sock' port: 3306 MariaDB Server
2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.
2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)
mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]
/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]
addr2line: 'mysqld': No such file
mysqld(my_hash_element+0x13)[0x7fcc5900c023]
mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]
mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]
mysqld(signal_hand+0x382)[0x7fcc5892c812]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
{noformat}

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

{noformat}
2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required:
Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
Local state: 00000000-0000-0000-0000-000000000000:-1
2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/' --defaults-file '/app/mysql/config/galera.cnf' --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst
2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3
2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0
2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)
db-g3-bin.000042
ls: cannot access db-node1-binlog.*: No such file or directory
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)
WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)
2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453
2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.
2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log
2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log
InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235
2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.
2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.

Status information:

Current dir: /app/mysql/galera/
Running threads: 2 Stack size: 295936
Current locks:
lock: 0x7fc98da48788:

lock: 0x7fc9cdc6a508:

lock: 0x7fcc598b5520:

lock: 0x7fcc598b5440:

lock: 0x7fcc598b4480:

lock: 0x7fcc598b43a0:

lock: 0x7fcc598b42c0:

lock: 0x7fcc598b4640:

lock: 0x7fcc598b2880:

lock: 0x7fcc598b5360:

lock: 0x7fcc598b2f80:

lock: 0x7fcc598b3220:

lock: 0x7fcc598b2ea0:

lock: 0x7fcc598b3140:

lock: 0x7fcc598b2dc0:

lock: 0x7fcc598b3060:

lock: 0x7fcc598b35a0:

lock: 0x7fcc598b3680:

lock: 0x7fcc598b3760:

lock: 0x7fcc598b2ce0:

lock: 0x7fcc598b2a40:

lock: 0x7fcc598b2c00:

lock: 0x7fcc598b2960:

lock: 0x7fcc598b2b20:

lock: 0x7fcc598b3300:

lock: 0x7fcc598b33e0:

lock: 0x7fcc598b34c0:

lock: 0x7fcc598b4fe0:

lock: 0x7fcc598b5280:

lock: 0x7fcc598b51a0:

lock: 0x7fcc598b50c0:

lock: 0x7fcc598b4c60:

lock: 0x7fcc598b4b80:

lock: 0x7fcc598b4aa0:

lock: 0x7fcc598b49c0:

lock: 0x7fcc598b48e0:

lock: 0x7fcc598b4e20:

lock: 0x7fcc598b4800:

lock: 0x7fcc598b4720:

lock: 0x7fcc598b4f00:

lock: 0x7fcc598b4560:

lock: 0x7fcc598b4100:

lock: 0x7fcc598b41e0:

lock: 0x7fcc598b4020:

lock: 0x7fcc598b3f40:

lock: 0x7fcc598b3bc0:

lock: 0x7fcc598b3e60:

lock: 0x7fcc598b3d80:

lock: 0x7fcc598b3ae0:

lock: 0x7fcc598b3ca0:

lock: 0x7fcc598b3840:

lock: 0x7fcc598b3920:

lock: 0x7fcc598b3a00:

lock: 0x7fcc598b4d40:

Key caches:
default
Buffer_size: 134217728
Block_size: 1024
Division_limit: 100
Age_threshold: 300
Partitions: 0
blocks used: 0
not flushed: 0
w_requests: 0
writes: 0
r_requests: 0
reads: 0

handler status:
read_key: 0
read_next: 0
read_rnd 0
read_first: 0
write: 0
delete 0
update: 0

Table status:
Opened tables: 2
Open tables: 2
Open files: 8
Open streams: 0

Alarm status:
Active alarms: 0
Max used alarms: 0
Next alarm time: 0

Memory status:
Non-mmapped space allocated from system: 0
Number of free chunks: 1
Number of fastbin blocks: 0
Number of mmapped regions: 0
Space in mmapped regions: 0
Maximum total allocated space: 0
Space available in freed fastbin blocks: 0
Total allocated space: 0
Total free space: 0
Top-most, releasable space: 0
Estimated memory (with thread stack): 591872

Events status:
LLA = Last Locked At LUA = Last Unlocked At
WOC = Waiting On Condition DL = Data Locked
The Event Scheduler is disabled

2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.
2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4
171103 17:07:23 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.28-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=1002
thread_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48400
mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]
2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded
2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table
2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.
Version: '10.1.28-MariaDB' socket: '/app/mysql/temporaries/galera/mysqld.sock' port: 3306 MariaDB Server
2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.
2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)
mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]
/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]
addr2line: 'mysqld': No such file
mysqld(my_hash_element+0x13)[0x7fcc5900c023]
mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]
mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]
mysqld(signal_hand+0x382)[0x7fcc5892c812]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
{noformat}

I'm not sure if it could provide any hint as to the cause of the crash, but the donor's error log contained a lot of messages about long semaphore waits during the time period in which it acted as donor:

{noformat}
InnoDB: ###### Diagnostic info printed to the standard error stream
WSREP: avoiding InnoDB self crash due to long semaphore wait of > 600 seconds
Server is processing SST donor operation, fatal_cnt now: 0InnoDB: Warning: a long semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
{noformat}

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

Geoff Montee (Inactive) made changes - 2017-11-07 21:07

Description

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

{noformat}
2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required:
Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
Local state: 00000000-0000-0000-0000-000000000000:-1
2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/' --defaults-file '/app/mysql/config/galera.cnf' --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst
2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3
2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0
2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)
db-g3-bin.000042
ls: cannot access db-node1-binlog.*: No such file or directory
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)
WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)
2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453
2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.
2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log
2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log
InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235
2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.
2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.

Status information:

Current dir: /app/mysql/galera/
Running threads: 2 Stack size: 295936
Current locks:
lock: 0x7fc98da48788:

lock: 0x7fc9cdc6a508:

lock: 0x7fcc598b5520:

lock: 0x7fcc598b5440:

lock: 0x7fcc598b4480:

lock: 0x7fcc598b43a0:

lock: 0x7fcc598b42c0:

lock: 0x7fcc598b4640:

lock: 0x7fcc598b2880:

lock: 0x7fcc598b5360:

lock: 0x7fcc598b2f80:

lock: 0x7fcc598b3220:

lock: 0x7fcc598b2ea0:

lock: 0x7fcc598b3140:

lock: 0x7fcc598b2dc0:

lock: 0x7fcc598b3060:

lock: 0x7fcc598b35a0:

lock: 0x7fcc598b3680:

lock: 0x7fcc598b3760:

lock: 0x7fcc598b2ce0:

lock: 0x7fcc598b2a40:

lock: 0x7fcc598b2c00:

lock: 0x7fcc598b2960:

lock: 0x7fcc598b2b20:

lock: 0x7fcc598b3300:

lock: 0x7fcc598b33e0:

lock: 0x7fcc598b34c0:

lock: 0x7fcc598b4fe0:

lock: 0x7fcc598b5280:

lock: 0x7fcc598b51a0:

lock: 0x7fcc598b50c0:

lock: 0x7fcc598b4c60:

lock: 0x7fcc598b4b80:

lock: 0x7fcc598b4aa0:

lock: 0x7fcc598b49c0:

lock: 0x7fcc598b48e0:

lock: 0x7fcc598b4e20:

lock: 0x7fcc598b4800:

lock: 0x7fcc598b4720:

lock: 0x7fcc598b4f00:

lock: 0x7fcc598b4560:

lock: 0x7fcc598b4100:

lock: 0x7fcc598b41e0:

lock: 0x7fcc598b4020:

lock: 0x7fcc598b3f40:

lock: 0x7fcc598b3bc0:

lock: 0x7fcc598b3e60:

lock: 0x7fcc598b3d80:

lock: 0x7fcc598b3ae0:

lock: 0x7fcc598b3ca0:

lock: 0x7fcc598b3840:

lock: 0x7fcc598b3920:

lock: 0x7fcc598b3a00:

lock: 0x7fcc598b4d40:

Key caches:
default
Buffer_size: 134217728
Block_size: 1024
Division_limit: 100
Age_threshold: 300
Partitions: 0
blocks used: 0
not flushed: 0
w_requests: 0
writes: 0
r_requests: 0
reads: 0

handler status:
read_key: 0
read_next: 0
read_rnd 0
read_first: 0
write: 0
delete 0
update: 0

Table status:
Opened tables: 2
Open tables: 2
Open files: 8
Open streams: 0

Alarm status:
Active alarms: 0
Max used alarms: 0
Next alarm time: 0

Memory status:
Non-mmapped space allocated from system: 0
Number of free chunks: 1
Number of fastbin blocks: 0
Number of mmapped regions: 0
Space in mmapped regions: 0
Maximum total allocated space: 0
Space available in freed fastbin blocks: 0
Total allocated space: 0
Total free space: 0
Top-most, releasable space: 0
Estimated memory (with thread stack): 591872

Events status:
LLA = Last Locked At LUA = Last Unlocked At
WOC = Waiting On Condition DL = Data Locked
The Event Scheduler is disabled

2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.
2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4
171103 17:07:23 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.28-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=1002
thread_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48400
mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]
2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded
2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table
2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.
Version: '10.1.28-MariaDB' socket: '/app/mysql/temporaries/galera/mysqld.sock' port: 3306 MariaDB Server
2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.
2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)
mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]
/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]
addr2line: 'mysqld': No such file
mysqld(my_hash_element+0x13)[0x7fcc5900c023]
mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]
mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]
mysqld(signal_hand+0x382)[0x7fcc5892c812]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
{noformat}

I'm not sure if it could provide any hint as to the cause of the crash, but the donor's error log contained a lot of messages about long semaphore waits during the time period in which it acted as donor:

{noformat}
InnoDB: ###### Diagnostic info printed to the standard error stream
WSREP: avoiding InnoDB self crash due to long semaphore wait of > 600 seconds
Server is processing SST donor operation, fatal_cnt now: 0InnoDB: Warning: a long semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
{noformat}

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

{noformat}
2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required:
Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
Local state: 00000000-0000-0000-0000-000000000000:-1
2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/' --defaults-file '/app/mysql/config/galera.cnf' --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst
2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3
2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)
at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0
2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)
db-g3-bin.000042
ls: cannot access db-node1-binlog.*: No such file or directory
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)
WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)
2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453
2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.
2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log
2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log
InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235
2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.
2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.

Status information:

Current dir: /app/mysql/galera/
Running threads: 2 Stack size: 295936
Current locks:
lock: 0x7fc98da48788:

lock: 0x7fc9cdc6a508:

lock: 0x7fcc598b5520:

lock: 0x7fcc598b5440:

lock: 0x7fcc598b4480:

lock: 0x7fcc598b43a0:

lock: 0x7fcc598b42c0:

lock: 0x7fcc598b4640:

lock: 0x7fcc598b2880:

lock: 0x7fcc598b5360:

lock: 0x7fcc598b2f80:

lock: 0x7fcc598b3220:

lock: 0x7fcc598b2ea0:

lock: 0x7fcc598b3140:

lock: 0x7fcc598b2dc0:

lock: 0x7fcc598b3060:

lock: 0x7fcc598b35a0:

lock: 0x7fcc598b3680:

lock: 0x7fcc598b3760:

lock: 0x7fcc598b2ce0:

lock: 0x7fcc598b2a40:

lock: 0x7fcc598b2c00:

lock: 0x7fcc598b2960:

lock: 0x7fcc598b2b20:

lock: 0x7fcc598b3300:

lock: 0x7fcc598b33e0:

lock: 0x7fcc598b34c0:

lock: 0x7fcc598b4fe0:

lock: 0x7fcc598b5280:

lock: 0x7fcc598b51a0:

lock: 0x7fcc598b50c0:

lock: 0x7fcc598b4c60:

lock: 0x7fcc598b4b80:

lock: 0x7fcc598b4aa0:

lock: 0x7fcc598b49c0:

lock: 0x7fcc598b48e0:

lock: 0x7fcc598b4e20:

lock: 0x7fcc598b4800:

lock: 0x7fcc598b4720:

lock: 0x7fcc598b4f00:

lock: 0x7fcc598b4560:

lock: 0x7fcc598b4100:

lock: 0x7fcc598b41e0:

lock: 0x7fcc598b4020:

lock: 0x7fcc598b3f40:

lock: 0x7fcc598b3bc0:

lock: 0x7fcc598b3e60:

lock: 0x7fcc598b3d80:

lock: 0x7fcc598b3ae0:

lock: 0x7fcc598b3ca0:

lock: 0x7fcc598b3840:

lock: 0x7fcc598b3920:

lock: 0x7fcc598b3a00:

lock: 0x7fcc598b4d40:

Key caches:
default
Buffer_size: 134217728
Block_size: 1024
Division_limit: 100
Age_threshold: 300
Partitions: 0
blocks used: 0
not flushed: 0
w_requests: 0
writes: 0
r_requests: 0
reads: 0

handler status:
read_key: 0
read_next: 0
read_rnd 0
read_first: 0
write: 0
delete 0
update: 0

Table status:
Opened tables: 2
Open tables: 2
Open files: 8
Open streams: 0

Alarm status:
Active alarms: 0
Max used alarms: 0
Next alarm time: 0

Memory status:
Non-mmapped space allocated from system: 0
Number of free chunks: 1
Number of fastbin blocks: 0
Number of mmapped regions: 0
Space in mmapped regions: 0
Maximum total allocated space: 0
Space available in freed fastbin blocks: 0
Total allocated space: 0
Total free space: 0
Top-most, releasable space: 0
Estimated memory (with thread stack): 591872

Events status:
LLA = Last Locked At LUA = Last Unlocked At
WOC = Waiting On Condition DL = Data Locked
The Event Scheduler is disabled

2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.
2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4
171103 17:07:23 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.28-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=1002
thread_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48400
mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]
2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded
2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table
2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.
Version: '10.1.28-MariaDB' socket: '/app/mysql/temporaries/galera/mysqld.sock' port: 3306 MariaDB Server
2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.
2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)
mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]
/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]
addr2line: 'mysqld': No such file
mysqld(my_hash_element+0x13)[0x7fcc5900c023]
mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]
mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]
mysqld(signal_hand+0x382)[0x7fcc5892c812]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
{noformat}

I'm not sure if it could provide any hint as to the cause of the crash, but the donor's error log contained a lot of messages about long semaphore waits involving log_sys->checkpoint_lock during the time period in which it acted as donor:

{noformat}
InnoDB: ###### Diagnostic info printed to the standard error stream
WSREP: avoiding InnoDB self crash due to long semaphore wait of > 600 seconds
Server is processing SST donor operation, fatal_cnt now: 0InnoDB: Warning: a long semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
{noformat}

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

Andrii Nikitin (Inactive) added a comment - 2017-11-08 08:29 - edited

Geoff, I am not sure if checkoint on master is related, but the crash happened during something like kill -HUP $(pidof mysqld) (which follows something like mysqladmin debug) on problem node.
So, the most we can do is to try find out if user was using some custom commands/scripts during/after sst and try to write test case which replicates that behavior.

Andrii Nikitin (Inactive) added a comment - 2017-11-08 08:29 - edited Geoff, I am not sure if checkoint on master is related, but the crash happened during something like kill -HUP $(pidof mysqld) (which follows something like mysqladmin debug ) on problem node. So, the most we can do is to try find out if user was using some custom commands/scripts during/after sst and try to write test case which replicates that behavior.

Andrii Nikitin (Inactive) made changes - 2017-11-08 08:53

Fix Version/s

10.1 [ 16100 ]

Geoff Montee (Inactive) added a comment - 2017-11-08 18:01

Thanks, Andrii. It sounds like you don't think that this was a crash due to a bug, and instead, some user or other process actually explicitly killed mysqld on the joiner. Do I understand you correctly? If so, do you think the kill could have occurred due to a systemd timeout (i.e. the problem described by ~~MDEV-9202~~)?

Geoff Montee (Inactive) added a comment - 2017-11-08 18:01 Thanks, Andrii. It sounds like you don't think that this was a crash due to a bug, and instead, some user or other process actually explicitly killed mysqld on the joiner. Do I understand you correctly? If so, do you think the kill could have occurred due to a systemd timeout (i.e. the problem described by MDEV-9202 )?

Andrii Nikitin (Inactive) added a comment - 2017-11-08 18:15

Any crash is a bug, just it looks that conditions which caused this particular crash are linked to user actions. (Small chance exists that it was sst script which generated that `kill` command , which is reflected in `signal_hand` call in the problem stack trace, but I doubt it). kill -HUP is not dangerous for uptime, or at least shouldn't be.

Andrii Nikitin (Inactive) added a comment - 2017-11-08 18:15 Any crash is a bug, just it looks that conditions which caused this particular crash are linked to user actions. (Small chance exists that it was sst script which generated that `kill` command , which is reflected in `signal_hand` call in the problem stack trace, but I doubt it). kill -HUP is not dangerous for uptime, or at least shouldn't be.

Geoff Montee (Inactive) added a comment - 2017-11-13 18:12 - edited

Ah, I see what you mean. You are saying that it looks like the server received a SIGHUP right before it crashed. SIGHUP causes the server to reload privilege tables and flush its logs, which shouldn't cause a problem for a server under normal scenarios, but maybe it crashed the server in this case because the server was not completely initialized yet, as it had just finished receiving its SST.

It doesn't look like systemd would be responsible for the SIGHUP, because systemd only sends SIGHUP if SendSIGHUP is enabled. That is not enabled in MariaDB's unit file, and it defaults to "no", so it should be disabled for MariaDB.

I'll ask the user who saw this crash if they know why the server received a SIGHUP in this case.

Geoff Montee (Inactive) added a comment - 2017-11-13 18:12 - edited Ah, I see what you mean. You are saying that it looks like the server received a SIGHUP right before it crashed. SIGHUP causes the server to reload privilege tables and flush its logs, which shouldn't cause a problem for a server under normal scenarios, but maybe it crashed the server in this case because the server was not completely initialized yet, as it had just finished receiving its SST. It doesn't look like systemd would be responsible for the SIGHUP, because systemd only sends SIGHUP if SendSIGHUP is enabled. That is not enabled in MariaDB's unit file, and it defaults to "no", so it should be disabled for MariaDB. I'll ask the user who saw this crash if they know why the server received a SIGHUP in this case.

Geoff Montee (Inactive) made changes - 2017-11-28 22:09

Summary

Galera joiner node crashes in my_hash_element after rsync SST completes

SIGHUP after rsync SST causes crashe in my_hash_element

Geoff Montee (Inactive) added a comment - 2017-11-28 22:10 - edited

The SIGHUP seems to be caused by Jenkins and/or Ansible somehow.

Geoff Montee (Inactive) added a comment - 2017-11-28 22:10 - edited The SIGHUP seems to be caused by Jenkins and/or Ansible somehow.

Geoff Montee (Inactive) made changes - 2017-11-28 22:11

Summary

SIGHUP after rsync SST causes crashe in my_hash_element

SIGHUP after rsync SST causes crash in my_hash_element

Sergei Golubchik made changes - 2017-12-12 09:28

Assignee

Andrii Nikitin [ anikitin ]

Sergei Golubchik added a comment - 2017-12-12 09:54

So, while it is a bug, it is caused by a user sending SIGHUP. If someone has enough privileges to send a SIGHUP, one can send a SIGKILL just as well. So the bug is that a user, who can crash mysqld, crashes mysqld. Thus, I believe, it can be downgraded to Minor.

Sergei Golubchik added a comment - 2017-12-12 09:54 So, while it is a bug, it is caused by a user sending SIGHUP. If someone has enough privileges to send a SIGHUP, one can send a SIGKILL just as well. So the bug is that a user, who can crash mysqld, crashes mysqld. Thus, I believe, it can be downgraded to Minor.

Sergei Golubchik made changes - 2017-12-12 09:54

Priority

Major [ 3 ]

Minor [ 4 ]

Elena Stepanova added a comment - 2017-12-16 02:59

Setting to Confirmed based on comments above

Elena Stepanova added a comment - 2017-12-16 02:59 Setting to Confirmed based on comments above

Elena Stepanova made changes - 2017-12-16 02:59

Status

Open [ 1 ]

Confirmed [ 10101 ]

Geoff Montee (Inactive) made changes - 2018-04-16 21:46

Link

This issue relates to ~~MDEV-15891~~ [ ~~MDEV-15891~~ ]

Jan Lindström (Inactive) made changes - 2018-07-19 04:44

Assignee

Andrei Elkin [ elkin ]

Jan Lindström (Inactive) added a comment - 2018-07-19 04:45

Elkin Can you have a look as stack seems to contain replication related functions.

Jan Lindström (Inactive) added a comment - 2018-07-19 04:45 Elkin Can you have a look as stack seems to contain replication related functions.

Sergei Golubchik made changes - 2021-12-06 21:36

Workflow

MariaDB v3 [ 83622 ]

MariaDB v4 [ 143985 ]

Jan Lindström added a comment - 2023-06-06 09:28

10.1 is EOL.

Jan Lindström added a comment - 2023-06-06 09:28 10.1 is EOL.

Jan Lindström made changes - 2023-06-06 09:28

Fix Version/s		N/A [ 14700 ]
Fix Version/s	10.1 [ 16100 ]
Resolution		Won't Fix [ 2 ]
Status	Confirmed [ 10101 ]	Closed [ 6 ]

Jira Automation (IT) made changes - 2024-07-04 08:10

Zendesk Related Tickets

171572

MariaDB Server

SIGHUP after rsync SST causes crash in my_hash_element

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration