[MDEV-14282] SIGHUP after rsync SST causes crash in my_hash_element Created: 2017-11-03  Updated: 2023-06-06  Resolved: 2023-06-06

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST, wsrep
Affects Version/s: 10.1.28
Fix Version/s: N/A

Type: Bug Priority: Minor
Reporter: Geoff Montee (Inactive) Assignee: Andrei Elkin
Resolution: Won't Fix Votes: 0
Labels: crash, galera, sst, wsrep

Issue Links:
Relates
relates to MDEV-15891 SIGHUP during rsync SST causes SST to... Closed

 Description   

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required: 
	Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
	Local state: 00000000-0000-0000-0000-000000000000:-1
2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.
2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/'  --defaults-file '/app/mysql/config/galera.cnf'  --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst
2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3
2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.
2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)
	 at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.
2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)
2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0
2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S
2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.
WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)
db-g3-bin.000042
ls: cannot access db-node1-binlog.*: No such file or directory
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)
WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)
2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.
2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453
2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.
2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...
2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log
2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log
InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.
2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start
2017-11-03 17:07:15 140515632003328 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235
2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.
2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started
2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.
 
Status information:
 
Current dir: /app/mysql/galera/
Running threads: 2  Stack size: 295936
Current locks:
lock: 0x7fc98da48788:
 
lock: 0x7fc9cdc6a508:
 
lock: 0x7fcc598b5520:
 
lock: 0x7fcc598b5440:
 
lock: 0x7fcc598b4480:
 
lock: 0x7fcc598b43a0:
 
lock: 0x7fcc598b42c0:
 
lock: 0x7fcc598b4640:
 
lock: 0x7fcc598b2880:
 
lock: 0x7fcc598b5360:
 
lock: 0x7fcc598b2f80:
 
lock: 0x7fcc598b3220:
 
lock: 0x7fcc598b2ea0:
 
lock: 0x7fcc598b3140:
 
lock: 0x7fcc598b2dc0:
 
lock: 0x7fcc598b3060:
 
lock: 0x7fcc598b35a0:
 
lock: 0x7fcc598b3680:
 
lock: 0x7fcc598b3760:
 
lock: 0x7fcc598b2ce0:
 
lock: 0x7fcc598b2a40:
 
lock: 0x7fcc598b2c00:
 
lock: 0x7fcc598b2960:
 
lock: 0x7fcc598b2b20:
 
lock: 0x7fcc598b3300:
 
lock: 0x7fcc598b33e0:
 
lock: 0x7fcc598b34c0:
 
lock: 0x7fcc598b4fe0:
 
lock: 0x7fcc598b5280:
 
lock: 0x7fcc598b51a0:
 
lock: 0x7fcc598b50c0:
 
lock: 0x7fcc598b4c60:
 
lock: 0x7fcc598b4b80:
 
lock: 0x7fcc598b4aa0:
 
lock: 0x7fcc598b49c0:
 
lock: 0x7fcc598b48e0:
 
lock: 0x7fcc598b4e20:
 
lock: 0x7fcc598b4800:
 
lock: 0x7fcc598b4720:
 
lock: 0x7fcc598b4f00:
 
lock: 0x7fcc598b4560:
 
lock: 0x7fcc598b4100:
 
lock: 0x7fcc598b41e0:
 
lock: 0x7fcc598b4020:
 
lock: 0x7fcc598b3f40:
 
lock: 0x7fcc598b3bc0:
 
lock: 0x7fcc598b3e60:
 
lock: 0x7fcc598b3d80:
 
lock: 0x7fcc598b3ae0:
 
lock: 0x7fcc598b3ca0:
 
lock: 0x7fcc598b3840:
 
lock: 0x7fcc598b3920:
 
lock: 0x7fcc598b3a00:
 
lock: 0x7fcc598b4d40:
 
 
Key caches:
default
Buffer_size:     134217728
Block_size:           1024
Division_limit:        100
Age_threshold:         300
Partitions:              0
blocks used:             0
not flushed:             0
w_requests:              0
writes:                  0
r_requests:              0
reads:                   0
 
 
handler status:
read_key:            0
read_next:           0
read_rnd             0
read_first:          0
write:               0
delete               0
update:              0
 
Table status:
Opened tables:          2
Open tables:            2
Open files:             8
Open streams:           0
 
Alarm status:
Active alarms:   0
Max used alarms: 0
Next alarm time: 0
 
Memory status:
Non-mmapped space allocated from system: 0
Number of free chunks:			 1
Number of fastbin blocks:		 0
Number of mmapped regions:		 0
Space in mmapped regions:		 0
Maximum total allocated space:		 0
Space available in freed fastbin blocks: 0
Total allocated space:			 0
Total free space:			 0
Top-most, releasable space:		 0
Estimated memory (with thread stack):    591872
 
 
 
Events status:
LLA = Last Locked At  LUA = Last Unlocked At
WOC = Waiting On Condition  DL = Data Locked
The Event Scheduler is disabled
 
2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.
2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977
2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4
171103 17:07:23 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.1.28-MariaDB
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=1002
thread_count=10
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48400
mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]
2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded
2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table
2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.
Version: '10.1.28-MariaDB'  socket: '/app/mysql/temporaries/galera/mysqld.sock'  port: 3306  MariaDB Server
2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.
2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)
mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]
/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]
addr2line: 'mysqld': No such file
mysqld(my_hash_element+0x13)[0x7fcc5900c023]
mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]
mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]
mysqld(signal_hand+0x382)[0x7fcc5892c812]
/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]
/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

I'm not sure whether it provides any hint as to the cause of the crash, but the donor's error log contained many messages about long semaphore waits involving log_sys->checkpoint_lock during the period in which it acted as donor:

InnoDB: ###### Diagnostic info printed to the standard error stream
WSREP: avoiding InnoDB self crash due to long semaphore wait of  > 600 seconds
Server is processing SST donor operation, fatal_cnt now: 0
InnoDB: Warning: a long semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: semaphore wait:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0
InnoDB: Warning: Writer thread is waiting this semaphore:
--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:
S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'
a writer (thread id 140439973771008) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file log0log.cc line 2352
Last time write locked in file log0log.cc line 2109
Holder thread 0 file not yet reserved line 0

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and it completed successfully without a crash.



 Comments   
Comment by Andrii Nikitin (Inactive) [ 2017-11-08 ]

Geoff, I am not sure if the checkpoint on the master is related, but the crash happened during something like kill -HUP $(pidof mysqld) (which follows something like mysqladmin debug) on the problem node.
So, the most we can do is to try to find out if the user was running some custom commands/scripts during/after the SST and try to write a test case that replicates that behavior.

Comment by Geoff Montee (Inactive) [ 2017-11-08 ]

Thanks, Andrii. It sounds like you don't think that this was a crash due to a bug, and instead, some user or other process actually explicitly killed mysqld on the joiner. Do I understand you correctly? If so, do you think the kill could have occurred due to a systemd timeout (i.e. the problem described by MDEV-9202)?

Comment by Andrii Nikitin (Inactive) [ 2017-11-08 ]

Any crash is a bug; it just looks like the conditions that caused this particular crash are linked to user actions. (There is a small chance that it was the SST script which generated that `kill` command, which is reflected in the `signal_hand` call in the problem stack trace, but I doubt it.) kill -HUP is not dangerous for uptime, or at least shouldn't be.

Comment by Geoff Montee (Inactive) [ 2017-11-13 ]

Ah, I see what you mean. You are saying that it looks like the server received a SIGHUP right before it crashed. SIGHUP causes the server to reload privilege tables and flush its logs, which shouldn't cause a problem under normal circumstances, but it may have crashed the server in this case because the server was not yet fully initialized, having just finished receiving its SST.

It doesn't look like systemd would be responsible for the SIGHUP, because systemd only sends SIGHUP if SendSIGHUP is enabled. That is not enabled in MariaDB's unit file, and it defaults to "no", so it should be disabled for MariaDB.
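One way to confirm this on an affected host (assuming a systemd-managed service named mariadb.service) is to query the unit's effective property:

```shell
# Ask systemd for the effective SendSIGHUP setting of the MariaDB unit.
# If the unit file doesn't set it, systemd's default is "no", meaning
# systemd itself never delivers SIGHUP to the service's processes.
systemctl show mariadb.service --property=SendSIGHUP
# Expected on a default install: SendSIGHUP=no
```

If this prints SendSIGHUP=no, the SIGHUP must have come from somewhere else (a user, a script, or a process manager other than systemd).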

I'll ask the user who saw this crash if they know why the server received a SIGHUP in this case.

Comment by Geoff Montee (Inactive) [ 2017-11-28 ]

The SIGHUP seems to be caused by Jenkins and/or Ansible somehow.

Comment by Sergei Golubchik [ 2017-12-12 ]

So, while it is a bug, it is caused by a user sending SIGHUP. If someone has enough privileges to send SIGHUP, they can send SIGKILL just as well. So the bug is that a user who can crash mysqld crashes mysqld. Thus, I believe, it can be downgraded to Minor.

Comment by Elena Stepanova [ 2017-12-16 ]

Setting to Confirmed based on comments above

Comment by Jan Lindström (Inactive) [ 2018-07-19 ]

Elkin, can you have a look? The stack seems to contain replication-related functions.

Comment by Jan Lindström [ 2023-06-06 ]

10.1 is EOL.
