[MDEV-14282] SIGHUP after rsync SST causes crash in my_hash_element - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 10.1.28
Fix Version/s: N/A
Component/s: Galera, Galera SST, wsrep
Labels:
- crash
- galera
- sst
- wsrep

Description

A joiner node performed an rsync SST from another node in the cluster, and when the SST finished, the node saw the following crash:

2017-11-03 16:44:46 140515631684352 [Note] WSREP: State transfer required:

	Group state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977

	Local state: 00000000-0000-0000-0000-000000000000:-1

2017-11-03 16:44:46 140515631684352 [Note] WSREP: New cluster view: global state: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977, view# 102: Primary, number of nodes: 4, my index: 3, protocol version 3

2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Gap in state sequence. Need state transfer.

2017-11-03 16:44:46 140504717063936 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '10.119.17.237' --datadir '/app/mysql/galera/'  --defaults-file '/app/mysql/config/galera.cnf'  --parent '8214' --binlog '/app/mysql/galera/db-node1-binlog' '

2017-11-03 16:44:46 140515631684352 [Note] WSREP: Prepared SST request: rsync|10.119.17.237:4444/rsync_sst

2017-11-03 16:44:46 140515631684352 [Note] WSREP: REPL Protocols: 7 (3, 2)

2017-11-03 16:44:46 140515631684352 [Note] WSREP: Assign initial position for certification: 230930977, protocol version: 3

2017-11-03 16:44:46 140504792139520 [Note] WSREP: Service thread queue flushed.

2017-11-03 16:44:46 140515631684352 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (b1c4fe17-b433-11e7-bd44-52c2748b6170): 1 (Operation not permitted)

	 at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.

2017-11-03 16:44:46 140504733837056 [Note] WSREP: Member 3.0 (10-119-17-237.domain.com) requested state transfer from '*any*'. Selected 0.0 (10-119-17-69.domain.com)(SYNCED) as donor.

2017-11-03 16:44:46 140504733837056 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 230930977)

2017-11-03 16:44:46 140515631684352 [Note] WSREP: Requesting state transfer: success, donor: 0

2017-11-03 16:44:46 140515631684352 [Note] WSREP: GCache history reset: old(00000000-0000-0000-0000-000000000000:0) -> new(b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977)

2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') connection to peer d3790118 with addr tcp://10.119.17.237:4567 timed out, no messages seen in PT3S

2017-11-03 16:44:49 140504742229760 [Note] WSREP: (d3790118, 'tcp://0.0.0.0:4567') turning message relay requesting off

2017-11-03 17:06:34 140504733837056 [Note] WSREP: 0.0 (10-119-17-69.domain.com): State transfer to 3.0 (10-119-17-237.domain.com) complete.

WSREP_SST: [INFO] Extracting binlog files: (20171103 17:06:34.145)

db-g3-bin.000042

ls: cannot access db-node1-binlog.*: No such file or directory

WSREP_SST: [INFO] Joiner cleanup. rsync PID: 8293 (20171103 17:06:34.156)

WSREP_SST: [INFO] Joiner cleanup done. (20171103 17:06:34.662)

2017-11-03 17:06:34 140515632003328 [Note] WSREP: SST complete, seqno: 230930977

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using mutexes to ref count buffer pool pages

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: The InnoDB memory heap is disabled

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Compressed tables use zlib 1.2.7

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using Linux native AIO

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Using SSE crc32 instructions

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Initializing buffer pool, size = 512.0M

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Completed initialization of buffer pool

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Highest supported file format is Barracuda.

2017-11-03 17:06:34 140515632003328 [Note] InnoDB: Starting crash recovery from checkpoint LSN=1244892544453

2017-11-03 17:06:36 140504733837056 [Note] WSREP: Member 0.0 (10-119-17-69.domain.com) synced with group.

2017-11-03 17:06:42 140515632003328 [Note] InnoDB: Restoring possible half-written data pages from the doublewrite buffer...

2017-11-03 17:06:49 140515632003328 [Note] InnoDB: To recover: 2702 pages from log

2017-11-03 17:06:53 140515632003328 [Note] InnoDB: Starting final batch to recover 2585 pages from redo log

InnoDB: Last MySQL binlog file position 0 786976686, file name /app/mysql/galera/db-g3-bin.000041

2017-11-03 17:07:15 140515632003328 [Note] InnoDB: 128 rollback segment(s) are active.

2017-11-03 17:07:15 140515632003328 [Note] InnoDB: Waiting for purge to start

2017-11-03 17:07:15 140515632003328 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.36-82.2 started; log sequence number 1244894457235

2017-11-03 17:07:23 140515632003328 [Note] Plugin 'FEEDBACK' is disabled.

2017-11-03 17:07:23 140503555213056 [Note] InnoDB: Dumping buffer pool(s) not yet started

2017-11-03 17:07:23 140515632003328 [Note] Server socket created on IP: '10.119.17.237'.

Status information:

Current dir: /app/mysql/galera/

Running threads: 2  Stack size: 295936

Current locks:

lock: 0x7fc98da48788:

lock: 0x7fc9cdc6a508:

lock: 0x7fcc598b5520:

lock: 0x7fcc598b5440:

lock: 0x7fcc598b4480:

lock: 0x7fcc598b43a0:

lock: 0x7fcc598b42c0:

lock: 0x7fcc598b4640:

lock: 0x7fcc598b2880:

lock: 0x7fcc598b5360:

lock: 0x7fcc598b2f80:

lock: 0x7fcc598b3220:

lock: 0x7fcc598b2ea0:

lock: 0x7fcc598b3140:

lock: 0x7fcc598b2dc0:

lock: 0x7fcc598b3060:

lock: 0x7fcc598b35a0:

lock: 0x7fcc598b3680:

lock: 0x7fcc598b3760:

lock: 0x7fcc598b2ce0:

lock: 0x7fcc598b2a40:

lock: 0x7fcc598b2c00:

lock: 0x7fcc598b2960:

lock: 0x7fcc598b2b20:

lock: 0x7fcc598b3300:

lock: 0x7fcc598b33e0:

lock: 0x7fcc598b34c0:

lock: 0x7fcc598b4fe0:

lock: 0x7fcc598b5280:

lock: 0x7fcc598b51a0:

lock: 0x7fcc598b50c0:

lock: 0x7fcc598b4c60:

lock: 0x7fcc598b4b80:

lock: 0x7fcc598b4aa0:

lock: 0x7fcc598b49c0:

lock: 0x7fcc598b48e0:

lock: 0x7fcc598b4e20:

lock: 0x7fcc598b4800:

lock: 0x7fcc598b4720:

lock: 0x7fcc598b4f00:

lock: 0x7fcc598b4560:

lock: 0x7fcc598b4100:

lock: 0x7fcc598b41e0:

lock: 0x7fcc598b4020:

lock: 0x7fcc598b3f40:

lock: 0x7fcc598b3bc0:

lock: 0x7fcc598b3e60:

lock: 0x7fcc598b3d80:

lock: 0x7fcc598b3ae0:

lock: 0x7fcc598b3ca0:

lock: 0x7fcc598b3840:

lock: 0x7fcc598b3920:

lock: 0x7fcc598b3a00:

lock: 0x7fcc598b4d40:

Key caches:

default

Buffer_size:     134217728

Block_size:           1024

Division_limit:        100

Age_threshold:         300

Partitions:              0

blocks used:             0

not flushed:             0

w_requests:              0

writes:                  0

r_requests:              0

reads:                   0

handler status:

read_key:            0

read_next:           0

read_rnd             0

read_first:          0

write:               0

delete               0

update:              0

Table status:

Opened tables:          2

Open tables:            2

Open files:             8

Open streams:           0

Alarm status:

Active alarms:   0

Max used alarms: 0

Next alarm time: 0

Memory status:

Non-mmapped space allocated from system: 0

Number of free chunks:			 1

Number of fastbin blocks:		 0

Number of mmapped regions:		 0

Space in mmapped regions:		 0

Maximum total allocated space:		 0

Space available in freed fastbin blocks: 0

Total allocated space:			 0

Total free space:			 0

Top-most, releasable space:		 0

Estimated memory (with thread stack):    591872

Events status:

LLA = Last Locked At  LUA = Last Unlocked At

WOC = Waiting On Condition  DL = Data Locked

The Event Scheduler is disabled

2017-11-03 17:07:23 140515632003328 [Note] WSREP: Signalling provider to continue.

2017-11-03 17:07:23 140515632003328 [Note] WSREP: SST received: b1c4fe17-b433-11e7-bd44-52c2748b6170:230930977

2017-11-03 17:07:23 140515630676736 [Note] Event Scheduler: scheduler thread started with id 4

171103 17:07:23 [ERROR] mysqld got signal 11 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 10.1.28-MariaDB

key_buffer_size=134217728

read_buffer_size=131072

max_used_connections=0

max_threads=1002

thread_count=10

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2332069 K  bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x0 thread_stack 0x48400

mysqld(my_print_stacktrace+0x2e)[0x7fcc5902d92e]

2017-11-03 17:07:23 140515632003328 [Note] Reading of all Master_info entries succeded

2017-11-03 17:07:23 140515632003328 [Note] Added new Master_info '' to hash table

2017-11-03 17:07:23 140515632003328 [Note] mysqld: ready for connections.

Version: '10.1.28-MariaDB'  socket: '/app/mysql/temporaries/galera/mysqld.sock'  port: 3306  MariaDB Server

2017-11-03 17:07:23 140504733837056 [Note] WSREP: 3.0 (10-119-17-237.domain.com): State transfer from 0.0 (10-119-17-69.domain.com) complete.

2017-11-03 17:07:23 140504733837056 [Note] WSREP: Shifting JOINER -> JOINED (TO: 230971731)

mysqld(handle_fatal_signal+0x305)[0x7fcc58b50ce5]

/lib64/libpthread.so.0(+0xf370)[0x7fcc58161370]

addr2line: 'mysqld': No such file

mysqld(my_hash_element+0x13)[0x7fcc5900c023]

mysqld(_ZN17Master_info_index20flush_all_relay_logsEv+0x93)[0x7fcc58a98c83]

mysqld(_Z20reload_acl_and_cacheP3THDyP10TABLE_LISTPi+0x5d2)[0x7fcc58ab8672]

mysqld(signal_hand+0x382)[0x7fcc5892c812]

/lib64/libpthread.so.0(+0x7dc5)[0x7fcc58159dc5]

/lib64/libc.so.6(clone+0x6d)[0x7fcc5657976d]

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains

information that should help you find out what is causing the crash.

I'm not sure if it could provide any hint as to the cause of the crash, but the donor's error log contained a lot of messages about long semaphore waits involving log_sys->checkpoint_lock during the time period in which it acted as donor:

InnoDB: ###### Diagnostic info printed to the standard error stream

WSREP: avoiding InnoDB self crash due to long semaphore wait of  > 600 seconds

Server is processing SST donor operation, fatal_cnt now: 0InnoDB: Warning: a long semaphore wait:

--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:

S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'

a writer (thread id 140439973771008) has reserved it in mode  exclusive

number of readers 0, waiters flag 1, lock_word: 0

Last time read locked in file log0log.cc line 2352

Last time write locked in file log0log.cc line 2109

Holder thread 0 file not yet reserved line 0

InnoDB: Warning: semaphore wait:

--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:

S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'

a writer (thread id 140439973771008) has reserved it in mode  exclusive

number of readers 0, waiters flag 1, lock_word: 0

Last time read locked in file log0log.cc line 2352

Last time write locked in file log0log.cc line 2109

Holder thread 0 file not yet reserved line 0

InnoDB: Warning: Writer thread is waiting this semaphore:

--Thread 140439973771008 has waited at log0log.cc line 2352 for 1265.0 seconds the semaphore:

S-lock on RW-latch at 0x7fbd82c5d3b8 '&log_sys->checkpoint_lock'

a writer (thread id 140439973771008) has reserved it in mode  exclusive

number of readers 0, waiters flag 1, lock_word: 0

Last time read locked in file log0log.cc line 2352

Last time write locked in file log0log.cc line 2109

Holder thread 0 file not yet reserved line 0

In case it is relevant, the donor node was a replication master, but not a slave.

After the first SST led to a crash, another SST was attempted, and that one was successful without a crash.

Attachments

Issue Links

relates to

MDEV-15891 SIGHUP during rsync SST causes SST to fail

Closed

Activity

Descending order - Click to sort in ascending order

Jan Lindström added a comment - 2023-06-06 09:28

10.1 is EOL.

Jan Lindström added a comment - 2023-06-06 09:28 10.1 is EOL.

Jan Lindström (Inactive) added a comment - 2018-07-19 04:45

Elkin Can you have a look as stack seems to contain replication related functions.

Jan Lindström (Inactive) added a comment - 2018-07-19 04:45 Elkin Can you have a look as stack seems to contain replication related functions.

Elena Stepanova added a comment - 2017-12-16 02:59

Setting to Confirmed based on comments above

Elena Stepanova added a comment - 2017-12-16 02:59 Setting to Confirmed based on comments above

Sergei Golubchik added a comment - 2017-12-12 09:54

So, while it is a bug, it is caused by a user sending SIGHUP. If someone has enough privileges to send a SIGHUP, one can send a SIGKILL just as well. So the bug is that a user, who can crash mysqld, crashes mysqld. Thus, I believe, it can be downgraded to Minor.

Sergei Golubchik added a comment - 2017-12-12 09:54 So, while it is a bug, it is caused by a user sending SIGHUP. If someone has enough privileges to send a SIGHUP, one can send a SIGKILL just as well. So the bug is that a user, who can crash mysqld, crashes mysqld. Thus, I believe, it can be downgraded to Minor.

Geoff Montee (Inactive) added a comment - 2017-11-28 22:10 - edited

The SIGHUP seems to be caused by Jenkins and/or Ansible somehow.

Geoff Montee (Inactive) added a comment - 2017-11-28 22:10 - edited The SIGHUP seems to be caused by Jenkins and/or Ansible somehow.

View 4 older comments

People

Assignee:: Andrei Elkin

Reporter:: Geoff Montee (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 2017-11-03 23:29

Updated:: 2024-07-08 00:37

Resolved:: 2023-06-06 09:28

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.