[MDEV-15891] SIGHUP during rsync SST causes SST to fail Created: 2018-04-16  Updated: 2023-06-06  Resolved: 2023-06-06

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.1.32
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Geoff Montee (Inactive)
Assignee: Seppo Jaakola
Resolution: Won't Fix
Votes: 2
Labels: rsync, sst

Issue Links:
Relates
relates to MDEV-14282 SIGHUP after rsync SST causes crash i... Closed

 Description   

This is similar to MDEV-14282, but it seems slightly different.

If a SIGHUP is received during an rsync SST, then the SST will fail:

Hangup
WSREP_SST: [INFO] Joiner cleanup. rsync PID: 11637 (20180411 18:48:23.814)
WSREP_SST: [INFO] Joiner cleanup done. (20180411 18:48:24.318)
2018-04-11 18:48:24 140510861715200 [Warning] WSREP: 0.0 (144-70-2-195.domain.com): State transfer to 1.0 (144-70-17-147.domain.com) failed: -255 (Unknown error 255)
2018-04-11 18:48:24 140510861715200 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():737: Will never receive state. Need to abort.
2018-04-11 18:48:24 140510861715200 [Note] WSREP: gcomm: terminating thread
2018-04-11 18:48:24 140510861715200 [Note] WSREP: gcomm: joining thread
2018-04-11 18:48:24 140510832359168 [ERROR] WSREP: Process completed with error: wsrep_sst_rsync --role 'joiner' --address 'server1.domain.com' --datadir '/app/mysql/galera/' --defaults-file '/app/mysql/config/galera.cnf' --parent '11591' --binlog '/app/mysql/galera/server1-binlog' : 32 (Broken pipe)
2018-04-11 18:48:24 140510832359168 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
2018-04-11 18:48:24 140521767868672 [ERROR] WSREP: SST failed: 32 (Broken pipe)

Jenkins and/or Ansible seem to send SIGHUP signals for some reason, and that is when this issue occurs.



 Comments   
Comment by Sergei Golubchik [ 2018-04-17 ]

Would running mysqld under nohup be a workaround?
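
For readers unfamiliar with what nohup does, here is a minimal C sketch of the mechanism: set SIGHUP to ignored, then exec the target program, relying on ignored dispositions being inherited across exec. The helper name run_ignoring_sighup is hypothetical and is not part of MariaDB or Galera. Whether this would help in this case is not established here: a program that installs its own SIGHUP handler replaces the ignored disposition, and a later comment on this issue notes that SST child processes are started with SIGHUP explicitly not ignored.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not MariaDB code): fork, ignore SIGHUP in the child,
 * then exec argv. An ignored disposition survives execve(), which is the
 * property nohup(1) relies on.
 * Returns the child's wait status, or -1 on error.
 */
int run_ignoring_sighup(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        signal(SIGHUP, SIG_IGN);  /* stays "ignored" across exec */
        execvp(argv[0], argv);
        _exit(127);               /* only reached if exec failed */
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

Called with something like {"sh", "-c", "kill -HUP $$; exit 7", NULL}, the shell should survive its own SIGHUP and exit normally, which can be checked with WIFEXITED and WEXITSTATUS on the returned status.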

Comment by Seppo Jaakola [ 2018-11-07 ]

SST processes are configured with the following signals enabled:

/* make sure the following signals are not ignored in child process */
sigset_t default_signals;
sigemptyset(&default_signals);
sigaddset(&default_signals, SIGHUP);
sigaddset(&default_signals, SIGINT);
sigaddset(&default_signals, SIGQUIT);
sigaddset(&default_signals, SIGPIPE);
sigaddset(&default_signals, SIGTERM);
sigaddset(&default_signals, SIGCHLD);

These can be changed, of course, but what actual problem does it cause if the SST process can be interrupted by SIGHUP?
Note that the related MDEV-14282 has had its priority lowered to "Minor", while this one has "Critical" priority.
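
To illustrate what a signal set like default_signals can do, the sketch below passes one to posix_spawn so that SIGHUP is reset to its default action in the child, even when the parent ignores it. This is an assumption about how such a set is applied; the comment above does not show the spawning code, and the helper name spawn_with_default_sighup is hypothetical. Only SIGHUP is added to the set, to keep the example short.

```c
#include <signal.h>
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/*
 * Hypothetical helper (not MariaDB code): spawn argv with SIGHUP forced
 * back to its default action in the child, regardless of the parent's
 * disposition. Returns the child's wait status, or -1 on error.
 */
int spawn_with_default_sighup(char *const argv[])
{
    sigset_t default_signals;
    posix_spawnattr_t attr;
    pid_t pid;
    int status = 0;
    int rc;

    /* Signals that must not stay ignored in the child. */
    sigemptyset(&default_signals);
    sigaddset(&default_signals, SIGHUP);

    if (posix_spawnattr_init(&attr) != 0)
        return -1;

    rc = posix_spawnattr_setsigdefault(&attr, &default_signals);
    if (rc == 0)
        rc = posix_spawnattr_setflags(&attr, POSIX_SPAWN_SETSIGDEF);
    if (rc == 0)
        rc = posix_spawnp(&pid, argv[0], NULL, &attr, argv, environ);
    posix_spawnattr_destroy(&attr);
    if (rc != 0)
        return -1;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return status;
}
```

If this model matches the real spawning code, a SIGHUP delivered to the SST script would terminate it even when the parent server ignores SIGHUP, which would be consistent with the "Hangup" line in the log above.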

Comment by Geoff Montee (Inactive) [ 2018-11-07 ]

seppo,

The main problem that we've seen is that Ansible seems to raise SIGHUPs at strange times, so if a DBA starts a Galera node with Ansible and the node SSTs, then SST failures are common due to SIGHUPs. I do not know why Ansible is raising the signal to begin with, though.

Comment by Jan Lindström [ 2023-06-06 ]

10.1 is EOL.
