[MDEV-25880] rsync may be mistakenly killed when overlapping SST Created: 2021-06-09  Updated: 2021-06-22  Resolved: 2021-06-15

Status: Closed
Project: MariaDB Server
Component/s: Galera, Galera SST
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6
Fix Version/s: 10.2.40, 10.3.31, 10.4.21, 10.5.12, 10.6.3

Type: Bug
Priority: Critical
Reporter: Julius Goryavsky
Assignee: Julius Goryavsky
Resolution: Fixed
Votes: 0
Labels: None

Issue Links:
Issue split
Problem/Incident
Relates
relates to MDEV-24097 galera_3nodes suite tests in MTR spor... Closed

 Description   

This bug was originally seen in the galera_nbo_sst_slave mtr test for 10.6. However, it is relevant for all versions and can lead to intermittent crashes of rsync-based SST on very fast server restarts, when a new SST process (for example, after starting a new server) overlaps the old SST process from the previous (already terminated) server. This overlap can result in the new rsync being killed instead of the old rsync, or in the new rsync's pid file being removed, which then leads to problems.
For example:

2021-06-09  3:28:56 0 [Warning] WSREP: 0.0 (panda): State transfer to 1.0 (panda) failed: -11 (Resource temporarily unavailable)
2021-06-09  3:28:56 0 [ERROR] WSREP: /home/panda/galera-es-4.x/gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():1205: Will never receive state. Need to abort.
2021-06-09  3:28:56 0 [Note] WSREP: gcomm: terminating thread
2021-06-09  3:28:56 0 [Note] WSREP: gcomm: joining thread
2021-06-09  3:28:56 0 [Note] WSREP: gcomm: closing backend
2021-06-09  3:28:56 2 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): discarded 24 bytes
2021-06-09  3:28:56 2 [Note] WSREP: GCache DEBUG: RingBuffer::seqno_reset(): found 1/2 locked buffers
2021-06-09  3:28:57 0 [Note] WSREP: PC protocol downgrade 1 -> 0
2021-06-09  3:28:57 0 [Note] WSREP: view((empty))
2021-06-09  3:28:57 0 [Note] WSREP: gcomm: closed
2021-06-09  3:28:57 0 [Note] WSREP: /home/panda/maria-10.6/build/sql/mariadbd: Terminated.
2021-06-09  3:28:58 0 [Warning] WSREP: option --wsrep-causal-reads is deprecated
2021-06-09  3:28:58 0 [Note] /home/panda/maria-10.6/build/sql/mariadbd (mysqld 10.6.1-1-MariaDB-debug-log) starting as process 410627 ...
.................
.................
2021-06-09  3:28:58 0 [Note] WSREP: save pc into disk
WSREP_SST: [ERROR] Parent mysqld process (PID: 410497) terminated unexpectedly. (20210609 03:28:58.800)
/home/panda/maria-10.6/build/scripts/wsrep_sst_rsync: line 681: kill: (-410497) - No such process
WSREP_SST: [INFO] Joiner cleanup: rsync PID=0, stunnel PID=410592 (20210609 03:28:58.803)
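
The overlap described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the code from wsrep_sst_rsync or from the fixing commits: the pid file name `rsync.pid`, the PIDs 1001 and 2002, and the functions `naive_cleanup` and `guarded_cleanup` are all hypothetical. No signal is actually sent; each cleanup function only returns the PID it would have killed. The guard shown (act only if the pid file still holds the PID this SST run started) is one possible mitigation, not necessarily the one the fix uses.

```python
import os
import tempfile


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if missing or malformed."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def naive_cleanup(pid_file):
    """Unsafe: trusts whatever PID the file holds at cleanup time."""
    victim = read_pid(pid_file)
    if os.path.exists(pid_file):
        os.remove(pid_file)
    return victim  # the PID that would be signalled


def guarded_cleanup(pid_file, my_rsync_pid):
    """Act only if the pid file still belongs to this SST run."""
    if read_pid(pid_file) != my_rsync_pid:
        return None  # file was replaced by a newer SST; leave it alone
    os.remove(pid_file)
    return my_rsync_pid


def simulate(cleanup):
    """Old SST writes its PID, new SST overwrites it, then old cleanup runs."""
    with tempfile.TemporaryDirectory() as d:
        pid_file = os.path.join(d, "rsync.pid")  # hypothetical file name
        with open(pid_file, "w") as f:
            f.write("1001\n")  # old rsync (made-up PID)
        with open(pid_file, "w") as f:
            f.write("2002\n")  # new rsync from the restarted server
        victim = cleanup(pid_file)
        return victim, os.path.exists(pid_file)


if __name__ == "__main__":
    print(simulate(naive_cleanup))
    print(simulate(lambda p: guarded_cleanup(p, 1001)))
```

In this simulation the naive cleanup picks the new rsync's PID and deletes its pid file, which matches the two symptoms in the description, while the guarded variant leaves both alone.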



 Comments   
Comment by Julius Goryavsky [ 2021-06-09 ]

This commit fixes a bug that was originally discovered in the galera_nbo_sst_slave mtr test for the 10.6 branch. However, it is relevant for all versions and can lead to intermittent crashes of rsync-based SST on very fast server restarts, when a new SST process (for example, after starting a new server instance) overlaps the old SST process started by the previous, already terminated server. This overlap can result in the new rsync being killed instead of the old rsync, or in the new rsync's pid file being removed, which then leads to problems.
https://github.com/MariaDB/server/commit/dfb6931fe214cacbfbf889d1b6a6273221987697
https://buildbot.askmonty.org/buildbot/grid?category=main&branch=bb-10.2-MDEV-25880-galera

Comment by Jan Lindström (Inactive) [ 2021-06-10 ]

ok to push

Comment by Jan Lindström (Inactive) [ 2021-06-15 ]

ok to push when bb has finished testing.

Comment by Julius Goryavsky [ 2021-06-15 ]

Fixed by https://github.com/MariaDB/server/commit/2edb8e12e10179b970007b3e1d5c465b9d0e110e and
https://github.com/MariaDB/server/commit/18d5be5b54b1a05e6107a1c5828d9eed9cf18636
