[MDEV-32734] Galera SST using rsync over stunnel fails Created: 2023-11-08  Updated: 2023-11-10

Status: Open
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.5.21
Fix Version/s: 10.5, 10.6

Type: Bug Priority: Major
Reporter: Richard Assignee: Teemu Ollakka
Resolution: Unresolved Votes: 0
Labels: galera, rsync, sst
Environment:

Server version: 10.5.21-MariaDB-0+deb11u1 Debian 11
uname: Linux 5.10.0-26-amd64 #1 SMP Debian 5.10.197-1 (2023-09-29) x86_64 GNU/Linux



 Description   

Before enabling encryption over stunnel, no issues were observed. After stunnel was enabled, SST during startup fails. The problem is not consistent, but seems to happen quite regularly when RTT between donor and joiner is high (> 100ms).

I did some debugging, and the issue seems to be related to a race condition where stunnel is killed prematurely, before rsync is 100% complete. Below is some relevant lines from the log when the problem happens (with stunnel debug = info):

Nov  8 15:14:01 xxx rsyncd[35829]: connect from yyy (x.x.x.x)
Nov  8 15:14:01 xxx rsyncd[35829]: rsync to rsync_sst/ from yyy (x.x.x.x)
Nov  8 15:14:01 xxx rsyncd[35829]: receiving file list
Nov  8 15:14:01 xxx mariadbd[35677]: WSREP_SST: [INFO] rsync IST completed on joiner (20231108 15:14:01.897)
Nov  8 15:14:01 xxx mariadbd[35677]: WSREP_SST: [INFO] Joiner cleanup: rsync PID=0, stunnel PID=35814 (20231108 15:14:01.900)
Nov  8 15:14:01 xxx stunnel: LOG5[ui]: Terminated
Nov  8 15:14:01 xxx stunnel: LOG5[ui]: Terminating 2 service thread(s)
Nov  8 15:14:01 xxx stunnel: LOG5[ui]: Service threads terminated
Nov  8 15:14:02 xxx rsyncd[35829]: rsync: connection unexpectedly closed (90 bytes received so far) [generator]
Nov  8 15:14:02 xxx mariadbd[35829]: rsync: connection unexpectedly closed (90 bytes received so far) [generator]
Nov  8 15:14:02 xxx mariadbd[35829]: rsync error: error in rsync protocol data stream (code 12) at io.c(228) [generator=3.2.3]
Nov  8 15:14:02 xxx rsyncd[35829]: rsync error: error in rsync protocol data stream (code 12) at io.c(228) [generator=3.2.3]

If adding a "sleep 5" to the end of the wsrep_sst_rsync script, just before the exit trap kills stunnel, the problem is no longer observed:

--- wsrep_sst_rsync.bak	2023-07-23 23:26:58.000000000 +0200
+++ wsrep_sst_rsync	2023-11-08 15:59:18.793493850 +0100
@@ -1003,4 +1003,5 @@
fi
 
wsrep_log_info "$WSREP_METHOD $WSREP_TRANSFER_TYPE completed on $WSREP_SST_OPT_ROLE"
+sleep 5
exit 0

success:

Nov  8 16:33:17 xxx rsyncd[63392]: connect from yyy (x.x.x.x)
Nov  8 16:33:18 xxx rsyncd[63392]: rsync to rsync_sst/ from yyy (x.x.x.x)
Nov  8 16:33:18 xxx rsyncd[63392]: receiving file list
Nov  8 16:33:18 xxx mariadbd[63240]: WSREP_SST: [INFO] rsync IST completed on joiner (20231108 16:33:18.489)
Nov  8 16:33:18 xxx rsyncd[63392]: sent 48 bytes  received 194 bytes  total size 52
Nov  8 16:33:18 xxx stunnel: LOG6[0]: Read socket closed (readsocket)
Nov  8 16:33:18 xxx stunnel: LOG6[0]: SSL_shutdown successfully sent close_notify alert
Nov  8 16:33:18 xxx stunnel: LOG6[ui]: Child process 63392 finished with code 0
Nov  8 16:33:18 xxx stunnel: LOG6[0]: SSL_read: Socket is closed
Nov  8 16:33:18 xxx stunnel: LOG6[0]: TLS socket closed (SSL_read)
Nov  8 16:33:18 xxx stunnel: LOG5[0]: Connection closed: 111 byte(s) sent to TLS, 289 byte(s) sent to socket
Nov  8 16:33:23 xxx mariadbd[63240]: WSREP_SST: [INFO] Joiner cleanup: rsync PID=0, stunnel PID=63377 (20231108 16:33:23.492)
Nov  8 16:33:23 xxx stunnel: LOG5[ui]: Terminated
Nov  8 16:33:23 xxx stunnel: LOG5[ui]: Terminating 1 service thread(s)
Nov  8 16:33:23 xxx stunnel: LOG5[ui]: Service threads terminated


Generated at Thu Feb 08 10:33:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.