[MDEV-10934] SST fails when SSL is enabled Created: 2016-10-01  Updated: 2017-11-06  Resolved: 2017-10-13

Status: Closed
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.1.18
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: DEZILLIUM LIMITED Assignee: Andrii Nikitin (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Environment:

Debian 8.6, packages from MariaDB repo, Galera replication enabled


Issue Links:
Relates
relates to MDEV-9403 When using xtrabackup-v2 SST, socat +... Closed
relates to MDEV-14010 merge issue in wsrep_sst_xtrabackup-v2 Closed
relates to MDEV-14011 (draft) wsrep_sst_xtrabackup-v2 sst e... Closed
Sprint: 10.1.20

 Description   

Enabling SSL for SST results in a a split brain when a new node joins the cluster:

Oct 1 15:01:20 1 -innobackupex-backup: 161001 15:01:20 [01] Encrypting and streaming ./ibdata1
Oct 1 15:01:21 1 -wsrep-sst-donor: 2016/10/01 15:01:21 socat[15723] E write(13, 0xc16220, 8192): Broken pipe
Oct 1 15:01:21 1 -innobackupex-backup: #007innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
Oct 1 15:01:21 1 -innobackupex-backup: xb_stream_write_data() failed.
Oct 1 15:01:21 1 -innobackupex-backup: encrypt: write to the destination file failed.
Oct 1 15:01:21 1 -innobackupex-backup: #007innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
Oct 1 15:01:21 1 -innobackupex-backup: [01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
Oct 1 15:01:21 1 -innobackupex-backup: [01] xtrabackup: Error: failed to copy datafile.
Oct 1 15:01:21 1 mysqld[15281]: 2016-10-01 15:01:21 140606113225472 [Warning] Aborted connection 19 to db: 'unconnected' user: 'root' host: 'localhost' (Got an error re
ading communication packets)
Oct 1 15:01:21 1 -wsrep-sst-donor: innobackupex finished with error: 1. Check /var/lib/mysql//innobackup.backup.log
Oct 1 15:01:21 1 -wsrep-sst-donor: Cleanup after exit with status:22
Oct 1 15:01:21 1 -wsrep-sst-donor: Cleaning up temporary directories
Oct 1 15:01:21 1 mysqld[15281]: 2016-10-01 15:01:21 140604830906112 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '1.1.1.1
:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' '' --gtid '2320744f-86e5-11e6-9fd8-87f46ed48225:2' --gtid-domain-id '0'
Oct 1 15:01:21 1 mysqld[15281]: 2016-10-01 15:01:21 140604830906112 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '1.1.1.1:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' '' --gtid '2320744f-86e5-11e6-9fd8-87f46ed48225:2' --gtid-domain
-id '0': 22 (Invalid argument)
Oct 1 15:01:21 1 mysqld[15281]: 2016-10-01 15:01:21 140604830906112 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '1.1.1.1
:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' '' --gtid '2320744f-86e5-11e6-9fd8-87f46ed48225:2' --gtid-domain-id '0'
Oct 1 15:01:21 1 mysqld[15281]: 2016-10-01 15:01:21 140605682349824 [Warning] WSREP: 1.0 (server1): State transfer to 0.0 (server2) failed: -22 (Invalid argument)

Logs have been sanitized of course (IP/hostnames).

Configuration files were taken from a production MariaDB cluster running Galera replication. The only changes made were:
1) Enabled SSL where applicable (server/client/wsrep/SST)
2) Changed passwords (verified to be correct on both nodes)
3) Copied debian.cnf from server1 to server2

Here are the .conf settings:
[sst]
encrypt=1
encrypt-algo=AES128
encrypt-key=[some long key]
tca=/somepath/ca.pem
tcert=/somepath/combined.pem

[some long key] is of course edited, there is a string generated from openssl, without the square brackets.
somepath is a sanitized path

Tried with encrypt=1 and encrypt=3 and still failing.

wsrep_sst_method=xtrabackup-v2 declared in a [mysqld] section.

SST succeeds when streamfmt is set to tar, but of course that is unencrypted. This is not a firewall issue, it has been verified that there are rules enabling all the nodes to talk to each other.

Tried with 10.1.17 and 10.1.18 released today.



 Comments   
Comment by DEZILLIUM LIMITED [ 2016-10-04 ]

The issue lies with socat.

Using socat from jessie (Debian) doesn't work and always results in 100% CPU usage with processes that need to be killed with a -9 signal in order to terminate (100% reproducible).

Using socat from jessie-backports results in a hostname lookup failure (joiner does not have a DNS A record pointed to it, but IS declared in /etc/hosts).

Editing /usr/bin/wsrep_sst_xtrabackup-v2:
Line 235:
tcmd="socat -u stdio openssl-connect:${REMOTEHOST}:${TSST_PORT},cert=${tpem},key=${tkey},cafile=${tcert}${sockopt}"

Change to:
tcmd="socat -u stdio openssl-connect:${REMOTEIP}:${TSST_PORT},cert=${tpem},key=${tkey},cafile=${tcert}${sockopt}"

This gets the SST progress further. Since the certificates where generated with a hostname as their CN, their verification fails and SST stops again.

Generating a new set of certificates with IPs as CNs (but leaving the old ones still used by wsrep, just not for SST), and setting those under a [sst] segment (keeping the old tca, since the new set is also signed by the same CA) allows the SST to complete, fully encrypted

Edited to add:
using encrypt = 3 under [sst]

Comment by Andrii Nikitin (Inactive) [ 2017-10-05 ]

Thank you for your analysis and confirmation.
Just for reference - 'Broken pipe' usually means that some of chained command was aborted (e.g. by kill) . Especially in sst - if some of parties meets any problem - communication will be closed and thus 'Broken pipe' is very possible. So, that may mean that real error message might be on the other node or not included into provided logs.

I could reproduce similar issue with encrypt=1 (without SSL-certificated nodes) in 10.1.18 with message on joiner:

2017-10-05 13:41:13 140227855251200 [Note] WSREP: (150a3819, 'tcp://0.0.0.0:4569') turning message relay requesting off
xbcrypt:xb_crypt_read_chunk: wrong chunk magic at offset 0x0.
2017/10/05 13:41:21 socat[11977] E write(1, 0x55b852ddb5d0, 8192): Broken pipe

Just the same command succeeds in 10.1.28 , so while problem really exists in previous version, there is nothing to fix at the moment.
Below is script which downloads and unpacks corresponding tar package, generates templates and manages three local nodes:

set -e
ver=${1:-10.1.28}
 
# just use current directory if called from framework
if [ ! -f common.sh ] ; then
  [ -d mariadb-environs ] || git clone http://github.com/AndriiNikitin/mariadb-environs
  cd mariadb-environs
  ./get_plugin.sh galera
fi
 
function onExit {
  [ "$passed" == yes ] && echo TEST PASSED && exit
  cluster1/tail_log.sh 100
# uncomment if you wish docker build hang on failire (to attach to container and troubleshoot)
#  sleep 10000
}
trap onExit EXIT
 
_template/plant_cluster.sh cluster1
echo m1 > cluster1/nodes.lst
echo m2 >> cluster1/nodes.lst
echo m3 >> cluster1/nodes.lst
cluster1/replant.sh ${ver}
 
./build_or_download.sh m1
 
cluster1/gen_cnf.sh general_log=1
cluster1/install_db.sh
cluster1/configure_ssl.sh
cluster1/galera_setup_acl.sh
cluster1/galera_start_new.sh $WSREP_EXTRA_OPT
 
sleep 5
cluster1/galera_cluster_size.sh
cluster1/sql.sh 'show variables like "wsrep_sst_method"'
 
grep -A10 -B10 -i "\[ERROR\]" m1*/dt/error.log || :
 
cluster_size=$(m1*/sql.sh 'show status like "wsrep_cluster_size"')
[[ "${cluster_size}" =~ 3 ]] || exit 1
 
cluster1/sql.sh status | grep 'SSL: Cipher in use' && passed=yes

To force sst xtrabackup-v2 with encryption I paste command below into shell (use both encrypt=1 and encrypt=3):

export SST_OPT="
[sst]
encrypt=1
encrypt-algo=AES128
encrypt-key=A1EDC73815467C083B0869508406637E
tca=$(pwd)/_depot/ca.pem
tcert=$(pwd)/_depot/ca.pem
tkey=$(pwd)/_depot/ca-key.pem
"
 
WSREP_EXTRA_OPT="wsrep_sst_method=xtrabackup-v2 $SST_OPT" bash example_ssl.sh 10.1.28

To make it work in 10.1.28 I had to address several bugs and patch related scripts in unpacked tar image:

$ diff 10.1.28.orig/bin 10.1.28.fixed/bin
diff 10.1.28.orig/bin/mysqld_safe 10.1.28.fixed/bin/mysqld_safe
249c249
<   local wr_logfile=$(mktemp wsrep_recovery.XXXXXX)
---
>   local wr_logfile=$(mktemp -t wsrep_recovery.XXXXXX)
271c271
<   eval_log_error "$mysqld_cmd --wsrep_recover $wr_options > $wr_logfile"
---
>   eval "$mysqld_cmd --wsrep_recover $wr_options 2> $wr_logfile"
diff 10.1.28.orig/bin/wsrep_sst_common 10.1.28.fixed/bin/wsrep_sst_common
48c48
<         readonly WSREP_SST_OPT_PORT=$(echo $WSREP_SST_OPT_ADDR | \
---
>         WSREP_SST_OPT_PORT=$(echo $WSREP_SST_OPT_ADDR | \
89c89
<         readonly WSREP_SST_OPT_PORT="$2"
---
>         WSREP_SST_OPT_PORT="$2"
diff 10.1.28.orig/bin/wsrep_sst_mysqldump 10.1.28.fixed/bin/wsrep_sst_mysqldump
120c120
< MYSQL="$MYSQL_CLIENT --defaults-extra-file=$WSREP_SST_OPT_CONF "\
---
> MYSQL="$MYSQL_CLIENT $WSREP_SST_OPT_CONF "\
diff 10.1.28.orig/bin/wsrep_sst_xtrabackup-v2 10.1.28.fixed/bin/wsrep_sst_xtrabackup-v2
361,362c361
<                     tcmd="socat -u openssl-listen:${TSST_PORT},reuseaddr,cert=${tpem},key=${tkey},verify=0${sockopt} stdio"
<                 else
---
>             else
457c456
<     cpat=$(parse_cnf sst cpat '.*\.pem$\|.*init\.ok$\|.*galera\.cache$\|.*sst_in_progress$\|.*\.sst$\|.*gvwstate\.dat$\|.*grastate\.dat$\|.*\.err$\|.*\.log$\|.*RPM_UPGRADE_MARKER$\|.*RPM_UPGRADE_HISTORY$')
---
>     cpat=$(parse_cnf sst cpat '.*\.pem$\|.*init\.ok$\|.*galera\.cache$\|.*sst_in_progress$\|.*\.sst$\|.*gvwstate\.dat$\|.*grastate\.dat$\|.*\.err$\|.*\.log$\|.*RPM_UPGRADE_MARKER$\|.*RPM_UPGRADE_HISTORY$\|.*innobackup*log$')
870c869
<         INNOMOVE="${INNOBACKUPEX_BIN} --defaults-file=${WSREP_SST_OPT_CONF} $disver $impts --datadir=${DATA} --move-back --force-non-empty-directories \${DATA} 2>&1 | logger -p daemon.err -t ${ssystag}innobackupex-move "
---
>         INNOMOVE="${INNOBACKUPEX_BIN} ${WSREP_SST_OPT_CONF} $disver $impts --datadir=${DATA} --move-back --force-non-empty-directories \${DATA} 2>&1 | logger -p daemon.err -t ${ssystag}innobackupex-move "
876c875
<     INNOMOVE="${INNOBACKUPEX_BIN} --defaults-file=${WSREP_SST_OPT_CONF}  --defaults-group=mysqld${WSREP_SST_OPT_CONF_SUFFIX} $disver $impts --datadir=${DATA} --move-back --force-non-empty-directories \${DATA} &>\${DATA}/innobackup.move.log"
---
>     INNOMOVE="${INNOBACKUPEX_BIN} ${WSREP_SST_OPT_CONF}  --defaults-group=mysqld${WSREP_SST_OPT_CONF_SUFFIX} $disver $impts --datadir=${DATA} --move-back --force-non-empty-directories \${DATA} &>\${DATA}/innobackup.move.log"

So, since original problem looks solved - I will close this call with resolution 'fixed'.
It is hard work to find exact version/commit where it was fixed - so please let me know if somebody really needs this info and/or use script above to verify particular versions

Generated at Thu Feb 08 07:46:02 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.