[MDEV-14313] Joiner fails to SST when upgraded to 10.1.28 from 10.1.23 Created: 2017-11-07  Updated: 2018-02-21  Resolved: 2018-01-27

Status: Closed
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.1.28
Fix Version/s: 10.1.29

Type: Bug Priority: Major
Reporter: Claudio Nanni Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-15254 10.1.31 does not join an existing clu... Closed

 Description   

A Galera cluster uses encryption for SST with the following configuration:

[sst]
encrypt=3
tkey=/path/to/key.pem
tcert=/path/to/cert.pem
tca=/path/to/ca.pem

Upgrading a node to 10.1.28 (in a Galera cluster of 10.1.23 nodes) fails with:

2017-11-04 18:23:17 140462627223296 [Note] WSREP: (3804918b, 'tcp://0.0.0.0:4567') turning message relay requesting off
xb_stream_read_chunk(): wrong chunk magic at offset 0x0.
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 137 1 (20171104 18:24:55.419)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20171104 18:24:55.421)

One explanation is that the joiner doesn't expect an encrypted stream, pointing at a problem with how the configuration under the [sst] section is read.
After some investigation I narrowed the problem down to the parse_cnf() function of wsrep_sst_xtrabackup-v2, which was moved to wsrep_sst_common.

In 10.1.23 parse_cnf() is in wsrep_sst_xtrabackup-v2 and it's:

parse_cnf()
{
    local group=$1
    local var=$2
    # print the default settings for given group using my_print_default.
    # normalize the variable names specified in cnf file (user can use _ or - for example log-bin or log_bin)
    # then grep for needed variable
    # finally get the variable value (if variables has been specified multiple time use the last value only)
    reval=$($MY_PRINT_DEFAULTS $group | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--$var=" | cut -d= -f2- | tail -1)
    if [[ -z $reval ]];then
        [[ -n $3 ]] && reval=$3
    fi
    echo $reval
}
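
To make the steps described in the function's comments concrete, here is a minimal sketch of the same pipeline run against canned input instead of real my_print_defaults output. The helper name last_value and the sample option lines are assumptions made up for illustration; they are not part of the SST scripts.

```shell
# Hypothetical helper: the parse_cnf() pipeline, reading option lines from
# stdin instead of calling $MY_PRINT_DEFAULTS.
last_value() {
    awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' |
        grep -- "--$1=" | cut -d= -f2- | tail -1
}

# Sample lines in the "--name=value" format (made-up values).
sample='--encrypt=0
--log_bin=mysql-bin
--encrypt=3'

encrypt=$(printf '%s\n' "$sample" | last_value encrypt)   # last occurrence wins
log_bin=$(printf '%s\n' "$sample" | last_value log-bin)   # log_bin normalized to log-bin
missing=$(printf '%s\n' "$sample" | last_value tkey)      # not present: empty string
echo "encrypt=$encrypt log-bin=$log_bin tkey=$missing"
```

With this input the sketch should report 3 for encrypt, mysql-bin for log-bin, and an empty string for tkey. An empty string is what parse_cnf() replaces with its third argument, when one is given.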

In 10.1.28 it's moved to wsrep_sst_common and it's:

parse_cnf()
{
    local group=$1
    local var=$2
    local reval=""
 
    # print the default settings for given group using my_print_default.
    # normalize the variable names specified in cnf file (user can use _ or - for example log-bin or log_bin)
    # then grep for needed variable
    # finally get the variable value (if variables has been specified multiple time use the last value only)
 
    # look in group+suffix
    if [[ -n $WSREP_SST_OPT_CONF_SUFFIX ]]; then
        reval=$($MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF "${group}${WSREP_SST_OPT_CONF_SUFFIX}" | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--$var=" | cut -d= -f2- | tail -1)
    fi
 
    # look in group
    if [[ -z $reval ]]; then
        reval=$($MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF $group | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--$var=" | cut -d= -f2- | tail -1)
    fi
 
    # use default if we haven't found a value
    if [[ -z $reval ]]; then
        [[ -n $3 ]] && reval=$3
    fi
    echo $reval
}
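
The lookup order in the 10.1.28 version (group plus suffix, then the plain group, then the default) can be sketched as follows. The lookup function is a hypothetical stand-in for one my_print_defaults pipeline, and the group names, suffix, and values are assumptions for illustration only.

```shell
# Hypothetical stand-in for a single my_print_defaults lookup; the real script
# pipes "$MY_PRINT_DEFAULTS ... <group>" through awk, grep, cut and tail.
lookup() {
    case "$1/$2" in
        sst/encrypt)       echo 3 ;;
        sst_node2/encrypt) echo 4 ;;
        sst/tkey)          echo /path/to/key.pem ;;
    esac
}

# Same fallback order as the 10.1.28 parse_cnf(): group+suffix, group, default.
parse_cnf_sketch() {
    group=$1 var=$2 default=$3 suffix=$4
    reval=""
    [ -n "$suffix" ] && reval=$(lookup "${group}${suffix}" "$var")
    [ -z "$reval" ] && reval=$(lookup "$group" "$var")
    [ -z "$reval" ] && reval=$default
    echo "$reval"
}

with_suffix=$(parse_cnf_sketch sst encrypt 0 _node2)         # suffixed group wins
no_suffix=$(parse_cnf_sketch sst encrypt 0 "")               # plain group
from_group=$(parse_cnf_sketch sst tkey "" _node2)            # falls back to [sst]
from_default=$(parse_cnf_sketch sst tca /default/ca.pem "")  # falls back to default
```

Note that a lookup that silently returns nothing behaves exactly like an option that was never set, so a failing my_print_defaults call would fall through to the default without any error.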
         

Using wsrep_sst_common and wsrep_sst_xtrabackup-v2 from version 10.1.23 on 10.1.28 fixes the problem and the node successfully joins.



 Comments   
Comment by Miguel Escobedo [ 2017-11-16 ]

We are experiencing a similar problem. We have compressor and decompressor configured under the [sst] section of my.cnf

The problem as we see it is that those settings are being ignored because the code that tries to execute is

my_print_defaults -c sst | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--compressor=" | cut -d= -f2- | tail -1

because the $WSREP_SST_OPT_CONF variable is empty.

If we run the line as

my_print_defaults sst | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--decompressor=" | cut -d= -f2- | tail -1

that is, with the "-c" and the reference to the variable removed, the value is properly extracted.
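
The effect of the empty variable can be reproduced without my_print_defaults. The sketch below uses a hypothetical show_args function that only reports the arguments it receives; the guard in the second half is one possible way to avoid the problem, not necessarily what the actual fix does.

```shell
# Hypothetical stand-in for my_print_defaults: prints argument count and list.
show_args() { printf '%s\n' "$#:$*"; }

WSREP_SST_OPT_CONF=""
group="sst"

# As in the 10.1.28 script: the unquoted empty variable disappears, so "-c"
# is followed directly by the group name and takes it as its value.
broken=$(show_args -c $WSREP_SST_OPT_CONF $group)

# Sketch of a guard (assumption): emit "-c <file>" only when a file is set.
guarded=$(show_args ${WSREP_SST_OPT_CONF:+-c "$WSREP_SST_OPT_CONF"} "$group")

# With a config file set, the guard passes it through unchanged.
WSREP_SST_OPT_CONF="/etc/my.cnf"
with_conf=$(show_args ${WSREP_SST_OPT_CONF:+-c "$WSREP_SST_OPT_CONF"} "$group")

echo "$broken | $guarded | $with_conf"
```

In the broken case the command receives only "-c sst", which matches the command line quoted above: the group name is consumed as the argument of "-c" and no group is left to print.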

Comment by Andrii Nikitin (Inactive) [ 2017-11-28 ]

I do see an error like the one below in 10.1.28,

WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u openssl-listen:4444,reuseaddr,cert=,key=/home/a/env3/mariadb-environs/m2-10.1.28/ssl/client-key.pem,verify=0 stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20171128 10:58:03.902)
2017/11/28 10:58:03 socat[955] E SSL_CTX_use_certificate_file(): error:02001002:system library:fopen:No such file or directory
WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 1 0 (20171128 10:58:03.906)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20171128 10:58:03.907)
2017-11-28 10:58:06 140545364064000 [Note] WSREP: (9f7acfe0, 'ssl://0.0.0.0:4569') turning message relay requesting off
2017-11-28 10:59:06 140545614260992 [Note] WSREP: Prepared SST request: xtrabackup-v2|192.168.88.225:4444/xtrabackup_sst//1
2017-11-28 10:59:06 140545614260992 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-11-28 10:59:06 140545614260992 [Note] WSREP: REPL Protocols: 7 (3, 2)
2017-11-28 10:59:06 140545614260992 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
2017-11-28 10:59:06 140545317926656 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.88.225' --datadir '/home/a/env3/mariadb-environs/m2-10.1.28/dt/'  --defaults-file '/home/a/env3/mariadb-environs/m2-10.1.28/my.cnf'  --parent '613'  '' : 32 (Broken pipe)
2017-11-28 10:59:06 140545317926656 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
2017-11-28 10:59:06 140545614583360 [ERROR] WSREP: SST failed: 32 (Broken pipe)

but 10.1.29 and the current 10.1 source tree succeed with the fix for the unrelated problem mentioned in https://jira.mariadb.org/browse/MDEV-13969?focusedCommentId=101180&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-101180 (one just needs to remove "--defaults-file=" from the SST script at https://github.com/MariaDB/server/blob/40756c9151a564b57f351111d7486b4d18ef5e39/scripts/wsrep_sst_xtrabackup-v2.sh#L868 and a few lines later).

Unless somebody reports that 10.1.29+ doesn't work for them, I will close this with a 'Fixed in 10.1.29' resolution.

Comment by Elena Stepanova [ 2018-01-27 ]

Closed based on the comment above. Please comment if you still experience the problem with 10.1.29+.

Comment by Claudio Nanni [ 2018-02-12 ]

Seen again in 10.1.31: https://jira.mariadb.org/browse/MDEV-15254

Comment by Daniel Black [ 2018-02-12 ]

All of the code here was restructured by me in https://github.com/MariaDB/server/pull/549. I will look at the cause and follow up in MDEV-15254. Apologies for whatever I missed.
