MariaDB Server / MDEV-14313

Joiner fails to SST when upgraded to 10.1.28 from 10.1.23

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.1.28
    • Fix Version/s: 10.1.29
    • Component/s: Galera SST
    • Labels: None

    Description

      A Galera cluster uses encryption for SST with the following configuration:

      [sst]
      encrypt=3
      tkey=/path/to/key.pem
      tcert=/path/to/cert.pem
      tca=/path/to/ca.pem
      

      After upgrading a node to 10.1.28 (in a Galera cluster of 10.1.23 nodes), the SST to that node fails with:

      2017-11-04 18:23:17 140462627223296 [Note] WSREP: (3804918b, 'tcp://0.0.0.0:4567') turning message relay requesting off
      xb_stream_read_chunk(): wrong chunk magic at offset 0x0.
      WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 137 1 (20171104 18:24:55.419)
      WSREP_SST: [ERROR] Cleanup after exit with status:32 (20171104 18:24:55.421)
      

      One explanation is that the joiner doesn't expect an encrypted stream, which points to a problem with reading the configuration under the [sst] section.
      After some investigation I narrowed the problem down to the parse_cnf() function of wsrep_sst_xtrabackup-v2, which was moved to wsrep_sst_common.

      In 10.1.23 parse_cnf() is in wsrep_sst_xtrabackup-v2 and it's:

      parse_cnf()
      {
          local group=$1
          local var=$2
          # print the default settings for given group using my_print_default.
          # normalize the variable names specified in cnf file (user can use _ or - for example log-bin or log_bin)
          # then grep for needed variable
          # finally get the variable value (if variables has been specified multiple time use the last value only)
          reval=$($MY_PRINT_DEFAULTS $group | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--$var=" | cut -d= -f2- | tail -1)
          if [[ -z $reval ]];then
              [[ -n $3 ]] && reval=$3
          fi
          echo $reval
      }
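
To make the lookup pipeline above easier to follow, here is a minimal sketch that runs the same awk/grep/cut/tail chain against canned input. The function names fake_print_defaults and lookup are hypothetical stand-ins (not part of the SST scripts), and the sample option values are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `my_print_defaults sst`: prints one
# "--name=value" option per line. The values are made up for this demo.
fake_print_defaults() {
    printf '%s\n' '--encrypt=2' '--log_bin=mysql-bin' '--encrypt=3'
}

# Same pipeline as parse_cnf(): normalize "_" to "-" in option names,
# select the requested option, strip the name, keep the last occurrence.
lookup() {
    local var=$1
    fake_print_defaults |
        awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' |
        grep -- "--$var=" | cut -d= -f2- | tail -1
}

lookup encrypt    # the option appears twice; the last value is used
lookup log-bin    # written as log_bin in the input, found as log-bin
```

When an option is missing, the pipeline prints nothing, which is why parse_cnf() falls back to its third argument as the default.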
      

      In 10.1.28 it's moved to wsrep_sst_common and it's:

      parse_cnf()
      {
          local group=$1
          local var=$2
          local reval=""
       
          # print the default settings for given group using my_print_default.
          # normalize the variable names specified in cnf file (user can use _ or - for example log-bin or log_bin)
          # then grep for needed variable
          # finally get the variable value (if variables has been specified multiple time use the last value only)
       
          # look in group+suffix
          if [[ -n $WSREP_SST_OPT_CONF_SUFFIX ]]; then
              reval=$($MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF "${group}${WSREP_SST_OPT_CONF_SUFFIX}" | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--$var=" | cut -d= -f2- | tail -1)
          fi
       
          # look in group
          if [[ -z $reval ]]; then
              reval=$($MY_PRINT_DEFAULTS -c $WSREP_SST_OPT_CONF $group | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--$var=" | cut -d= -f2- | tail -1)
          fi
       
          # use default if we haven't found a value
          if [[ -z $reval ]]; then
              [[ -n $3 ]] && reval=$3
          fi
          echo $reval
      }
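
One way the new version can go wrong is sketched below, assuming that -c takes the next word as its file argument and that $WSREP_SST_OPT_CONF is empty when parse_cnf() runs. The show_args function is a hypothetical stub that only reports how its arguments were split; it does not reproduce the real my_print_defaults.

```shell
#!/usr/bin/env bash
# Hypothetical stub mimicking a "-c <file> <group>..." calling convention.
# It only reports which word ended up as the file and which as groups.
show_args() {
    local conf=""
    if [[ $1 == -c ]]; then
        conf=$2
        shift 2
    fi
    echo "conf='${conf}' groups='$*'"
}

group=sst

# Empty and unquoted: the expansion disappears, so the group name is
# consumed as the argument of -c and no group is left to look up.
WSREP_SST_OPT_CONF=""
show_args -c $WSREP_SST_OPT_CONF $group

# Non-empty: the file and the group land where parse_cnf() expects them.
WSREP_SST_OPT_CONF=/etc/my.cnf
show_args -c $WSREP_SST_OPT_CONF $group
```

In the first call the lookup has no group to search, so every [sst] option would come back empty and the defaults would be used instead.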
               
      

      Using wsrep_sst_common and wsrep_sst_xtrabackup-v2 from version 10.1.23 on 10.1.28 fixes the problem and the node successfully joins.

          Activity

            mescobedo@labattfood.com Miguel Escobedo added a comment - edited

            We are experiencing a similar problem. We have compressor and decompressor configured under the [sst] section of my.cnf.

            The problem, as we see it, is that those settings are ignored because the command that actually gets executed is

            my_print_defaults -c sst | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--compressor=" | cut -d= -f2- | tail -1
            

            because the $WSREP_SST_OPT_CONF variable is empty.

            If we run the line as

            my_print_defaults sst | awk -F= '{if ($1 ~ /_/) { gsub(/_/,"-",$1); print $1"="$2 } else { print $0 }}' | grep -- "--decompressor=" | cut -d= -f2- | tail -1
            

            that is, with the "-c" and the reference to the variable removed, the value is extracted properly.
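
A defensive variant of this idea, sketched here only as an illustration and not as the change that shipped in 10.1.29, would pass -c only when a config file is actually set. The function name build_defaults_args is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the argument list that would be handed to
# my_print_defaults. "-c <file>" is added only when a file is given.
build_defaults_args() {
    local conf=$1 group=$2
    local -a args=()
    if [[ -n $conf ]]; then
        args+=( -c "$conf" )
    fi
    args+=( "$group" )
    printf '%s\n' "${args[*]}"
}

build_defaults_args "" sst             # no config file: group only
build_defaults_args /etc/my.cnf sst    # config file set: -c plus group
```

Building the arguments in an array keeps an empty variable from silently shifting the group name into the position of the file name.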

            anikitin Andrii Nikitin (Inactive) added a comment -

            I do see an error like the one below in 10.1.28:

            WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u openssl-listen:4444,reuseaddr,cert=,key=/home/a/env3/mariadb-environs/m2-10.1.28/ssl/client-key.pem,verify=0 stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20171128 10:58:03.902)
            2017/11/28 10:58:03 socat[955] E SSL_CTX_use_certificate_file(): error:02001002:system library:fopen:No such file or directory
            WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 1 0 (20171128 10:58:03.906)
            WSREP_SST: [ERROR] Cleanup after exit with status:32 (20171128 10:58:03.907)
            2017-11-28 10:58:06 140545364064000 [Note] WSREP: (9f7acfe0, 'ssl://0.0.0.0:4569') turning message relay requesting off
            2017-11-28 10:59:06 140545614260992 [Note] WSREP: Prepared SST request: xtrabackup-v2|192.168.88.225:4444/xtrabackup_sst//1
            2017-11-28 10:59:06 140545614260992 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            2017-11-28 10:59:06 140545614260992 [Note] WSREP: REPL Protocols: 7 (3, 2)
            2017-11-28 10:59:06 140545614260992 [Note] WSREP: Assign initial position for certification: 0, protocol version: 3
            2017-11-28 10:59:06 140545317926656 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.88.225' --datadir '/home/a/env3/mariadb-environs/m2-10.1.28/dt/'  --defaults-file '/home/a/env3/mariadb-environs/m2-10.1.28/my.cnf'  --parent '613'  '' : 32 (Broken pipe)
            2017-11-28 10:59:06 140545317926656 [ERROR] WSREP: Failed to read uuid:seqno and wsrep_gtid_domain_id from joiner script.
            2017-11-28 10:59:06 140545614583360 [ERROR] WSREP: SST failed: 32 (Broken pipe)
            

            but 10.1.29 and the current 10.1 source tree succeed, thanks to the fix for an unrelated problem mentioned in https://jira.mariadb.org/browse/MDEV-13969?focusedCommentId=101180&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-101180 (one just needs to remove "--defaults-file=" from the SST script at https://github.com/MariaDB/server/blob/40756c9151a564b57f351111d7486b4d18ef5e39/scripts/wsrep_sst_xtrabackup-v2.sh#L868 and a few lines later).

            Unless somebody can claim that 10.1.29+ doesn't work for them, I will close this with a 'Fixed in 10.1.29' resolution.
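
The socat command in the log above was started with an empty cert= value. As a rough sketch of how such a case could be caught before the listener starts, the check below rejects empty or unreadable certificate and key paths. The function name check_ssl_opts and its messages are hypothetical and are not taken from the SST scripts.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: fails when the cert or key option is
# empty, or when either file cannot be read.
check_ssl_opts() {
    local cert=$1 key=$2
    if [[ -z $cert || -z $key ]]; then
        echo "empty cert or key option" >&2
        return 1
    fi
    if [[ ! -r $cert || ! -r $key ]]; then
        echo "cert or key file not readable" >&2
        return 1
    fi
    return 0
}

# Mirrors the situation in the log: the cert value is empty.
check_ssl_opts "" "/path/to/key.pem" || echo "refusing to start socat"
```

Failing early with a clear message would point at the configuration lookup rather than at the later fopen error from socat.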

            elenst Elena Stepanova added a comment -

            Closed based on the comment above. Please comment if you still experience the problem with 10.1.29+.

            claudio.nanni Claudio Nanni added a comment - Seen again in 10.1.31: https://jira.mariadb.org/browse/MDEV-15254
            danblack Daniel Black added a comment -

            All of the code here was restructured by me in https://github.com/MariaDB/server/pull/549. Will look at cause and follow up in MDEV-15254. Apologies for whatever I missed.


            People

              Assignee: Unassigned
              Reporter: claudio.nanni Claudio Nanni