  1. MariaDB Server
  2. MDEV-14256

MariaDB 10.2.10 can't SST with xtrabackup-v2

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 10.2.8
    • 10.2.11
    • Galera SST, wsrep
    • [jg4461@db2 ~]$ cat /etc/redhat-release
      CentOS Linux release 7.4.1708 (Core)
      [jg4461@db2 ~]$ uname -r
      3.10.0-693.5.2.el7.x86_64
    • 10.2.11

    Description

      Following an upgrade to MariaDB-server-10.2.10-1.el7.centos.x86_64, wsrep_sst_xtrabackup-v2 is unable to initiate an SST to join a node to the cluster.

      It fails with the following errors:

      2017-11-01  7:23:33 140136889161472 [Note] WSREP: Flow-control interval: [23, 23]
      2017-11-01  7:23:33 140136889161472 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 4823208097)
      2017-11-01  7:23:33 140136880768768 [Note] WSREP: State transfer required: 
              Group state: c37249ee-cc56-11e3-8839-da7603c8db1b:4823208097
              Local state: c37249ee-cc56-11e3-8839-da7603c8db1b:4822595673
      2017-11-01  7:23:33 140136880768768 [Note] WSREP: New cluster view: global state: c37249ee-cc56-11e3-8839-da7603c8db1b:4823208097, view# 1879: Primary, number of nodes: 2, my index: 0, protocol version 3
      2017-11-01  7:23:33 140136880768768 [Warning] WSREP: Gap in state sequence. Need state transfer.
      2017-11-01  7:23:33 140136880039680 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '137.222.8.66' --datadir '/var/lib/mysql/data/'   --parent '27765'  '' '
      /usr//bin/wsrep_sst_xtrabackup-v2: line 646: WSREP_SST_OPT_PORT: unbound variable
      2017-11-01  7:23:34 140136880039680 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '137.222.8.66' --datadir '/var/lib/mysql/data/'   --parent '27765'  '' 
              Read: '(null)'
      2017-11-01  7:23:34 140136880039680 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '137.222.8.66' --datadir '/var/lib/mysql/data/'   --parent '27765'  '' : 1 (Operation not permitted)
      2017-11-01  7:23:34 140136880768768 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
      2017-11-01  7:23:34 140136880768768 [ERROR] Aborting
      

      The root cause appears to be:

      /usr/bin/wsrep_sst_xtrabackup-v2: line 646: WSREP_SST_OPT_PORT: unbound variable
      

      WSREP_SST_OPT_PORT has no default value set in either wsrep_sst_xtrabackup-v2 or wsrep_sst_common.

      The following diff sets a default value for the WSREP_SST_OPT_PORT variable, and allows the SST to proceed.

      --- wsrep_sst_common    2017-11-02 11:02:09.561266862 +0000
      +++ wsrep_sst_common_modified   2017-11-02 11:02:00.473166368 +0000
      @@ -27,6 +27,7 @@
       WSREP_SST_OPT_PSWD=${WSREP_SST_OPT_PSWD:-}
       WSREP_SST_OPT_DEFAULT=""
       WSREP_SST_OPT_EXTRA_DEFAULT=""
      +WSREP_SST_OPT_PORT=4444
       
       while [ $# -gt 0 ]; do
       case "$1" in
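
The failure can be reproduced outside Galera. The sketch below assumes only that the SST script runs with bash's nounset option (`-u`), which is what the "unbound variable" message indicates. `read_port` is a hypothetical helper for this illustration and is not part of the MariaDB scripts.

```shell
#!/bin/bash
# Hypothetical helper: start a child bash with nounset (-u), run the setup
# code passed in $1, then expand WSREP_SST_OPT_PORT.
# env -u makes sure the variable is not inherited from the caller.
read_port() {
    env -u WSREP_SST_OPT_PORT bash -uc "$1"'; echo "port=$WSREP_SST_OPT_PORT"' 2>&1
}

# No assignment before the expansion: the child aborts, as in the log above.
without_default=$(read_port ':') && status_without=0 || status_without=$?

# With the default from the diff assigned first, the expansion succeeds.
with_default=$(read_port 'WSREP_SST_OPT_PORT=4444') && status_with=0 || status_with=$?

echo "no default:   status=$status_without ($without_default)"
echo "with default: status=$status_with ($with_default)"
```

The first call should report a non-zero status and the second a zero status, which matches the observation that assigning a default before the first use lets the SST proceed.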
      

          Activity

            nielsh Niels Hendriks added a comment -

            We have this issue as well on Debian Jessie with MariaDB 10.2.10.

            johan-severalnines Johan Andersson added a comment -

            Hi,

            Any news / plans for this issue?
            BR
            johan

            anikitin Andrii Nikitin (Inactive) added a comment -

            johan-severalnines did the suggested workaround help you? You can also try removing the 'u' from the first line of the wsrep_sst_xtrabackup-v2 script, so that it is just "#!/bin/bash -e", or putting `set +u` at the start of the SST script.

            johan-severalnines Johan Andersson added a comment -

            Hi,
            Thanks, but I would rather have this fixed and released than resort to workarounds.
            In any case I changed to use mariabackup instead, which might have been the intention with this.

            BR
            johan

            kolbe Kolbe Kegel (Inactive) added a comment - - edited

            Here's another good workaround that doesn't require editing any files included in the distribution:

            mkdir -p /etc/systemd/system/mariadb.service.d/
            printf '[Service]\nEnvironment="WSREP_SST_OPT_PORT=4444"\n' > /etc/systemd/system/mariadb.service.d/MDEV-14256.conf
            systemctl daemon-reload
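
The drop-in presumably works because a variable set in the service environment is inherited by the SST script, so the name is already bound when bash checks it. A minimal sketch of that mechanism, run by hand rather than through systemd; the inline `bash -uc` command is only a stand-in for the SST script.

```shell
#!/bin/bash
# Stand-in for the SST script: a nounset bash that expands the variable.
sst_stub='echo "SST_PORT=$WSREP_SST_OPT_PORT"'

# Supply the variable through the environment, as the drop-in does,
# instead of assigning it inside the script.
from_env=$(WSREP_SST_OPT_PORT=4444 bash -uc "$sst_stub")
echo "$from_env"
```

On a real node, `systemctl show mariadb.service -p Environment` can be used to check that the drop-in was picked up.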
            

            erichowey Eric Howey added a comment -

            Can we take a moment to just reflect on how absurd it is that this bug was released to production? A simple test case, namely invoking an SST with xtrabackup-v2, would have caught this issue. People like me are facing production outages due to this issue. A lot of faith has just been lost in the MariaDB product.

            nielsh Niels Hendriks added a comment -

            Yeah, I agree with Eric that the tests for MariaDB 10.2 could really use some improvements. I expect some bugs in the first tagged stable release of a new major version (10.2.6), but the following releases have also had bugs that can make it unusable in certain use cases.

            This has also been mentioned by "DEZILLIUM LIMITED" in the description at https://jira.mariadb.org/browse/MDEV-14255 :
            Quote:

            MariaDB is broken (again) on Debian.
            10.2.6: broken libmariadb3
            10.2.7: unreleased for Debian
            10.2.8: broken libmariadb3
            10.2.9: working
            10.2.10: broken sst

            At least for Debian 8 and 9 this means we had exactly 1 working MariaDB 10.2 stable release, which was 10.2.9. And now, it's broken again.

            With 10.1 we never had big issues like this, and we still don't. I get that no one puts in these bugs on purpose and I appreciate the effort put in by the dev team, but it would be really reassuring to have some feedback from the dev team regarding the prevention of these issues. Are there any plans to improve the tests? Does the release of MariaDB 10.2 not feel messier than 10.1 to you?

            As a side note, since everyone running MariaDB 10.2/Galera with xtrabackup-v2 will break their installation when they upgrade to the latest version, is this not hotfix-worthy?

            jgazeley Jonathan Gazeley added a comment -

            This will indeed break it for everyone using xtrabackup-v2, but they might not realise it until it is too late. The upgrade procedure itself might only trigger an IST, and they would not realise that SST is broken until they reboot a node later on.

            All but one of my production nodes broke upon upgrade (via nightly cron job). The only reason one node survived is that it had a broken yum config which prevented new packages from being updated. I was able to run my infrastructure on one node for 24 hours until I found the problem, but this could easily have caused a production outage for me.

            anikitin Andrii Nikitin (Inactive) added a comment -

            The bug affects systems that don't have an explicit port specified in the wsrep configuration.
            To demonstrate the problem with Docker containers, the following scripts may be used:

            tearup - creates a Docker image with the 10.2 server and xtrabackup 2.4 installed
            test - creates two containers from the image built in the tearup step and tries to set up a cluster between them

            When this line is uncommented, the fix is applied and the problem no longer happens.

            anikitin Andrii Nikitin (Inactive) added a comment -

            sachin.setiya.007 please review the following patch to address the problem in 10.2

            diff --git a/scripts/wsrep_sst_xtrabackup-v2.sh b/scripts/wsrep_sst_xtrabackup-v2.sh
            index 40e686d4d6b..f6bda7db499 100644
            --- a/scripts/wsrep_sst_xtrabackup-v2.sh
            +++ b/scripts/wsrep_sst_xtrabackup-v2.sh
            @@ -638,12 +638,12 @@ kill_xtrabackup()
             setup_ports()
             {
                 if [[ "$WSREP_SST_OPT_ROLE"  == "donor" ]];then
            -        SST_PORT=$WSREP_SST_OPT_PORT
            +        SST_PORT=${WSREP_SST_OPT_PORT:-}
                     REMOTEIP=$WSREP_SST_OPT_HOST
                     lsn=$(echo $WSREP_SST_OPT_PATH | awk -F '[/]' '{ print $2 }')
                     sst_ver=$(echo $WSREP_SST_OPT_PATH | awk -F '[/]' '{ print $3 }')
                 else
            -        SST_PORT=$WSREP_SST_OPT_PORT
            +        SST_PORT=${WSREP_SST_OPT_PORT:-}
                 fi
             }

            serg Sergei Golubchik added a comment -

            anikitin what change introduced this bug? How come it didn't fail before?

            anikitin Andrii Nikitin (Inactive) added a comment -

            serg it looks like it was merged in this commit; before that, the script parsed the address expression directly:
            https://github.com/MariaDB/server/commit/83664e21e4fb6755c8c0c90d3dee8819d36928c9#diff-cca56af3f0ce3e7f4fbc13dc62cc2823R640

            serg Sergei Golubchik added a comment -

            I'm not sure that was it. The old code used

                 '--address')
                     readonly WSREP_SST_OPT_ADDR="$2"
            ...
                    SST_PORT=$(echo ${WSREP_SST_OPT_ADDR} | awk -F ':' '{ print $2 }')

            That would have thrown an error if --address was not used. The new code does

                 '--address')
                    readonly WSREP_SST_OPT_ADDR="$2"
            ...
                    readonly WSREP_SST_OPT_PORT=$(echo $WSREP_SST_OPT_ADDR | \
                            cut -d ']' -f 2 | cut -s -d ':' -f 2 | cut -d '/' -f 1)
            ...
                    SST_PORT=$WSREP_SST_OPT_PORT

            Assuming that --address is used (because the old code didn't fail), I don't see how WSREP_SST_OPT_PORT could be unset. There was a later relevant commit (4c2c057d404), but I don't see how it could have left WSREP_SST_OPT_PORT unset either.
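
To make the comparison concrete, the two extraction pipelines quoted above can be run in isolation. This is a sketch only: the wrapper functions `new_port` and `old_port` and the sample addresses other than the one from the log are assumptions for illustration.

```shell
#!/bin/bash
# Port extraction as quoted from the new code: drop an IPv6 "[...]" prefix,
# take the field after the first ':' and cut off any trailing "/path".
new_port() {
    echo "$1" | cut -d ']' -f 2 | cut -s -d ':' -f 2 | cut -d '/' -f 1
}

# Port extraction as quoted from the old code.
old_port() {
    echo "$1" | awk -F ':' '{ print $2 }'
}

# The first address is the one from the log; the others are made-up samples.
for addr in '137.222.8.66' '137.222.8.66:4444' '[::1]:4444/xtrabackup_sst'; do
    printf '%-28s new=[%s] old=[%s]\n' "$addr" "$(new_port "$addr")" "$(old_port "$addr")"
done
```

For the bare address from the log, both pipelines produce an empty string. A variable assigned from that result is empty but still set, so it would not trigger the nounset error by itself.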
            anikitin Andrii Nikitin (Inactive) added a comment - - edited

            Yes, correct. Then, after the fix for MDEV-13968, the variable became WSREP_SST_OPT_ADDR_PORT, which was initialized in --address, and WSREP_SST_OPT_PORT was left unset.
            Actually, the patch from the duplicate MDEV-14299 is probably better than the one I suggested above.
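
The mismatch described here can be sketched as follows. This is an illustration only, not the committed patch and not the patch from MDEV-14299: `resolve_port`, the fallback order, and the sample port 5555 are assumptions, and 4444 is simply the default used in the reporter's diff.

```shell
#!/bin/bash
set -u

# After the rename, option parsing fills only the new name.
# 5555 is an arbitrary sample value for this sketch.
WSREP_SST_OPT_ADDR_PORT=5555
unset WSREP_SST_OPT_PORT   # the old name is never assigned

# Hypothetical use-site guard: prefer the old name, fall back to the new
# one, then to 4444. ${VAR:-...} does not trip nounset when VAR is unset.
resolve_port() {
    echo "${WSREP_SST_OPT_PORT:-${WSREP_SST_OPT_ADDR_PORT:-4444}}"
}

SST_PORT=$(resolve_port)
echo "SST_PORT=$SST_PORT"
```

A plain `$WSREP_SST_OPT_PORT` in place of the guarded expansion would abort the script at this point, which is the error reported in this issue.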

            serg Sergei Golubchik added a comment -

            Committed a patch

            anikitin Andrii Nikitin (Inactive) added a comment -

            serg The patch is good. I've verified it by patching 10.2.10 in the Docker image from the earlier test case, like this: https://github.com/AndriiNikitin/bugs/blob/master/MDEV-14256-test1.sh#L24

            People

              serg Sergei Golubchik
              jgazeley Jonathan Gazeley
              Votes: 11
              Watchers: 21
