MariaDB Server / MDEV-15254

10.1.31 does not join an existing cluster with SST xtrabackup-v2

Details

    Description

      10.1.31 does not join any Galera cluster with SST xtrabackup-v2.

      To reproduce: start a 10.1.31 node and wait for it to connect to a cluster; it won't, timing out at:

      WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:12209,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180208 15:37:38.407)
      
      

      The Donor does not receive the SST request.

      Workaround: use wsrep_sst_common and wsrep_sst_xtrabackup-v2 from versions < 10.1.28.

      Scripts from 10.1.28/29 can't be used as they have other bugs.

      Attachments

        1. server1.cnf
          1 kB
        2. server2.cnf
          1 kB
        3. wsrep_sst_common
          5 kB
        4. wsrep_sst_xtrabackup-v2
          31 kB
        5. wsrep_sst_xtrabackup-v2.sh
          38 kB


          Activity

            Craig Craig Bailey added a comment -

            Seeing the exact same issue on multiple clusters


            winstone Zdravelina Sokolovska (Inactive) added a comment -

            A node with 10.1.31 from the MariaDB 10.1 repo and percona-xtrabackup.x86_64 0:2.3.6-1 from the Percona repo joined a Galera cluster running the same MariaDB and percona-xtrabackup versions; checked with the attached server cnf files on CentOS 7.4.

            logan Logan V added a comment -

            Seeing the same issue on Ubuntu 16.04, package versions as follows:

            # dpkg -l | grep -E '(mariadb|xtrabackup)'
            ii  libmariadbclient-dev        10.1.31+maria-1~xenial                     amd64        MariaDB database development files
            ii  libmariadbclient18          10.1.31+maria-1~xenial                     amd64        MariaDB database client library
            ii  mariadb-client              10.1.31+maria-1~xenial                     all          MariaDB database client (metapackage depending on the latest version)
            ii  mariadb-client-10.1         10.1.31+maria-1~xenial                     amd64        MariaDB database client binaries
            ii  mariadb-client-core-10.1    10.1.31+maria-1~xenial                     amd64        MariaDB database core client binaries
            ii  mariadb-common              10.1.31+maria-1~xenial                     all          MariaDB database common files (e.g. /etc/mysql/conf.d/mariadb.cnf)
            ii  mariadb-server-10.1         10.1.31+maria-1~xenial                     amd64        MariaDB database server binaries
            ii  mariadb-server-core-10.1    10.1.31+maria-1~xenial                     amd64        MariaDB database core server files
            ii  percona-xtrabackup-24       2.4.5-1.xenial                             amd64        Open source backup tool for InnoDB and XtraDB
            

            claudio.nanni Claudio Nanni added a comment -

            The problem is just with wsrep_sst_common and wsrep_sst_xtrabackup-v2.
            You can use the scripts from 10.1.26 which will work with 10.1.31.
            For convenience, I'm attaching the two scripts (wsrep_sst_common and wsrep_sst_xtrabackup-v2) from mariadb-10.1.26-linux-x86_64.tar.gz.

            winstone did you trigger SST?

            Please also see this: https://jira.mariadb.org/browse/MDEV-14313
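
            (For anyone applying this workaround, a minimal sketch of swapping the scripts in, assuming they are installed under /usr/bin, which varies by distribution; back up the originals first. The paths below are illustrative, not authoritative.)

            # Sketch only: extract the two scripts from the 10.1.26 tarball
            # and swap them in place of the installed ones.
            tar xzf mariadb-10.1.26-linux-x86_64.tar.gz \
                mariadb-10.1.26-linux-x86_64/bin/wsrep_sst_common \
                mariadb-10.1.26-linux-x86_64/bin/wsrep_sst_xtrabackup-v2
            cp -a /usr/bin/wsrep_sst_common /usr/bin/wsrep_sst_common.bak      # assumed install path
            cp -a /usr/bin/wsrep_sst_xtrabackup-v2 /usr/bin/wsrep_sst_xtrabackup-v2.bak
            install -m 0755 mariadb-10.1.26-linux-x86_64/bin/wsrep_sst_common /usr/bin/
            install -m 0755 mariadb-10.1.26-linux-x86_64/bin/wsrep_sst_xtrabackup-v2 /usr/bin/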

            evrardjp Jean-Philippe Evrard added a comment - - edited

            You can also use the wsrep_sst_common and wsrep_sst_xtrabackup-v2 scripts from 10.1.30 with 10.1.31's packages, and they work.

            The problem arises purely in the scripts.
            I suspect changes in the listen addresses. By default we have a list of addresses for the cluster, and the SST should connect to the first node, which is localhost in our case, and should reject it. We see the attempt to connect to the donor, but nothing is received from the donor, which makes me think of a networking issue. On top of that, using nc/tar instead of the defaults doesn't seem to fix the issue.


            winstone Zdravelina Sokolovska (Inactive) added a comment - - edited

            Hello,
            As mentioned above, the node loaded with v10.1.31 succeeded in joining an existing cluster with SST xtrabackup-v2;
            used xtrabackup package version: percona-xtrabackup.x86_64 0:2.3.6.
            As there were no server configuration files attached, I attached the ones used.
            Note: after adding the configuration line wsrep_sst_receive_address=self_address:4444, the node joined successfully,
            for example wsrep_sst_receive_address=192.168.0.100:4444.
            Note 2: on the next version, 10.2, it was found that there is no need to set wsrep_sst_receive_address.
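
            (For reference, a minimal sketch of where that setting lives, using the example address above; the [mysqld] section is an assumption, any server section the wsrep options are read from works.)

            [mysqld]
            # Joiner's own address for receiving the SST stream (example value)
            wsrep_sst_receive_address=192.168.0.100:4444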

            evrardjp Jean-Philippe Evrard added a comment - - edited

            log: http://paste.ubuntu.com/=BRFvBdFB5x/
            conf of clustering: http://paste.ubuntu.com/=fQkX6JVpnN/

            Adding wsrep_sst_receive_address=10.1.0.4:4444 doesn't change things.
            Adding wsrep_sst_receive_address=10.1.0.4:1234 goes further: http://paste.ubuntu.com/=FdC5z7nJ56/

            Rolling back to the 10.1.30 wsrep_sst_common and wsrep_sst_xtrabackup-v2 scripts, with no wsrep_sst_receive_address:
            http://paste.ubuntu.com/=cv5YQFHZr2/

            claudio.nanni Claudio Nanni added a comment - - edited

            winstone, please let's not add irrelevant information: wsrep_sst_receive_address does not have anything to do with this issue.

            Jean-Philippe,
            when you used port 1234 it didn't go any further; look closely:

            WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180212 13:57:40.929)
            WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180212 13:59:20.933)
            WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180212 13:59:20.935)
            2018-02-12 13:59:20 140062985549568 [ERROR] WSREP: Failed to read 'ready <addr>'
            2018-02-12 13:59:20 140062985549568 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.0.4:4444' --datadir '/var/lib/mysql/'   --parent '28980' --binlog '/var/lib/mysql/mariadb-bin' : 32 (Broken pipe)
            2018-02-12 13:59:20 140063206394624 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
            2018-02-12 13:59:20 140063206394624 [ERROR] Aborting
            
            

            WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:1234,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180212 14:03:08.907)
            WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180212 14:04:48.912)
            WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180212 14:04:48.915)
            2018-02-12 14:04:48 140161341978368 [ERROR] WSREP: Failed to read 'ready <addr>
            2018-02-12 14:04:48 140161341978368 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.0.4:1234' --datadir '/var/lib/mysql/'   --parent '29507' --binlog '/var/lib/mysql/mariadb-bin' : 32 (Broken pipe)
            2018-02-12 14:04:48 140161562221312 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
            2018-02-12 14:04:48 140161562221312 [ERROR] Aborting
            

            The two extra lines that you saw:

            2018-02-12 14:03:09 140161388115712 [Note] WSREP: (72d00ba6, 'tcp://0.0.0.0:4567') connection to peer 72d00ba6 with addr tcp://10.1.0.4:4567 timed out, no messages seen in PT3S
            2018-02-12 14:03:09 140161388115712 [Note] WSREP: (72d00ba6, 'tcp://0.0.0.0:4567') turning message relay requesting off
            

            are regular messages you see when you specify the node's own IP in the cluster address; if you check carefully, it's node `72d00ba6` timing out connecting to `72d00ba6`.

            Again, the problem is in the wsrep_sst_common, wsrep_sst_xtrabackup-v2 scripts.

            winstone can you share the error log from both Donor and Joiner and innobackup.backup.log from the Donor after the joiner successfully joins with SST?


            evrardjp Jean-Philippe Evrard added a comment -

            Sorry, wrong wording on my side. Yes, it's failing, and the only way to get something to work is to revert to the 10.1.30 scripts. That's what I meant.


            winstone Zdravelina Sokolovska (Inactive) added a comment -

            Issue confirmed on MariaDB 10.1.31 / CentOS 7.4: the node fails to join the cluster with wsrep_sst_method=xtrabackup-v2.

            A workaround for 10.1.31:
            1. Install the same percona-xtrabackup version as on the other cluster nodes (I used percona-xtrabackup.x86_64 0:2.3.6).
            2. On the Joiner node, comment out the row in the server config: #wsrep_sst_method=xtrabackup-v2
            3. Join the node: service mysql start --wsrep_cluster_address=gcomm://192.168.104.195
               Starting mysql (via systemctl): [ OK ]
            4. Restore the row in the server config: wsrep_sst_method=xtrabackup-v2
            5. Restart the mysql service on the Joiner node: service mysql restart
               Restarting mysql (via systemctl): [ OK ]
            6. Issue any DML event on any node and verify that the Joiner DB is updated.
            Note: this workaround only takes effect in a fairly quiescent cluster, e.g. between steps 3 and 5 the cluster state must not change.


            nielsh Niels Hendriks added a comment -

            Hello,

            This also affects MariaDB 10.2.13 on Debian 8 amd64.
            I just spent hours debugging this issue, very frustrated that this is broken again after the other issues described here: https://jira.mariadb.org/browse/MDEV-14256

            The workaround-scripts provided by Claudio fix the issue.

            Is a (regression) test for SST really that hard?

            claudio.nanni Claudio Nanni added a comment -

            Winstone,
            That's not a workaround; that's disabling xtrabackup-v2 SST, using rsync for the first join, and then re-enabling xtrabackup-v2 so it will fail at the first opportunity.
            The easy and proven workaround is to use the two working scripts from previous versions.

            planetbeing David Wang added a comment - - edited

            I spent many hours debugging this issue. I thought I was doing something wrong and I was trying to google for solutions. I only discovered this ticket after I figured out what was wrong and was going to submit a bug. Very frustrating.

            The problem (in my setup) is the change in the implementation of wait_for_listen in wsrep_sst_xtrabackup-v2.sh. The new script uses lsof, which always exits with an error code if it can't find all the items it was asked to list, and because the script sets the -e option in its shebang line (#!/bin/bash -ue), it aborts right after running lsof if even a single item among those listed in its arguments is missing. This happens even if socat is running and listening, because lsof can't find nc. The loop in wait_for_listen therefore always quits after one iteration without writing the "ready" line to signal the parent.

            I'm not sure what the point of changing this function is, since it worked in the past, and I find it hard to imagine that the new code would pass even the most cursory of tests.
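
            (A minimal standalone sketch of the failure mode described above, not the shipped script; the port and the background socat are illustrative assumptions.)

            #!/bin/bash -ue
            # Repro sketch: socat is listening on the port, nc is not running.
            PORT=4444
            socat -u TCP-LISTEN:${PORT},reuseaddr stdio > /dev/null &
            sleep 1
            for i in {1..300}
            do
                # lsof exits non-zero because no process named `nc` matches,
                # and with `set -e` that aborts the whole script on the first
                # iteration, even though socat is already listening.
                LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c)
                [ -n "${LSOF_OUT}" ] && break
                sleep 0.2
            done
            echo "ready"   # never reached: the script dies inside the loop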

            danblack Daniel Black added a comment -

            planetbeing, thank you so much for the detailed diagnosis. I've unit tested the lsof call and can confirm exactly what you say. As such, the patch below should be sufficient. I'm sorry, I've run out of time to test this today.

            diff --git a/scripts/wsrep_sst_xtrabackup-v2.sh b/scripts/wsrep_sst_xtrabackup-v2.sh
            index 64dd182e2f2..39c75aaaf9e 100644
            --- a/scripts/wsrep_sst_xtrabackup-v2.sh
            +++ b/scripts/wsrep_sst_xtrabackup-v2.sh
            @@ -644,7 +644,7 @@ wait_for_listen()
             
                 for i in {1..300}
                 do
            -        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c)
            +        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c || :)
                     [ -n "${LSOF_OUT}" ] && break
                     sleep 0.2
                 done
            

            I apologise for not testing this change at a macro level. I'm not sure why I didn't pick it up/test it enough. There is a galera.galera_sst_xtrabackup-v2 test. Maybe sachin.setiya.007 can see why it wasn't picked up in buildbot.

            The need for the change to lsof was to support FreeBSD, as ss was unavailable there, and lsof seems to behave the same way on Linux and FreeBSD.

            Sorry again to everyone affected by this change, I humbly apologise, and hopefully this one line patch will correct this.
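
            (For readers unfamiliar with the idiom, a standalone illustration of why the appended || : neutralizes set -e; a demo, not project code.)

            #!/bin/bash -ue
            # `:` is the shell's no-op builtin and always succeeds, so
            # `cmd || :` has exit status 0 even when cmd fails, which keeps
            # `set -e` from aborting the script; the captured output is empty.
            OUT=$(false || :)
            echo "still running, OUT='${OUT}'"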


            evrardjp Jean-Philippe Evrard added a comment -

            Ahah, I am glad I wasn't wrong about the LISTEN.

            I've tested it locally, and it seems to work.
            I've noticed that this in the log doesn't disappear with the patch above:

            lsof: WARNING: can't stat() ext4 file system /root/src
                  Output information may be incomplete.
            lsof: WARNING: can't stat() ext4 file system /root/.cache/pip
                  Output information may be incomplete.
            

            (See also full log: http://paste.ubuntu.com/p/FDYwTv83ph/) This is with the patch above applied.

            So while it fixes the issue, there might be other issues that slipped through the cracks.

            Please note: we have tests for functional clustering, using Ansible, in our openstack-ansible-galera_server role, if someone wants to collaborate on improving test coverage.

            planetbeing David Wang added a comment - - edited

            @danblack Thanks for addressing the issue! My apologies for my grumpiness when initially reporting my analysis.


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi danblack!

            This issue was not caught on buildbot because the SST test galera.galera_sst_xtrabackup-v2 is a --big test, which is not run by default on buildbot.
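
            (For reference, a sketch of running that test locally; it assumes a built source tree and a Galera provider path, both of which vary per system.)

            # --big-test enables tests marked as big, which buildbot skips by
            # default; WSREP_PROVIDER must point at the local galera library.
            cd mysql-test
            WSREP_PROVIDER=/usr/lib/galera/libgalera_smm.so \
                ./mysql-test-run.pl --big-test galera.galera_sst_xtrabackup-v2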

            danblack Daniel Black added a comment - - edited

            evrardjp, planetbeing, thanks for testing. stderr goes to /dev/null there too, as only stdout is needed.

            diff --git a/scripts/wsrep_sst_xtrabackup-v2.sh b/scripts/wsrep_sst_xtrabackup-v2.sh
            index 64dd182e2f2..9104daf19bc 100644
            --- a/scripts/wsrep_sst_xtrabackup-v2.sh
            +++ b/scripts/wsrep_sst_xtrabackup-v2.sh
            @@ -644,7 +644,7 @@ wait_for_listen()
             
                 for i in {1..300}
                 do
            -        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c)
            +        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c 2> /dev/null || :)
                     [ -n "${LSOF_OUT}" ] && break
                     sleep 0.2
                 done
            

            planetbeing, no worries about the grumpiness. It was well deserved. Not enough testing, mine or automated, was done for this critical functional area of Galera.
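
            (To illustrate why the added redirect helps: $(...) captures only stdout, so lsof's stat() warnings on stderr would otherwise leak into the server error log. A one-liner mirroring the patched line:)

            # Warnings on stderr are discarded; only stdout is captured, and
            # `|| :` keeps a non-zero lsof exit from tripping `set -e`.
            LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:4444 -a -c nc -c socat -F c 2> /dev/null || :)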


            hydrapolic Tomáš Mózes added a comment -

            Thank you @danblack, the patch works for 10.1.31. Will it make it into the next release?

            danblack Daniel Black added a comment -

            sachin.setiya.007 I'm assuming as a blocker bug you're taking care of pushing this?


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi danblack,

            Yes I will push it.

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            http://lists.askmonty.org/pipermail/commits/2018-February/012037.html

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Fixed wsrep_sst_xtrabackup-v2.sh


            otto Otto Kekäläinen added a comment -

            I confirm I've encountered this in a production system as well, and that applying the patch from http://lists.askmonty.org/pipermail/commits/2018-February/012037.html seemed to fix it. So this means a simple Galera node start has been broken, in different ways, in all versions since 10.1.27.

            serg dbart cvicentiu I plan to make a simple Galera cluster test setup in buildbot when I have time, to get rid of this class of errors in releases.

            elenst Elena Stepanova added a comment - - edited

            FYI otto: see my comment in MDEV-15409.

            Aurelien_LEQUOY Aurélien LEQUOY added a comment -

            Are you sure it's in wsrep_sst_xtrabackup-v2.sh? You can change it from here and fix it, but for me the root problem is here: https://jira.mariadb.org/browse/MDEV-15383?focusedCommentId=108624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-108624


            People

              Assignee: sachin.setiya.007 Sachin Setiya (Inactive)
              Reporter: claudio.nanni Claudio Nanni
              Votes: 11
              Watchers: 33
