Details
- Type: Bug
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 10.1.31, 10.2.13
- Fix Version/s: None
- Environment: Linux
Description
10.1.31 does not join any Galera cluster with SST xtrabackup-v2.
To reproduce: start a 10.1.31 node and wait for it to connect to a cluster. It won't, timing out at:
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:12209,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180208 15:37:38.407)
The Donor does not receive the SST request.
Workaround: use wsrep_sst_common and wsrep_sst_xtrabackup-v2 from versions < 10.1.28.
Scripts from 10.1.28/29 can't be used as they have other bugs.
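A sketch of that script swap (illustrative only: /usr/bin is the usual install location for the SST scripts, and the source path below is a placeholder; back up the originals first):
# Back up the broken 10.1.31 scripts.
cp /usr/bin/wsrep_sst_common /usr/bin/wsrep_sst_common.10.1.31.bak
cp /usr/bin/wsrep_sst_xtrabackup-v2 /usr/bin/wsrep_sst_xtrabackup-v2.10.1.31.bak
# Copy in the scripts from an older release, e.g. an extracted
# mariadb-10.1.26-linux-x86_64.tar.gz (placeholder path, adjust as needed).
cp /path/to/mariadb-10.1.26-linux-x86_64/bin/wsrep_sst_common /usr/bin/
cp /path/to/mariadb-10.1.26-linux-x86_64/bin/wsrep_sst_xtrabackup-v2 /usr/bin/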
Attachments
- server1.cnf (1 kB)
- server2.cnf (1 kB)
- wsrep_sst_common (5 kB)
Issue Links
- relates to:
  - MDEV-14313 Joiner fails to SST when upgraded to 10.1.28 from 10.1.23 (Closed)
  - MDEV-16032 10.0.34 (probably) package broken on Ubuntu Trusty (unable to join Galera cluster) (Closed)
  - MDEV-15351 wsrep_sst_xtrabackup is broken in 10.1.31 (Closed)
  - MDEV-15355 upgrade to 10.1.31 break Galera SST/IST transfer (Closed)
  - MDEV-15365 Cannot upgrade MariaDB Galera Cluster to 10.1.31 (Closed)
Activity
A node with packages from the MariaDB 10.1 repo (10.1.31) and the Percona repo (percona-xtrabackup.x86_64 0:2.3.6-1) joined a Galera cluster loaded with the same MariaDB and percona-xtrabackup versions; checked with the attached server cnfs on CentOS 7.4.
Seeing the same issue on Ubuntu 16.04, package versions as follows:
# dpkg -l | grep -E '(mariadb|xtrabackup)'
ii libmariadbclient-dev 10.1.31+maria-1~xenial amd64 MariaDB database development files
ii libmariadbclient18 10.1.31+maria-1~xenial amd64 MariaDB database client library
ii mariadb-client 10.1.31+maria-1~xenial all MariaDB database client (metapackage depending on the latest version)
ii mariadb-client-10.1 10.1.31+maria-1~xenial amd64 MariaDB database client binaries
ii mariadb-client-core-10.1 10.1.31+maria-1~xenial amd64 MariaDB database core client binaries
ii mariadb-common 10.1.31+maria-1~xenial all MariaDB database common files (e.g. /etc/mysql/conf.d/mariadb.cnf)
ii mariadb-server-10.1 10.1.31+maria-1~xenial amd64 MariaDB database server binaries
ii mariadb-server-core-10.1 10.1.31+maria-1~xenial amd64 MariaDB database core server files
ii percona-xtrabackup-24 2.4.5-1.xenial amd64 Open source backup tool for InnoDB and XtraDB
The problem is just with wsrep_sst_common and wsrep_sst_xtrabackup-v2.
You can use the scripts from 10.1.26 which will work with 10.1.31.
For convenience, I'm attaching the two scripts (wsrep_sst_common and wsrep_sst_xtrabackup-v2) from mariadb-10.1.26-linux-x86_64.tar.gz.
winstone, did you trigger SST?
Please also see this: https://jira.mariadb.org/browse/MDEV-14313
You can also use the wsrep_sst_common and wsrep_sst_xtrabackup-v2 scripts from 10.1.30 with 10.1.31's packages and it will work.
The problem arises purely in the scripts.
I suspect changes in the listen addresses. By default we have a list of addresses for the cluster; the SST should connect to the first node, which is localhost in our case, and should reject it. We see the attempt to connect to the donor, but nothing is received from the donor, which makes me think of a networking issue. On top of that, using nc/tar instead of the defaults doesn't seem to fix the issue.
Hello,
As mentioned above, the node loaded with v10.1.31 succeeded in joining an existing cluster with SST xtrabackup-v2; the xtrabackup package version used was percona-xtrabackup.x86_64 0:2.3.6.
As no server configuration files were attached, I attached the ones I used.
Note: after adding the configuration line wsrep_sst_receive_address=self_address:4444 (for example wsrep_sst_receive_address=192.168.0.100:4444), the node joined successfully.
Note 2: on the next version, 10.2, it turns out that setting wsrep_sst_receive_address is not needed.
log: http://paste.ubuntu.com/=BRFvBdFB5x/
conf of clustering: http://paste.ubuntu.com/=fQkX6JVpnN/
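For reference, the setting described in this comment looks like this in the joiner's server config (a sketch; the address is the example value from above and should be the joiner's own IP):
[mysqld]
# Example value only; use the joiner node's own address and SST port.
wsrep_sst_receive_address=192.168.0.100:4444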
Adding wsrep_sst_receive_address=10.1.0.4:4444 doesn't change things.
Adding wsrep_sst_receive_address=10.1.0.4:1234 goes further: http://paste.ubuntu.com/=FdC5z7nJ56/
Rolling back to the 10.1.30 wsrep_sst_common and wsrep_sst_xtrabackup-v2 scripts, with no wsrep_sst_receive_address:
http://paste.ubuntu.com/=cv5YQFHZr2/
winstone, please let's not add irrelevant information: wsrep_sst_receive_address has nothing to do with this issue.
Jean-Philippe,
when you used port 1234 it didn't go any further; look closely:
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180212 13:57:40.929)
WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180212 13:59:20.933)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180212 13:59:20.935)
2018-02-12 13:59:20 140062985549568 [ERROR] WSREP: Failed to read 'ready <addr>'
2018-02-12 13:59:20 140062985549568 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.0.4:4444' --datadir '/var/lib/mysql/' --parent '28980' --binlog '/var/lib/mysql/mariadb-bin' : 32 (Broken pipe)
2018-02-12 13:59:20 140063206394624 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2018-02-12 13:59:20 140063206394624 [ERROR] Aborting

WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:1234,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20180212 14:03:08.907)
WSREP_SST: [ERROR] Possible timeout in receving first data from donor in gtid stage (20180212 14:04:48.912)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20180212 14:04:48.915)
2018-02-12 14:04:48 140161341978368 [ERROR] WSREP: Failed to read 'ready <addr>'
2018-02-12 14:04:48 140161341978368 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.0.4:1234' --datadir '/var/lib/mysql/' --parent '29507' --binlog '/var/lib/mysql/mariadb-bin' : 32 (Broken pipe)
2018-02-12 14:04:48 140161562221312 [ERROR] WSREP: Failed to prepare for 'xtrabackup-v2' SST. Unrecoverable.
2018-02-12 14:04:48 140161562221312 [ERROR] Aborting
The two extra lines that you saw:
2018-02-12 14:03:09 140161388115712 [Note] WSREP: (72d00ba6, 'tcp://0.0.0.0:4567') connection to peer 72d00ba6 with addr tcp://10.1.0.4:4567 timed out, no messages seen in PT3S
2018-02-12 14:03:09 140161388115712 [Note] WSREP: (72d00ba6, 'tcp://0.0.0.0:4567') turning message relay requesting off
are regular messages you see when you specify the node's own IP in the cluster address; if you check carefully, it's node `72d00ba6` timing out connecting to `72d00ba6`, i.e. to itself.
Again, the problem is in the wsrep_sst_common and wsrep_sst_xtrabackup-v2 scripts.
winstone can you share the error log from both Donor and Joiner and innobackup.backup.log from the Donor after the joiner successfully joins with SST?
Sorry, wrong wording on my side. Yes, it's failing, and the only way to get something to work is to revert to the 10.1.30 scripts. That's what I meant.
Issue confirmed on MariaDB 10.1.31 / CentOS 7.4: the node fails to join the cluster with wsrep_sst_method=xtrabackup-v2.
A workaround for 10.1.31:
1. Install the same percona-xtrabackup version as on the other cluster nodes - I used percona-xtrabackup.x86_64 0:2.3.6.
2. On the Joiner node, comment out the row #wsrep_sst_method=xtrabackup-v2 in the server config.
3. Join the node: service mysql start --wsrep_cluster_address=gcomm://192.168.104.195
Starting mysql (via systemctl): [ OK ]
4. Restore the row wsrep_sst_method=xtrabackup-v2 in the server config.
5. Restart the mysql service on the Joiner node: service mysql restart
Restarting mysql (via systemctl): [ OK ]
6. Issue any DML event on any node and verify that the Joiner DB is updated.
Note: this workaround only works in a fairly quiet cluster, i.e. the cluster state must not change between steps 3 and 5.
Hello,
This also affects MariaDB 10.2.13 on Debian 8 amd64.
I just spent hours debugging this issue, and I'm very frustrated that this is broken again after the other issues described here: https://jira.mariadb.org/browse/MDEV-14256
The workaround scripts provided by Claudio fix the issue.
Is a (regression) test for SST really that hard?
Winstone,
that's not a workaround; that's disabling xtrabackup-v2 SST, using rsync for the first join, and then re-enabling xtrabackup-v2 so it will fail at the first occasion.
The easy and proven workaround is to use the two working scripts from previous versions.
I spent many hours debugging this issue. I thought I was doing something wrong and I was trying to google for solutions. I only discovered this ticket after I figured out what was wrong and was going to submit a bug. Very frustrating.
The problem (in my setup) is the change in the implementation of wait_for_listen in wsrep_sst_xtrabackup-v2.sh. The new script uses lsof which will always exit with an error code if it can't find all the items, and because the script has the -e option set in the hashbang line (#!/bin/bash -ue), the script will abort right after running lsof if lsof can't find even a single item among all the items listed in its arguments. This will happen even if socat is running and listening, because it can't find nc. The loop in wait_for_listen will therefore always quit after one iteration without writing the "ready" line to signal the parent.
I'm not sure what the point of changing this function is, since it worked in the past, and I find it hard to imagine that the new code would pass even the most cursory of tests.
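To illustrate planetbeing's analysis, here is a minimal sketch of the failure mode (not the actual script; port 4444 is an example value). Under `#!/bin/bash -ue`, the exit status of a plain assignment is the exit status of its command substitution, so a non-zero lsof aborts the whole script on the first loop iteration, before "ready" can ever be written:
#!/bin/bash -ue
# Sketch only: lsof exits non-zero when any of the -c command names has no
# matching process (e.g. socat is listening but nc is not running), and the
# -e option then terminates the script at this assignment.
for i in {1..300}
do
    LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:4444 -a -c nc -c socat -F c)
    [ -n "${LSOF_OUT}" ] && break
    sleep 0.2
done
echo "ready"   # never reached when lsof exits non-zero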
planetbeing, thank you so much for the detailed diagnosis. I've unit tested the lsof call and can confirm exactly what you say; as such, the patch below should be sufficient. I'm sorry, I've run out of time to test this today.
diff --git a/scripts/wsrep_sst_xtrabackup-v2.sh b/scripts/wsrep_sst_xtrabackup-v2.sh
index 64dd182e2f2..39c75aaaf9e 100644
--- a/scripts/wsrep_sst_xtrabackup-v2.sh
+++ b/scripts/wsrep_sst_xtrabackup-v2.sh
@@ -644,7 +644,7 @@ wait_for_listen()
     for i in {1..300}
     do
-        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c)
+        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c || :)
         [ -n "${LSOF_OUT}" ] && break
         sleep 0.2
     done
I apologise for not testing this change at a macro level; I'm not sure why I didn't pick it up or test it enough. There is a galera.galera_sst_xtrabackup-v2 test; maybe sachin.setiya.007 can see why it wasn't picked up in buildbot.
The need for the change to lsof was to support FreeBSD, as ss was unavailable there and lsof seems to behave the same way on Linux and FreeBSD.
Sorry again to everyone affected by this change. I humbly apologise, and hopefully this one-line patch will correct it.
Ahah I am glad I wasn't wrong with the LISTEN
I've tested it locally, and it seems to work.
I've noticed this thing in the log doesn't disappear with the patch above:
lsof: WARNING: can't stat() ext4 file system /root/src
Output information may be incomplete.
lsof: WARNING: can't stat() ext4 file system /root/.cache/pip
Output information may be incomplete.
(See also the full log: http://paste.ubuntu.com/p/FDYwTv83ph/) This is with the patch above applied.
So while it fixes the issue, there might be other issues that slipped through the cracks.
Please note: we have tests for functional clustering, using Ansible, in our openstack-ansible-galera_server role, if someone wants to improve collaboration / test coverage.
@danblack Thanks for addressing the issue! Many apologies for my grumpiness when initially reporting my analysis.
Hi danblack!
This issue was not caught on buildbot because that SST test, galera.galera_sst_xtrabackup-v2, is a --big test, which is not run by default on buildbot.
evrardjp, planetbeing, thanks for testing. I'm redirecting stderr to /dev/null there too, as only stdout is needed.
diff --git a/scripts/wsrep_sst_xtrabackup-v2.sh b/scripts/wsrep_sst_xtrabackup-v2.sh
index 64dd182e2f2..9104daf19bc 100644
--- a/scripts/wsrep_sst_xtrabackup-v2.sh
+++ b/scripts/wsrep_sst_xtrabackup-v2.sh
@@ -644,7 +644,7 @@ wait_for_listen()
     for i in {1..300}
     do
-        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c)
+        LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:${PORT} -a -c nc -c socat -F c 2> /dev/null || :)
         [ -n "${LSOF_OUT}" ] && break
         sleep 0.2
     done
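For reference, a quick sanity check of the patched line (a sketch; port 4444 is the default SST port seen in the logs above). With stderr discarded and `|| :` appended, the command substitution can no longer abort a `bash -ue` script, and stdout carries only the `-F c` fields when a matching listener exists:
# Sketch: run on a host where socat is listening on the SST port.
LSOF_OUT=$(lsof -sTCP:LISTEN -i TCP:4444 -a -c nc -c socat -F c 2> /dev/null || :)
echo "LSOF_OUT='${LSOF_OUT}'"   # non-empty (e.g. p<pid>/csocat fields) only while a listener is up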
planetbeing, no worries about the grumpiness; it was well deserved. Not enough testing, mine or automated, was done for this critical functional area of Galera.
Thank you @danblack, the patch works for 10.1.31. Will it make it into the next release?
sachin.setiya.007, I'm assuming that, as this is a blocker bug, you're taking care of pushing this?
Fixed wsrep_sst_xtrabackup-v2.sh
I confirm I've encountered this in a production system as well, and that applying the patch from http://lists.askmonty.org/pipermail/commits/2018-February/012037.html seemed to fix it. This means a simple Galera node start has been broken in different ways in all versions since 10.1.27.
serg, dbart, cvicentiu: I plan to make a simple Galera cluster test setup in buildbot when I have time, to get rid of this class of errors in releases.
Are you sure it's in wsrep_sst_xtrabackup-v2.sh? You can change it there and fix it, but for me the root problem is here: https://jira.mariadb.org/browse/MDEV-15383?focusedCommentId=108624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-108624
Seeing the exact same issue on multiple clusters.