[MDEV-15254] 10.1.31 does not join an existing cluster with SST xtrabackup-v2 Created: 2018-02-08 Updated: 2020-08-25 Resolved: 2018-02-21
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera SST, wsrep |
| Affects Version/s: | 10.1.31, 10.2.13 |
| Fix Version/s: | 10.1.32, 10.2.14 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Claudio Nanni | Assignee: | Sachin Setiya (Inactive) |
| Resolution: | Fixed | Votes: | 11 |
| Labels: | None |
| Environment: | Linux |
| Description |
10.1.31 does not join any Galera cluster with SST xtrabackup-v2. To reproduce: start a 10.1.31 node and wait for it to connect to a cluster; it won't, timing out at:

The Donor does not receive the SST request. Workaround: use wsrep_sst_common and wsrep_sst_xtrabackup-v2 from versions < 10.1.28. Scripts from 10.1.28/29 can't be used as they have other bugs.
| Comments |
| Comment by Craig bailey [ 2018-02-08 ] |
Seeing the exact same issue on multiple clusters.
| Comment by Zdravelina Sokolovska (Inactive) [ 2018-02-10 ] |
A node with 10.1.31 from the MariaDB 10.1 repo and percona-xtrabackup.x86_64 0:2.3.6-1 from the Percona repo joined a Galera cluster running the same MariaDB and percona-xtrabackup versions; checked with the attached server configs on CentOS 7.4.
| Comment by Logan V [ 2018-02-12 ] |
Seeing the same issue on Ubuntu 16.04, package versions as follows:
| Comment by Claudio Nanni [ 2018-02-12 ] |
The problem is only with wsrep_sst_common and wsrep_sst_xtrabackup-v2. winstone, did you trigger SST? Please also see: https://jira.mariadb.org/browse/MDEV-14313
| Comment by Jean-Philippe Evrard [ 2018-02-12 ] |
You can also use the wsrep_sst_common and wsrep_sst_xtrabackup-v2 scripts from 10.1.30 with 10.1.31's packages, and they will work. The problem arises purely in the scripts.
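A hedged sketch of that rollback (the thread gives no exact commands, so all paths here are assumptions; on a real system the 10.1.30 scripts would come from the older package or source tarball and the installed copies live in /usr/bin, without the .sh suffix). Demonstrated against scratch directories in /tmp:

```shell
#!/bin/bash
set -ue

# Stand-ins for an extracted 10.1.30 source tree and for /usr/bin
# (hypothetical locations, used so this sketch is self-contained).
old_tree=/tmp/mariadb-10.1.30/scripts
bindir=/tmp/fake-usr-bin
mkdir -p "$old_tree" "$bindir"
printf '#!/bin/bash\n# 10.1.30 copy\n' > "$old_tree/wsrep_sst_common.sh"
printf '#!/bin/bash\n# 10.1.30 copy\n' > "$old_tree/wsrep_sst_xtrabackup-v2.sh"

# The workaround itself: overwrite the broken scripts with the known-good
# 10.1.30 copies, keeping them executable and dropping the .sh suffix to
# match the installed names.
install -m 0755 "$old_tree/wsrep_sst_common.sh" "$bindir/wsrep_sst_common"
install -m 0755 "$old_tree/wsrep_sst_xtrabackup-v2.sh" "$bindir/wsrep_sst_xtrabackup-v2"
```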
| Comment by Zdravelina Sokolovska (Inactive) [ 2018-02-12 ] |
Hello,
| Comment by Jean-Philippe Evrard [ 2018-02-12 ] |
Log: http://paste.ubuntu.com/=BRFvBdFB5x/ Adding wsrep_sst_receive_address=10.1.0.4:4444 doesn't change things. Rolling back to the 10.1.30 wsrep_sst_common and xtrabackup-v2 scripts, with no wsrep_sst_receive_address:
| Comment by Claudio Nanni [ 2018-02-12 ] |
winstone, please let's not add irrelevant information: wsrep_sst_receive_address has nothing to do with this issue.

Jean-Philippe, the two extra lines that you saw:

are a regular message you get when you specify the node's own IP in the cluster address; if you check carefully, it is node `72d00ba6` timing out while connecting to `72d00ba6`. Again, the problem is in the wsrep_sst_common and wsrep_sst_xtrabackup-v2 scripts.

winstone, can you share the error log from both Donor and Joiner, and innobackup.backup.log from the Donor, after the joiner successfully joins with SST?
| Comment by Jean-Philippe Evrard [ 2018-02-12 ] |
Sorry, wrong wording on my side. Yes, it's failing, and the only way to get something to work is to revert to the 10.1.30 scripts. That's what I meant.
| Comment by Zdravelina Sokolovska (Inactive) [ 2018-02-14 ] |
Issue confirmed on MariaDB 10.1.31 / CentOS 7.4: the node fails to join the cluster with wsrep_sst_method=xtrabackup-v2. A workaround, as for 10.1.31:
| Comment by Niels Hendriks [ 2018-02-14 ] |
Hello, this also affects MariaDB 10.2.13 on Debian 8 amd64. The workaround scripts provided by Claudio fix the issue. Is a (regression) test for SST really that hard?
| Comment by Claudio Nanni [ 2018-02-14 ] |
Winstone,
| Comment by David Wang [ 2018-02-14 ] |
I spent many hours debugging this issue. I thought I was doing something wrong and was googling for solutions; I only discovered this ticket after I had figured out what was wrong and was about to submit a bug. Very frustrating.

The problem (in my setup) is the change to the implementation of wait_for_listen in wsrep_sst_xtrabackup-v2.sh. The new script uses lsof, which always exits with an error code if it can't find all of the items it was asked for. Because the script sets the -e option in its shebang line (#!/bin/bash -ue), it aborts right after running lsof if even a single item among those listed in its arguments is missing. This happens even if socat is running and listening, because lsof can't find nc. The loop in wait_for_listen therefore always quits after one iteration without writing the "ready" line to signal the parent.

I'm not sure what the point of changing this function was, since it worked before, and I find it hard to imagine that the new code would pass even the most cursory of tests.
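The failure mode described above can be reduced to a few lines of shell. This is a hypothetical reduction, not the actual SST script: grep stands in for lsof (both exit non-zero when a requested item has no match), and the "ready" echo stands in for the signal to the parent.

```shell
#!/bin/bash
# Broken variant: under `set -ue`, the non-zero exit of the grep (stand-in
# for lsof finding no `nc` process) kills the script before "ready".
cat > /tmp/sst_probe.sh <<'EOF'
set -ue
LSOF_OUT=$(echo "socat 123 LISTEN" | grep "nc")
echo "ready"
EOF

# Guarded variant: `|| :` swallows the non-zero exit, so the script survives
# an empty result and can go on to signal readiness.
cat > /tmp/sst_probe_fixed.sh <<'EOF'
set -ue
LSOF_OUT=$(echo "socat 123 LISTEN" | grep "nc" || :)
echo "ready"
EOF

bash /tmp/sst_probe.sh || echo "probe aborted before signalling ready"
bash /tmp/sst_probe_fixed.sh   # prints "ready"
```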
| Comment by Daniel Black [ 2018-02-15 ] |
planetbeing, thank you so much for the detailed diagnosis. I've unit tested the lsof call and can confirm exactly what you say. As such, the patch below should be sufficient. I'm sorry I've run out of time to test this today.

I apologise for not testing this change at a macro level; I'm not sure why I didn't pick it up or test it enough. There is a galera.galera_sst_xtrabackup-v2 test; maybe sachin.setiya.007 can see why this wasn't picked up in buildbot. The change to lsof was needed to support FreeBSD, where ss was unavailable, and lsof seems to behave the same way on Linux and FreeBSD. Sorry again to everyone affected by this change. I humbly apologise, and hopefully this one-line patch will correct it.
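The export above does not include the patch body itself, so the following is a hypothetical reconstruction of the kind of one-line guard discussed in this thread, not the verbatim MariaDB commit (the function shape and port handling are assumptions): appending `|| :` neutralises lsof's non-zero exit so that a `bash -ue` script keeps polling instead of aborting, and stderr is discarded since only stdout is needed.

```shell
# Hypothetical sketch of the guarded wait_for_listen polling loop.
wait_for_listen()
{
    local port="$1"
    local i lsof_out
    for i in $(seq 1 300)
    do
        # `|| :` forces exit status 0 even when lsof matches nothing;
        # 2>/dev/null drops warnings, since only stdout is inspected.
        lsof_out=$(lsof -sTCP:LISTEN -i "TCP:${port}" -a -c nc -c socat -F c 2>/dev/null || :)
        [ -n "${lsof_out}" ] && break
        sleep 0.2
    done
    echo "ready ${port}"
}
```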
| Comment by Jean-Philippe Evrard [ 2018-02-15 ] |
Ahah, I am glad I wasn't wrong about the LISTEN. I've tested it locally, and it seems to work.

(See also the full log: http://paste.ubuntu.com/p/FDYwTv83ph/) This is with the patch above applied. So while it fixes the issue, other issues might have slipped through the cracks. Please note: we have tests for functional clustering, using Ansible, in our openstack-ansible-galera_server role, if someone wants to improve collaboration/improve test coverage.
| Comment by David Wang [ 2018-02-15 ] |
@danblack Thanks for addressing the issue! My apologies for my grumpiness when initially reporting my analysis.
| Comment by Sachin Setiya (Inactive) [ 2018-02-15 ] |
Hi danblack! This issue was not caught on buildbot because that SST test, galera.galera_sst_xtrabackup-v2, is a --big test, which is not run by default on buildbot.
| Comment by Daniel Black [ 2018-02-15 ] |
evrardjp, planetbeing, thanks for testing. Redirect stderr there to /dev/null too, as only stdout is needed.

planetbeing, no worries about the grumpiness; it was well deserved. Not enough testing, mine or automated, was done for this critical functional area of Galera.
| Comment by Tomas Mozes [ 2018-02-20 ] |
Thank you @danblack, the patch works for 10.1.31. Will it make it into the next release?
| Comment by Daniel Black [ 2018-02-20 ] |
sachin.setiya.007, I'm assuming that, as a blocker bug, you're taking care of pushing this?
| Comment by Sachin Setiya (Inactive) [ 2018-02-21 ] |
Hi danblack, yes I will push it.
| Comment by Sachin Setiya (Inactive) [ 2018-02-21 ] |
http://lists.askmonty.org/pipermail/commits/2018-February/012037.html
| Comment by Sachin Setiya (Inactive) [ 2018-02-22 ] |
Fixed in wsrep_sst_xtrabackup-v2.sh.
| Comment by Otto Kekäläinen [ 2018-03-10 ] |
I confirm I've encountered this in a production system as well, and applying the patch from http://lists.askmonty.org/pipermail/commits/2018-February/012037.html seemed to fix it. So this means a simple Galera node start has been broken in different ways in every version since 10.1.27. serg dbart cvicentiu: I plan to make a simple Galera cluster test setup in buildbot when I have time, to get rid of this class of errors in releases.
| Comment by Elena Stepanova [ 2018-03-12 ] |
FYI otto: see my comment in MDEV-15409
| Comment by Aurélien LEQUOY [ 2018-03-20 ] |
Are you sure it's in wsrep_sst_xtrabackup-v2.sh? You can change it there and fix it, but for me the root problem is here: https://jira.mariadb.org/browse/MDEV-15383?focusedCommentId=108624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-108624