[MDEV-14256] MariaDB 10.2.10 can't SST with xtrabackup-v2 Created: 2017-11-02 Updated: 2020-08-25 Resolved: 2017-11-22 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera SST, wsrep |
| Affects Version/s: | 10.2.8 |
| Fix Version/s: | 10.2.11 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jonathan Gazeley | Assignee: | Sergei Golubchik |
| Resolution: | Fixed | Votes: | 11 |
| Labels: | sst, wsrep | ||
| Environment: |
[jg4461@db2 ~]$ cat /etc/redhat-release |
||
| Issue Links: |
|
||||||||||||||||||||
| Sprint: | 10.2.11 | ||||||||||||||||||||
| Description |
|
Following an upgrade to MariaDB-server-10.2.10-1.el7.centos.x86_64 wsrep_sst_xtrabackup-v2 is unable to initiate an SST to join a node to the cluster. It fails with the following errors:
The root cause appears to be:
The WSREP_SST_OPT_PORT variable doesn't have a default value set, either in wsrep_sst_xtrabackup-v2 or in wsrep_sst_common. The following diff sets a default value for the WSREP_SST_OPT_PORT variable, and allows the SST to proceed.
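Independently of the reporter's diff, the general technique of giving a shell variable a fallback value can be sketched as follows. This is an illustration only, not the shipped fix: the function name `sst_port_or_default` is hypothetical, and the fallback 4444 is an assumption based on the conventional Galera SST port.

```shell
#!/bin/bash
# Sketch only. WSREP_SST_OPT_PORT is the variable named in this report;
# the helper name and the fallback value 4444 are assumptions.

# Print the SST port, falling back to a default when the variable is
# unset or empty. The ${VAR:-default} form does not trip `set -u`.
sst_port_or_default() {
    printf '%s\n' "${WSREP_SST_OPT_PORT:-4444}"
}

# Run under nounset in subshells so the option does not leak into the caller.
port_when_unset=$(set -u; unset WSREP_SST_OPT_PORT; sst_port_or_default)
port_when_set=$(set -u; WSREP_SST_OPT_PORT=5555; sst_port_or_default)

echo "unset -> ${port_when_unset}, set -> ${port_when_set}"
```

With a fallback in place, the expansion succeeds whether or not the caller supplied a port.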
|
| Comments |
| Comment by Niels Hendriks [ 2017-11-02 ] | |||||||||||||||||||
|
We have this issue as well on Debian Jessie with MariaDB 10.2.10 | |||||||||||||||||||
| Comment by Johan Andersson [ 2017-11-06 ] | |||||||||||||||||||
|
Hi, Any news / plans for this issue? | |||||||||||||||||||
| Comment by Andrii Nikitin (Inactive) [ 2017-11-06 ] | |||||||||||||||||||
|
johan-severalnines did the suggested workaround help you? You can also try removing the 'u' letter from the first line of the wsrep_sst_xtrabackup-v2 script, so it is just "#!/bin/bash -e", or put `set +u` at the start of the SST script. | |||||||||||||||||||
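For readers unfamiliar with the `-u` (nounset) option mentioned in this workaround, here is a minimal sketch of its effect. It does not touch the SST scripts; `DEMO_UNSET_PORT` is a hypothetical variable name used only for the demonstration.

```shell
#!/bin/bash
# Sketch: run the same expansion in child shells with and without -u.
# DEMO_UNSET_PORT is a hypothetical name, not a variable from the SST scripts.

unset DEMO_UNSET_PORT

# With -u (nounset), expanding an unset variable aborts the child shell.
if bash -u -c 'echo "port=${DEMO_UNSET_PORT}"' >/dev/null 2>&1; then
    with_u="ok"
else
    with_u="failed"
fi

# Without -u, the same expansion silently yields an empty string.
if bash -c 'echo "port=${DEMO_UNSET_PORT}"' >/dev/null 2>&1; then
    without_u="ok"
else
    without_u="failed"
fi

echo "with -u: ${with_u}; without -u: ${without_u}"
```

Dropping `-u` lets the script continue, but the variable is still empty, so this only hides the missing value rather than supplying one.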
| Comment by Johan Andersson [ 2017-11-06 ] | |||||||||||||||||||
|
Hi, BR | |||||||||||||||||||
| Comment by Kolbe Kegel (Inactive) [ 2017-11-06 ] | |||||||||||||||||||
|
Here's another good workaround that doesn't require editing any files included in the distribution:
| |||||||||||||||||||
| Comment by Eric Howey [ 2017-11-06 ] | |||||||||||||||||||
|
Can we take a moment to just reflect on how absurd it is that this bug was released to production? A simple test case, namely invoking an SST transfer with xtrabackup-v2, would have caught this issue. People like me are facing production outages due to this issue. A lot of faith has just been lost in the MariaDB product. | |||||||||||||||||||
| Comment by Niels Hendriks [ 2017-11-06 ] | |||||||||||||||||||
|
Yeah, I agree with Eric that the tests for MariaDB 10.2 could really use some improvements. I expect some bugs in the first tagged stable release of a new major version (10.2.6), but the following releases have also had bugs that can make it unusable in certain use cases. This has also been mentioned by "DEZILLIUM LIMITED" in his description at https://jira.mariadb.org/browse/MDEV-14255 :
At least for Debian 8 and 9 this means we had exactly 1 working MariaDB 10.2 stable release, which was 10.2.9. And now, it's broken again. With 10.1 we never had big issues like this, and we still don't. I get that no one puts in these bugs on purpose and I appreciate the effort put in by the dev team, but it would be really reassuring to have some feedback from the dev team regarding the prevention of these issues. Are there any plans to improve the tests? Does the release of MariaDB 10.2 not feel messier than 10.1 to you? As a sidenote, since everyone | |||||||||||||||||||
| Comment by Jonathan Gazeley [ 2017-11-07 ] | |||||||||||||||||||
|
This will indeed break it for everyone using xtrabackup-v2 but they might not realise it until it is too late. The upgrade procedure itself might only trigger an IST and they would not realise that SST is broken until they reboot a node later on etc. All but one of my production nodes broke upon upgrade (via nightly cron job). The only reason one node survived is because it had a broken yum config which prevented new packages from being updated. I was able to run my infrastructure on one node for 24 hours until I found the problem but this could easily have caused a production outage for me. | |||||||||||||||||||
| Comment by Andrii Nikitin (Inactive) [ 2017-11-09 ] | |||||||||||||||||||
|
The bug affects those systems which don't have an explicit port specified in the wsrep configuration. tearup - will create docker image with installed 10.2 Server and xtrabackup 2.4 When this line is uncommented - the fix is applied and no problem happens anymore. | |||||||||||||||||||
| Comment by Andrii Nikitin (Inactive) [ 2017-11-09 ] | |||||||||||||||||||
|
sachin.setiya.007 please review the following patch to address the problem in 10.2
| |||||||||||||||||||
| Comment by Sergei Golubchik [ 2017-11-10 ] | |||||||||||||||||||
|
anikitin what change introduced this bug? How come it didn't fail before? | |||||||||||||||||||
| Comment by Andrii Nikitin (Inactive) [ 2017-11-10 ] | |||||||||||||||||||
|
serg it looks like it was merged in this commit; before that, it parsed the address expression directly: | |||||||||||||||||||
| Comment by Sergei Golubchik [ 2017-11-12 ] | |||||||||||||||||||
|
I'm not sure that was it. Old code used
That would've thrown an error if --address is not used. New code does
Assuming that --address is used (because the old code didn't fail), I don't see how WSREP_SST_OPT_PORT could be unset. There was a later relevant commit — 4c2c057d404 — but I don't see how it could've left WSREP_SST_OPT_PORT unset either. | |||||||||||||||||||
| Comment by Andrii Nikitin (Inactive) [ 2017-11-12 ] | |||||||||||||||||||
|
Yes, correct. Then after fix for | |||||||||||||||||||
| Comment by Sergei Golubchik [ 2017-11-13 ] | |||||||||||||||||||
|
Committed a patch | |||||||||||||||||||
| Comment by Andrii Nikitin (Inactive) [ 2017-11-14 ] | |||||||||||||||||||
|
serg The patch is good and I've verified it by patching 10.2.10 in docker image from earlier case like this https://github.com/AndriiNikitin/bugs/blob/master/MDEV-14256-test1.sh#L24 |