[MDEV-27437] Galera snapshot transfer fails to upgrade between some major versions Created: 2022-01-07 Updated: 2023-11-29 Resolved: 2022-04-13 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera SST |
| Affects Version/s: | 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8 |
| Fix Version/s: | 10.3.35, 10.4.25, 10.5.16, 10.6.8, 10.7.4, 10.8.3, 10.9.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ramesh Sivaraman | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
| Description |
|
mariabackup SST fails sporadically in the prepare stage. The problem only occurs when node1 and node2 have an active read/write workload.
Testcase
|
| Comments |
| Comment by Marko Mäkelä [ 2022-01-07 ] | |||||||||||||||||||||
|
I implemented an improvement in The following messages might be made less verbose for the mariadb-backup --prepare in 10.8, but they are correctly identifying the cause:
I will rephrase the first message to say mariadb-backup --prepare instead of InnoDB, and I will suppress the second one. An attempt to run a newer mariadb-backup --backup against an older server should result in a message like this:
| |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-01-14 ] | |||||||||||||||||||||
|
As far as I understand, Galera snapshot transfer does not work for a major upgrade at least in the following cases:
To fix this, I suggest that the snapshot transfer (SST) scripts (both rsync and mariabackup or mariadb-backup variants) be improved in one of the following ways:
Note that the XA PREPARE transactions whose existence would prevent an upgrade from 10.2 or earlier will not be automatically removed by a server startup and clean shutdown (something that the rsync method normally skips), nor by mariabackup --prepare or mariadb-backup --prepare. | |||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2022-03-21 ] | |||||||||||||||||||||
|
seppo, can we really do something reliable about this, or should we just document that rolling upgrade is not supported between major releases? | |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-07 ] | |||||||||||||||||||||
|
I expect that | |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-07 ] | |||||||||||||||||||||
|
ramesh, can you please test the upgrades again to confirm or refute my claim? I analyzed one test failure that was with rsync SST, from 10.4 cbdf62ae907ad42ceb7a65e070b821bb45e07be9 (including the fix of When it comes to upgrading from 10.2, there are two problems: neither | |||||||||||||||||||||
| Comment by Seppo Jaakola [ 2022-04-07 ] | |||||||||||||||||||||
|
The planned use case for cluster upgrading is to carry out the node upgrade fast enough that the node can join back through IST. The gcache should be sized to allow enough headroom for the upgrade. If an upgraded node joins back through SST, it will pollute the once-upgraded data directory, so I feel it is pointless even to try to support SST in cluster upgrading. | |||||||||||||||||||||
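A rough sizing sketch for the gcache headroom mentioned above. It assumes the write-set rate has been measured separately (for example from the wsrep_replicated_bytes and wsrep_received_bytes status counters sampled over an interval). The function name and the 50% default headroom are illustrative assumptions, not values from this ticket.

```shell
#!/bin/sh
# Hypothetical helper: estimate how many bytes of gcache are needed so that
# a node can stay down for DOWNTIME_SEC seconds and still rejoin through IST.
# Usage: estimate_gcache_bytes RATE_BYTES_PER_SEC DOWNTIME_SEC [HEADROOM_PERCENT]
estimate_gcache_bytes() {
    rate=$1                 # measured write-set volume per second (assumption: supplied by the caller)
    downtime=$2             # expected duration of the node upgrade, in seconds
    headroom=${3:-50}       # extra margin in percent (assumed default)
    echo $(( rate * downtime * (100 + headroom) / 100 ))
}

# Example: 2 MB/s of write sets, 15 minutes of downtime, default headroom
estimate_gcache_bytes 2000000 900
```

The resulting value would be applied through the gcache.size setting in wsrep_provider_options before the upgrade starts.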
| Comment by Ramesh Sivaraman [ 2022-04-07 ] | |||||||||||||||||||||
|
marko, mariabackup SST works fine in the 10.4 > 10.5 upgrade using the latest builds. No SST failures were found in the 10.5 > 10.6 and 10.6 > 10.7 upgrades either. But mariabackup --prepare does not work in the 10.7 > 10.8 upgrade
| |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-08 ] | |||||||||||||||||||||
|
Thank you, ramesh. A | |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-08 ] | |||||||||||||||||||||
|
ramesh, I see that there is a typo in the message:
But, the root cause of that failure should be that the old-format redo log was not logically empty. Can you provide a copy of that log, and double-check that you tested the currently latest 10.7 revision 2d8e38bc9477aa00b371ed14d95390bede70c5cb that is the oldest 10.7 revision to contain the fix of | |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-08 ] | |||||||||||||||||||||
|
So far, the reason for the failures appears to be that writes are not being blocked in 10.5 to 10.7 during the Galera snapshot transfer. I will need a copy of a logically nonempty ib_logfile0 as well as an rr record trace of the donor in order to debug this. | |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-11 ] | |||||||||||||||||||||
|
ramesh, the rr replay trace that you claimed to be for a Galera SST donor is not showing any sign of writes being disabled:
jplindst, please determine and fix the root cause. I believe that already 10.5 should be affected by this. | |||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2022-04-11 ] | |||||||||||||||||||||
|
marko Root cause is
| |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-12 ] | |||||||||||||||||||||
|
jplindst, I understand that for normal Galera mariabackup SST it makes sense not to block writes, because allowing concurrent writes should be the only benefit over the simpler rsync based snapshot transfer. If writes are not blocked, I do not think that mariabackup based SST upgrade can possibly work when upgrading from an earlier major version to 10.2 or later, 10.5 or later ( A minimal solution could be to simply document that mariabackup SST cannot be used for upgrades. This is already enforced by mariabackup or mariadb-backup. An alternative would be to implement an environment variable in the mariabackup SST script that allows an older-version mariabackup --prepare to be invoked. Something like this:
If the environment variable BACKUP_PREPARE_BIN is set for invoking the joiner, the correct executable would be used for applying the log. | |||||||||||||||||||||
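The selection logic described above could be sketched as follows. This is an illustration only, not the patch posted in this ticket: the function name select_prepare_bin and the DEFAULT_BACKUP_BIN fallback variable are hypothetical, and only BACKUP_PREPARE_BIN comes from the comment.

```shell
#!/bin/sh
# Hypothetical sketch: choose which executable the joiner uses for --prepare.
# BACKUP_PREPARE_BIN is the variable named in the comment above;
# DEFAULT_BACKUP_BIN and the "mariabackup" fallback are assumptions.
select_prepare_bin() {
    if [ -n "${BACKUP_PREPARE_BIN:-}" ]; then
        if [ -x "$BACKUP_PREPARE_BIN" ]; then
            # An explicitly configured (older-version) executable wins.
            printf '%s\n' "$BACKUP_PREPARE_BIN"
            return 0
        fi
        echo "BACKUP_PREPARE_BIN is not executable: $BACKUP_PREPARE_BIN" >&2
        return 1
    fi
    # No override: fall back to the current-version tool found on PATH.
    printf '%s\n' "${DEFAULT_BACKUP_BIN:-mariabackup}"
}

# Example use inside a joiner script (backup directory is a placeholder):
#   prepare_bin=$(select_prepare_bin) || exit 1
#   "$prepare_bin" --prepare --target-dir="$backup_dir"
```

Failing when the override is set but unusable, rather than silently falling back, keeps a misconfigured path from triggering the same version-mismatch failure during --prepare.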
| Comment by Ramesh Sivaraman [ 2022-04-12 ] | |||||||||||||||||||||
|
marko, with an older-version mariabackup (older binary location hard-coded in the 10.8 SST script), SST worked fine in the 10.7 > 10.8 upgrade
| |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-12 ] | |||||||||||||||||||||
|
ramesh, thank you. Automating the solution in something like a package installation script would seem to be somewhat challenging, so maybe this work-around should only be documented with a detailed example. An easier way could be that users must simply enable rsync based snapshot transfer before major-version upgrades. Theoretically, an upgrade with mariabackup could work if the user made a copy (or hard link) of the old backup executable and then pointed an environment variable or a configuration parameter to it, for the duration of upgrading the server. An interface for specifying the full path to the executable would be needed in wsrep_sst_mariabackup, as demonstrated by the patch that I posted. After the upgrade, the old executable could be removed and the configuration restored to normal (so that a snapshot transfer will continue to work from a current-version donor). I quickly checked the Debian packaging scripts. Baking such upgrade logic directly into those scripts would be very challenging to implement and test. Something like the following would happen:
| |||||||||||||||||||||
| Comment by Ramesh Sivaraman [ 2022-04-12 ] | |||||||||||||||||||||
|
marko, I agree with using rsync SST during major upgrades. The 10.7 > 10.8 upgrade works well with rsync SST. | |||||||||||||||||||||
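One way to switch a joiner to rsync SST for the duration of a major upgrade is a temporary configuration drop-in. The sketch below assumes the server reads *.cnf files from a drop-in directory; the file name zz-upgrade-sst.cnf and the function name are hypothetical, and the directory must be supplied by the caller.

```shell
#!/bin/sh
# Hypothetical helper: write a temporary override that selects rsync SST.
# Usage: write_sst_override CONF_DIR
write_sst_override() {
    conf_dir=$1
    # [mysqld] is used because it is read by all server versions.
    cat > "$conf_dir/zz-upgrade-sst.cnf" <<'EOF'
[mysqld]
wsrep_sst_method=rsync
EOF
}

# After the upgrade, remove the override to restore the normal SST method:
#   rm "$conf_dir/zz-upgrade-sst.cnf"
```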
| Comment by Otto Kekäläinen [ 2022-04-12 ] | |||||||||||||||||||||
|
I read this issue on request by marko but I am not sure if I understood it fully. Even with potentially lacking understanding, I would strongly advise the following:
But indeed, priority number 1 is to document such changes in the release notes of major releases. Those are read by sysadmins, and that is the place where sysadmins expect they might occasionally have to do something extra to make the upgrade work right. | |||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-13 ] | |||||||||||||||||||||
|
Thank you, otto. This ticket is indeed about upgrades between major versions. We do avoid changes to file formats after a release series has been declared generally available (GA). There are some exceptions, such as I see that the default setting is wsrep_sst_method=rsync. According to the tests conducted by ramesh, major version upgrades using that method will work reliably if the donor (running the older major version) includes the fixes of I updated the following pages to document the limitations around major version upgrades: |