[MDEV-30541] IST always fails -- wsrep_sst_mariabackup does not handle "secret" correctly when doing an IST Created: 2023-02-01  Updated: 2023-12-01

Status: Open
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.18
Fix Version/s: 10.5

Type: Bug Priority: Major
Reporter: Xan Charbonnet Assignee: Julius Goryavsky
Resolution: Unresolved Votes: 1
Labels: None
Environment: Debian x86_64



 Description   

I am rewriting this description completely, having learned a lot more. I have also changed the title of this report.

I have a Galera cluster using [sst] encrypt=3 with SST mode mariabackup. Whenever a node gracefully shuts down and then comes back up, it fails. Retrying (which by default happens automatically because of systemd) ultimately succeeds, but only after a full SST is done.

That is, IST always fails. Trying again results in an SST which succeeds.

My database is big enough that this is not really acceptable, and it doesn't seem to be the intended behavior. I narrowed it down to an error in syslog "Donor does not know my secret!".

Sure enough, in wsrep_sst_mariabackup, when we are NOT bypassing (that is, in full SST mode), there is the following:

    if [ -n "$WSREP_SST_OPT_REMOTE_PSWD" ]; then
        # Let joiner know that we know its secret
        echo "$SECRET_TAG $WSREP_SST_OPT_REMOTE_PSWD" >> "$MAGIC_FILE"
    fi

And when we ARE bypassing (that is, in IST mode) it is missing.

I've modified wsrep_sst_mariabackup to add that statement in bypass mode, just after the $MAGIC_FILE is initially written, and now my nodes can come up with a quick IST rather than a long SST.
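For illustration, the change described above can be sketched as a standalone script. This is a minimal sketch, not the actual patch: the function name write_bypass_magic_file, its arguments, the placeholder state marker, and the literal tag value "secret" are all assumptions made for the example. Only the conditional echo mirrors the lines quoted from the full SST branch.

```shell
#!/bin/sh
# Hypothetical helper that mimics what the donor writes in bypass (IST) mode.
# It is NOT part of wsrep_sst_mariabackup; names and values are illustrative.
# Arguments: <magic file> <state marker> <joiner secret, may be empty>
write_bypass_magic_file() {
    magic_file=$1
    state_marker=$2
    remote_pswd=$3
    secret_tag="secret"    # assumed value of $SECRET_TAG in the real script

    # Initial write of the magic file (placeholder content).
    echo "$state_marker" > "$magic_file"

    # The added statement: let the joiner know that we know its secret.
    if [ -n "$remote_pswd" ]; then
        echo "$secret_tag $remote_pswd" >> "$magic_file"
    fi
}

# Demo with made-up values: print the resulting magic file.
demo_file=$(mktemp)
write_bypass_magic_file "$demo_file" "example-state-id" "example-secret"
cat "$demo_file"
rm -f "$demo_file"
```

With a non-empty secret the file ends with a tag line for the joiner to compare against; with an empty secret only the initial content is written.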



 Comments   
Comment by Miika Kankare [ 2023-05-14 ]

We're having the same issue. IST completes successfully, but then it goes to the SST secret check, bails out and runs full SST.

The database is small, so running SST isn't a huge issue for us usually.

But it is blocking us from upgrading from the soon to be EOL 10.3 to a newer version as SST doesn't seem to be supported for major version upgrades.

Comment by Xan Charbonnet [ 2023-05-19 ]

Miika, you can copy the missing lines from the SST portion of wsrep_sst_mariabackup into the IST portion to make things work. That should allow you to upgrade at least.

Comment by Miika Kankare [ 2023-05-19 ]

Yup, I found the place. I'll probably do that.

I wonder if part of the problem is here:

https://github.com/MariaDB/server/blob/10.3/scripts/wsrep_sst_mariabackup.sh#L790

It's in recv_joiner() and looks like it should jump out if IST is running and wouldn't then run the check for the secret.

But that directory is created just before the function is called so I don't think it'll ever work:

https://github.com/MariaDB/server/blob/10.3/scripts/wsrep_sst_mariabackup.sh#L1361

Comment by Miika Kankare [ 2023-05-19 ]

Not sure about that, but it does look a bit weird, because we should always get past that point to run socat and actually transfer something. The check may have made sense at some point, but probably does not any more.

Anyway, after the recv_joiner() call there is a check for IST: https://github.com/MariaDB/server/blob/10.3/scripts/wsrep_sst_mariabackup.sh#L1372

I added a similar one to recv_joiner() to skip the secret check: https://github.com/MariaDB/server/blob/10.3/scripts/wsrep_sst_mariabackup.sh#L840

if [ ! -r $STATDIR/$IST_FILE -a -n "$MY_SECRET" ]; then

Now it seems to be happy running IST. Didn't try upgrading yet, but hopefully this will fix that as well.
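A rough sketch of how this workaround behaves on the joiner side, assuming the secret check compares a tag line in the magic file with the joiner's own secret. The function verify_donor_secret, its arguments, the literal tag "secret", and the file names are hypothetical; only the guard condition and the error message come from this report.

```shell
#!/bin/sh
# Hypothetical helper, not part of wsrep_sst_mariabackup.
# Returns 0 if the joiner accepts the magic file, 1 otherwise.
# Arguments: <magic file> <joiner secret> <path to IST marker file>
verify_donor_secret() {
    magic_file=$1
    my_secret=$2
    ist_marker=$3

    # Workaround: only check the secret when no IST marker is readable
    # and a secret is actually configured.
    if [ ! -r "$ist_marker" ] && [ -n "$my_secret" ]; then
        # Assumed format of the tag line: "secret <value>".
        donor_secret=$(sed -n 's/^secret //p' "$magic_file" | head -n 1)
        if [ "$donor_secret" != "$my_secret" ]; then
            echo "Donor does not know my secret!" >&2
            return 1
        fi
    fi
    return 0
}

# Demo with made-up values: the magic file has no secret line.
tmpdir=$(mktemp -d)
echo "example-state-id" > "$tmpdir/magic"
verify_donor_secret "$tmpdir/magic" "s3cret" "$tmpdir/ist_marker" \
    || echo "check failed (no IST marker)"
: > "$tmpdir/ist_marker"    # simulate the IST marker being present
verify_donor_secret "$tmpdir/magic" "s3cret" "$tmpdir/ist_marker" \
    && echo "check skipped (IST marker present)"
rm -rf "$tmpdir"
```

The first call fails because the magic file carries no secret line; the second passes because the marker short-circuits the check.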

Comment by Miika Kankare [ 2023-05-19 ]

So I'm not sure what the intended logic is here: should the secret be written for IST (Xan's fix), or should the check be skipped (my workaround above)?

Comment by Miika Kankare [ 2023-08-26 ]

I have since upgraded to 10.11, first to 10.11.3 and now to 10.11.5. It still seems to be broken.

Comment by Xan Charbonnet [ 2023-08-31 ]

Seems like IST is a pretty core feature of Galera clusters which is fairly badly broken here.

Comment by Xan Charbonnet [ 2023-12-01 ]

10.11.6 is still unable to do a mariabackup IST out of the box.
