Description
I am rewriting this description completely, having learned a lot more. I have also changed the title of this report.
I have a Galera cluster using [sst] encrypt=3 with SST mode mariabackup. Whenever a node gracefully shuts down and then comes back up, it fails. Retrying (which by default happens automatically because of systemd) ultimately succeeds, but only after a full SST is done.
That is, IST always fails. Trying again results in an SST which succeeds.
My database is big enough that this is not really acceptable, and it doesn't seem to be the intended behavior. I narrowed it down to an error in syslog "Donor does not know my secret!".
Sure enough, in wsrep_sst_mariabackup, when we are NOT bypassing (that is, in full SST mode), there is the following:
if [ -n "$WSREP_SST_OPT_REMOTE_PSWD" ]; then
    # Let joiner know that we know its secret
    echo "$SECRET_TAG $WSREP_SST_OPT_REMOTE_PSWD" >> "$MAGIC_FILE"
fi
When we ARE bypassing (that is, in IST mode), this block is missing, so the secret is never appended to $MAGIC_FILE.
I've modified wsrep_sst_mariabackup to add that statement in bypass mode, just after the $MAGIC_FILE is initially written, and now my nodes can come up with a quick IST rather than a long SST.
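For anyone wanting to try the same workaround, here is a minimal sketch of the idea, not the actual patch. The helper name write_magic_file, the value of SECRET_TAG, and the placeholder state line are my assumptions for illustration; only MAGIC_FILE, SECRET_TAG and WSREP_SST_OPT_REMOTE_PSWD are names taken from the script excerpt above. The point is simply that the secret line is appended right after the magic file is first written, on the bypass path as well as the full SST path.

```shell
#!/bin/sh
# Sketch only: write_magic_file is a hypothetical helper, not a function
# from the shipped wsrep_sst_mariabackup script.

SECRET_TAG="secret"   # assumed value; the real script defines its own tag

# Write the donor state line, then append the joiner's secret if one was given.
write_magic_file() {
    state_line=$1
    echo "$state_line" > "$MAGIC_FILE"
    if [ -n "$WSREP_SST_OPT_REMOTE_PSWD" ]; then
        # Let joiner know that we know its secret
        echo "$SECRET_TAG $WSREP_SST_OPT_REMOTE_PSWD" >> "$MAGIC_FILE"
    fi
}

# Demo with placeholder values written to a temporary file.
MAGIC_FILE=$(mktemp)
WSREP_SST_OPT_REMOTE_PSWD="example-secret"
write_magic_file "00000000-0000-0000-0000-000000000000:0 0"
cat "$MAGIC_FILE"
```

Calling the same helper from both the SST branch and the bypass branch would keep the two paths from drifting apart again, which appears to be the direction the later centralized MAGIC_FILE handling takes.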
Issue Links
- duplicates MDEV-32344 IST "Donor does not know my secret" with ssl-mode=VERIFY_CA (Closed)
It looks like wsrep_sst_mariabackup in 10.11.7 centralizes the logic for handling the MAGIC_FILE. From what I can tell from reading the code, this should fix the issue, but I haven't tried it yet.