Example cluster with only two nodes (normally there would also be a third):
mysql1 = IP 10.0.0.2 (JOINER)
mysql2 = IP 10.0.0.3 (DONOR)
(database has to be big enough for SST to take well over three minutes)
The DONOR node was started with "galera_new_cluster". The JOINER node was started with "systemctl start mariadb.service" after deleting grastate.dat, to make sure the node needs an SST.
After SST has started, run "systemctl stop mariadb.service" on JOINER node and wait three minutes:
No errors, thus you would assume the node has been stopped successfully, right?
In reality, the SST keeps running (confirmed by "ps aux | grep mysql") until it has fully finished, which can take a long time depending on database size. Only then does the JOINER node complete its shutdown and the DONOR finally return to SYNCED state.
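One way to check this without reading "ps" output by hand is to record the server's PID before stopping the service and then poll it. The following is only a minimal sketch: the function name wait_for_exit is hypothetical, and it assumes the service's MainPID as reported by systemd is the mysqld process.

```shell
# wait_for_exit PID TIMEOUT_SECONDS  (hypothetical helper, not part of MariaDB)
# Polls once per second until the process is gone or the timeout is reached.
# Returns 0 if the process exited, 1 if it is still alive after the timeout.
wait_for_exit() {
    pid=$1
    timeout=$2
    elapsed=0
    # kill -0 sends no signal; it only tests whether the PID exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "pid $pid still running after ${elapsed}s"
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "pid $pid has exited"
    return 0
}

# Assumed usage on the JOINER (run as root, not executed here):
#   pid=$(systemctl show -p MainPID --value mariadb.service)
#   systemctl stop mariadb.service
#   wait_for_exit "$pid" 180 || echo "mysqld still alive, SST probably still running"
```

A non-zero return here is the situation described above: systemctl has returned without an error, but the server process is still alive.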
Since the shutdown did not actually complete and systemctl returned no error, the user could try running "systemctl start mariadb.service" and end up with the new mysqld process failing to start. (I didn't try this, but I assume it fails gracefully like any other time.)
Knowing that mysqld has not actually stopped, one can of course kill it manually to interrupt the SST.
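As a rough sketch of such a manual kill, the helper below sends SIGTERM first and escalates to SIGKILL after a grace period. The name force_stop is hypothetical. It assumes that the SIGTERM is deferred in the same way as the "systemctl stop" request, so in this scenario it is presumably the SIGKILL that actually interrupts the SST.

```shell
# force_stop PID GRACE_SECONDS  (hypothetical helper, not part of MariaDB)
# Sends SIGTERM, waits up to GRACE_SECONDS, then sends SIGKILL if still alive.
force_stop() {
    pid=$1
    grace=$2
    if ! kill -TERM "$pid" 2>/dev/null; then
        echo "pid $pid not running"
        return 0
    fi
    waited=0
    # kill -0 only tests for existence of the process
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid" 2>/dev/null || true
        echo "pid $pid sent SIGKILL"
    else
        echo "pid $pid exited after SIGTERM"
    fi
    return 0
}

# Assumed usage on the JOINER (run as root, not executed here):
#   force_stop "$(systemctl show -p MainPID --value mariadb.service)" 10
```

Any SST helper processes started by the server may need to be checked separately; this sketch only handles the single PID it is given.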
When I run "systemctl stop", I expect the node to be stopped (and thus the SST to be interrupted), so that the donor can return to SYNCED state ASAP. When an SST is interrupted, the joining node's database consistency doesn't really matter, as it will get a new SST anyway the next time it is started. If I were stopping a node in the middle of an IST, the current "non-interrupting" behavior could be preferable, to possibly avoid an SST on the next startup in a busy cluster. (I didn't test this, but I assume an IST currently wouldn't be interrupted either.)
Attached logs from both nodes showing what happened.
I added the line "-- waiting for SST to finish, 12 minutes in this case" for clarity. Now imagine finding the cluster at that point in the logs without knowing what has happened. You see that one node tried to do an SST and is apparently shutting down, while the donor node has formed a new quorum alone and is "doing nothing" (it is still in DONOR/DESYNCED state, although that is not obvious with the new quorum). Without checking which processes are running, it is not clear that the SST is still in progress.
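Besides the process list, the donor's state can be read from the wsrep_local_state_comment status variable. The sketch below only extracts the value from the client's tab-separated output. The function name parse_wsrep_state is hypothetical, and the usage line assumes the mysql client can log in on the donor without prompting.

```shell
# parse_wsrep_state  (hypothetical helper, not part of MariaDB)
# Reads "name<TAB>value" lines on stdin and prints the value of
# wsrep_local_state_comment, e.g. "Synced" or "Donor/Desynced".
parse_wsrep_state() {
    awk -F '\t' '$1 == "wsrep_local_state_comment" { print $2 }'
}

# Assumed usage on the DONOR (not executed here):
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | parse_wsrep_state
```

If this still prints "Donor/Desynced" long after the joiner was supposedly stopped, the SST is most likely still running.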
TL;DR: I suggest that the systemctl command should at least return a failure notice similar to the one shown when startup fails:
Thus the user would immediately know to check status, which correctly shows something has failed:
It would also be nice to get a message in logs stating what happens to the SST (or IST), e.g. "waiting until SST has finished" or "SST interrupted, node in inconsistent state, new SST required".