To reiterate what was already discussed on Slack and what Sergei's comment above refers to: we are still waiting for actual examples of the problems which this patch is trying to solve. Without them, I cannot see how this
the client (mysql) tries to join the cluster before the next server has passed to the listening state on the specified port number
can be true.
The client doesn't "join the cluster"; I assume you mean it tries to connect to a node.
Yes, that's exactly what it does, and the check is so primitive on purpose. The test waits until the client can actually perform operations on the newly joined node. Not when the port is available or other low-level stages have been passed, but when a client can quite literally connect to the node and execute queries. That cannot happen before the node has started listening on the port, as the client won't even connect; and, unlike low-level checks for port and pid, it isn't prone to a race condition when/if the server is already up and technically listening, but can't perform operations yet.
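To make the difference between the two kinds of check concrete, here is a minimal sketch in Python. It is not the actual test code: the function names (`port_is_listening`, `client_can_query`, `wait_until`) are made up for illustration, and the sketch assumes a `mysql` client binary on `PATH` that can log in without extra credentials.

```python
import socket
import subprocess
import time
from typing import Callable


def port_is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Low-level check: true as soon as anything accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def client_can_query(host: str, port: int) -> bool:
    """Client-level check: true only if a query actually executes.

    Assumption: the `mysql` client is installed and needs no password here.
    """
    try:
        result = subprocess.run(
            ["mysql", "--protocol=tcp", f"--host={host}", f"--port={port}",
             "--execute=SELECT 1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def wait_until(probe: Callable[[], bool], timeout: float,
               interval: float = 1.0) -> bool:
    """Poll `probe` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With these helpers, the existing behaviour corresponds roughly to `wait_until(lambda: client_can_query("127.0.0.1", 16002), timeout=60)` (the port number and timeout are placeholders). `port_is_listening` can return true while the server still cannot serve queries, which is the race mentioned above; `client_can_query` cannot return true before the port is open.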
At best, what this complicated change may achieve (I'm not sure it does, but it's possible) is earlier detection of a failed node startup, when there is no point in waiting further but the client-based check would still try to connect to the node until the timeout is exceeded. We won't make the change for this purpose: it's a positive-scenario test, meant to pass the vast majority of the time, so the cases when it fails aren't worth optimizing.
The checks for ports being taken by other programs are pointless: this is not a general-purpose test, it's a CI test which is run in a specific controlled environment. If there are cases when the port is taken by something else, the cause should be fixed (but again, we need to see examples of that).
For this part
rsync is asynchronously killed prematurely.
the test kills rsync before it even starts the node. So, the kill cannot happen during SST.
The only explanation I can think of which would make both claims possible (the wait ends "too early", and rsync is killed during SST) is that Galera nodes start serving DDL/DML queries before SST has finished. Then yes, the wait for the second node would finish as soon as it processed the queries, and the loop would move on, kill the remaining rsync and start the third node. But if that's what happens, it's not the check that needs to be fixed but the Galera side.
This change adds functionality to wait for the server to enter the listening state for a given port: wait-port.diff
Pull request: https://github.com/MariaDB/mariadb.org-tools/pull/101