[MDEV-27692] Waiting for a port before trying to connect in buildbot scripts Created: 2022-01-31 Updated: 2023-11-28 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Tests |
| Fix Version/s: | 10.4, 10.5, 10.6, 10.11, 11.0, 11.1 |
| Type: | Task | Priority: | Major |
| Reporter: | Julius Goryavsky | Assignee: | Julius Goryavsky |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
Currently, the buildbot scripts that check SST in Galera regularly produce false red cells in the test results tables. These failures are not real: the client (mysql) tries to connect to the cluster before the next server has started listening on the specified port. In addition, in the rsync SST case the scripts call kill on rsync, which leaves the cluster unhealthy, because SST crashes on some nodes when rsync is killed asynchronously and prematurely. We therefore need to add a function to the Galera buildbot scripts that waits for the server to start listening on a given port before trying to connect, and to remove the kill on rsync from the connection attempt loop. |
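As an illustration of the kind of wait described above, here is a minimal sketch in Python. It is not the code from the attached patch or the pull request; the function name `wait_for_port` and its `timeout` and `interval` parameters are hypothetical and chosen only for this example. It polls with a plain TCP connect until the port accepts connections or the deadline passes.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 0.5) -> bool:
    """Return True once a TCP connect to (host, port) succeeds.

    Returns False if the port is still not accepting connections
    after roughly `timeout` seconds. Hypothetical helper for
    illustration only, not part of the buildbot scripts.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: the server is not ready yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would run something like `wait_for_port("127.0.0.1", 3306)` before starting the client. Note that a successful connect only shows that the port is listening; it says nothing about whether SST has completed on that node.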
| Comments |
| Comment by Julius Goryavsky [ 2022-01-31 ] |
|
This change adds functionality to wait for the server to enter the listening state for a given port: wait-port.diff Pull request: https://github.com/MariaDB/mariadb.org-tools/pull/101 |
| Comment by Sergei Golubchik [ 2022-02-05 ] |
|
Do you have examples of these? |
| Comment by Elena Stepanova [ 2022-03-13 ] |
|
To reiterate what was already discussed on Slack and what Sergei's comment above refers to: we are still waiting for actual examples of the problems that this patch is trying to solve. Without them, I cannot see how this can be true.
At best, what this complicated change may achieve (I'm not sure it does, but it's possible) is earlier detection of a failed node startup, when there is no point in waiting further but the client-based check would still try to connect to the node until the timeout is exceeded. We won't make the change for this purpose: it's a positive-scenario test that is meant to pass the vast majority of the time, so the cases when it fails aren't worth optimizing.
The checks for ports being taken by other programs are pointless: this is not a general-purpose test, it's a CI test that runs in a specific controlled environment. If there are cases when the port is taken by something else, the reason for this should be fixed (but again, we need to see examples of that).
For this part, the test kills rsync before it even starts the node, so it cannot happen during SST. The only explanation I can think of that would make both claims possible (the wait ends "too early", and rsync is killed during SST) is that Galera nodes start serving DDL/DML queries before SST has finished. Then yes, the wait for the second node would finish as soon as it processed the queries, and the loop would move on, kill the remaining rsync and start the third node. But if that's what happens, it's not the check that needs to be fixed but the Galera side. |