[MDEV-27692] Waiting for a port before trying to connect in buildbot scripts - Jira

Details

Type: Task
Status: Stalled
Priority: Major
Resolution: Unresolved
Fix Version/s: 10.5, 10.6, 10.11
Component/s: Tests
Labels: None

Description

Currently, the buildbot scripts that check SST in Galera regularly give us false red cells in the test results tables with non-existent failures due to the fact that the client (mysql) tries to join the cluster before the next server has passed to the listening state on the specified port number. In the case of rsync SST, the scripts also call kill on rsync, resulting in an unhealthy cluster, since SST crashes on some nodes when rsync is asynchronously killed prematurely.

Therefore, we need to add a function to the Galera buildbot scripts that waits for the server to enter the listening state on a given port number before trying to connect, and to remove the kill on rsync from the connection attempt loop.
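
A minimal sketch of the kind of port wait proposed above, written in Python for illustration only. It is not the code from wait-port.diff; the function name `wait_for_port` and its `timeout` and `interval` parameters are hypothetical. It polls a TCP port until a connection is accepted or the timeout expires.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connect to (host, port) succeeds, False on timeout.

    Hypothetical helper: it only proves that something accepts connections
    on the port, not that the server can already execute queries.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means the port is in the listening state.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused, timed out, or host unreachable: retry.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would use something like `wait_for_port("127.0.0.1", 3306, timeout=60)` before starting the client; the host, port, and timeout here are placeholders.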

Attachments

wait-port.diff
2022-01-31 13:11
6 kB
Julius Goryavsky

Issue Links

split from: MDEV-27524 Incorrect binlogs after Galera SST using rsync and mariabackup (Closed)

links to: Waiting for a port before trying to connect in buildbot scripts #101

Activity

Julius Goryavsky added a comment - 2022-01-31 13:14 - edited

This change adds functionality to wait for the server to enter the listening state for a given port: wait-port.diff

Pull request: https://github.com/MariaDB/mariadb.org-tools/pull/101

Sergei Golubchik added a comment - 2022-02-05 20:25

Do you have examples of these "false red cells in the test results tables with non-existent failures due to the fact that the client (mysql) tries to join the cluster before the next server has passed to the listening state on the specified port number"?

Elena Stepanova added a comment - 2022-03-13 23:31

To reiterate what was already discussed on Slack and what Sergei's comment above refers to: we are still waiting for actual examples of the problems which this patch is trying to solve. Without that, I cannot see how this claim, "the client (mysql) tries to join the cluster before the next server has passed to the listening state on the specified port number", can be true.
The client doesn't "join the cluster", I assume you mean it tries to connect to a node.
Yes, that's exactly what it does, and the check is so primitive on purpose. The test waits till the client can actually perform operations on the newly joined node. Not when the port is available or other low-level stages have been passed, but when a client can quite literally connect to the node and execute queries. It cannot happen before the node started listening on the port, as the client won't even connect; and, unlike low-level checks for port and pid, it isn't prone to a race condition when/if the server is already up and technically listening, but can't perform operations yet.
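
As an illustration of the client-based check described here (this is not the actual buildbot script), the wait can be sketched as a retry loop around a real query. The helper names `wait_until` and `can_run_query`, the `SELECT 1` probe, and the reliance on the client's default credentials are assumptions made for this sketch.

```python
import subprocess
import time


def wait_until(probe, timeout=60.0, interval=1.0):
    """Call probe() until it returns True; return the attempt count, or 0 on timeout."""
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        if probe():
            return attempts
        if time.monotonic() >= deadline:
            return 0
        time.sleep(interval)


def can_run_query(host, port):
    """Return True if the mysql client can connect over TCP and execute a query.

    Assumes a `mysql` client on PATH and credentials from its default config.
    """
    try:
        result = subprocess.run(
            ["mysql", "--protocol=tcp", f"--host={host}", f"--port={port}",
             "--execute=SELECT 1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Client missing or hung: treat as "not ready yet".
        return False
    return result.returncode == 0
```

For example, `wait_until(lambda: can_run_query("127.0.0.1", 3306))` returns a non-zero attempt count only after a query has actually succeeded, so it cannot report success before the node is listening on the port.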

At best, what this complicated change may achieve (I'm not sure it does, but it's possible) is earlier detection of a failed node startup: when there is no point in waiting further, but the client-based check would still try to connect to the node until the timeout is exceeded. We won't make the change for this purpose, because it's a positive scenario test that is meant to pass the vast majority of the time, so the cases when it fails aren't worth optimizing.

The checks for ports being taken by other programs are pointless – it is not a general-purpose test, it's a CI test which is run in the specific controlled environment. If there are cases when the port is taken by something else, the reason for this should be fixed (but again, we need to see examples of that).

As for this part, "rsync is asynchronously killed prematurely": the test kills rsync before it even starts the node. So, it cannot happen during SST.

The only explanation I can think of which would make both claims possible (the wait ends "too early", and rsync is killed during SST) is that Galera nodes start serving DDL/DML queries before SST has finished. Then yes, the wait for the second node would finish as soon as it processed the queries, and the loop would then move on, kill the remaining rsync and start the third node. But if that's what happens, it's not the check that needs to be fixed but the Galera side.
