  1. MariaDB Server
  2. MDEV-27692

Waiting for a port before trying to connect in buildbot scripts

Details

    Description

      Currently, the buildbot scripts that check SST in Galera regularly produce false red cells in the test results tables, reporting failures that do not exist. This happens because the client (mysql) tries to join the cluster before the next server has passed to the listening state on the specified port number. In addition, in the rsync SST case the connection attempt loop calls kill on rsync, which leaves the cluster unhealthy: SST crashes on some nodes when rsync is asynchronously killed prematurely. We therefore need to add a function to the Galera buildbot scripts that waits for the server to enter the listening state on a given port number before trying to connect, and to remove the kill on rsync from the connection attempt loop.
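
      As an illustration of the kind of wait the description asks for, here is a minimal Python sketch. It is not the content of wait-port.diff, and the buildbot scripts themselves are not shown in this ticket; the function name wait_for_port and its parameters are hypothetical. It only checks that something accepts TCP connections on the port, which says nothing about whether the server can already execute queries.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper: name, defaults and behaviour are assumptions for illustration.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means some process is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

      A caller would run this before the first connection attempt and treat a False result as a failed node startup.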

          Activity

            sysprg Julius Goryavsky added a comment (edited)

            This change adds functionality to wait for the server to enter the listening state for a given port: wait-port.diff

            Pull request: https://github.com/MariaDB/mariadb.org-tools/pull/101


            serg Sergei Golubchik added a comment

            Do you have examples of these "false red cells in the test results tables with non-existent failures due to the fact that the client (mysql) tries to join the cluster before the next server has passed to the listening state on the specified port number"?

            elenst Elena Stepanova added a comment

            To reiterate what was already discussed on Slack and what Sergei's comment above refers to: we are still waiting for actual examples of the problems that this patch is trying to solve. Without them, I cannot see how this

            "the client (mysql) tries to join the cluster before the next server has passed to the listening state on the specified port number"

            can be true.
            The client doesn't "join the cluster"; I assume you mean that it tries to connect to a node.
            Yes, that's exactly what it does, and the check is so primitive on purpose. The test waits until the client can actually perform operations on the newly joined node: not when the port becomes available or other low-level stages have been passed, but when a client can quite literally connect to the node and execute queries. That cannot happen before the node has started listening on the port, as the client won't even connect; and, unlike low-level checks for port and pid, it isn't prone to a race condition when the server is already up and technically listening but can't perform operations yet.

            At best, what this complicated change may achieve (I'm not sure it does, but it's possible) is earlier detection of a failed node startup, when there is no point in waiting further but the client-based check would still keep trying to connect to the node until the timeout is exceeded. We won't make the change for this purpose: it's a positive scenario test that is meant to pass the vast majority of the time, so the cases when it fails aren't worth optimizing.

            The checks for ports being taken by other programs are pointless. This is not a general-purpose test; it's a CI test that runs in a specific controlled environment. If there are cases when the port is taken by something else, the reason for that should be fixed (but again, we need to see examples).

            For this part

            "rsync is asynchronously killed prematurely."

            the test kills rsync before it even starts the node. So, it cannot happen during SST.

            The only explanation I can think of that would make both claims possible (the wait ends "too early", and rsync is killed during SST) is that Galera nodes start serving DDL/DML queries before SST has finished. Then yes, the wait for the second node would finish as soon as it processed the queries, and the loop would move on, kill the remaining rsync and start the third node. But if that's what happens, it's not the check that needs to be fixed but the Galera side.
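
            For reference, the client-based wait described in the comment above (retry until a client can connect and execute a query, or until a timeout) might look roughly like the following Python sketch. This is not the actual buildbot script, which is not shown in this ticket; the names wait_until and client_can_query, the port number and the timeouts are all hypothetical.

```python
import subprocess
import time
from typing import Callable, Sequence


def wait_until(check: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call check() repeatedly; return True on first success, False once timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def client_can_query(command: Sequence[str]) -> bool:
    """Run a client command; True only if it starts and exits with status 0."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Client binary missing, or the command hung: treat as "not ready yet".
        return False
    return result.returncode == 0


# Hypothetical usage; the port number is an assumption, not taken from the ticket:
# cmd = ["mysql", "--protocol=tcp", "--host=127.0.0.1", "--port=8307", "--execute=SELECT 1"]
# ready = wait_until(lambda: client_can_query(cmd), timeout=120.0)
```

            Because the check only succeeds when a query actually runs, it cannot pass before the node is listening, which is the point made in the comment.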

            People

              sysprg Julius Goryavsky