Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33965

Additions to main.status are non-deterministic and inefficient (and fail in buildbot)

    XMLWordPrintable

Details

    • Bug
    • Status: Confirmed (View Workflow)
    • Major
    • Resolution: Unresolved
    • N/A
    • 11.7, 11.8
    • Tests
    • None

    Description

      In the scope of MDEV-33145, the following fragment was added to main.status test:

      connection default;
       
      --echo # Test Uptime_since_flush_status
      --echo # All results below should be 1
       
      flush global status;
      flush session status;
       
      --sleep 1
       
      let $global_flush_status= `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME LIKE 'Uptime_since_flush_status'`;
      let $local_flush_status= `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.SESSION_STATUS WHERE VARIABLE_NAME LIKE 'Uptime_since_flush_status'`;
       
      --sleep 2
       
      --disable_query_log
      --eval select $global_flush_status >= $local_flush_status as "1"
      flush session status;
       
      let $new_global_flush_status= `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME LIKE 'Uptime_since_flush_status'`;
      let $new_local_flush_status= `SELECT VARIABLE_VALUE FROM NFORMATION_SCHEMA.SESSION_STATUS WHERE VARIABLE_NAME LIKE 'Uptime_since_flush_status'`;
       
      --eval select $new_global_flush_status >= $global_flush_status as "1"
      --eval select $new_local_flush_status < $global_flush_status as "1"
       
      --sleep 2
      flush global status;
       
      let $local_flush_status=$new_local_flush_status;
      let $new_global_flush_status= `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME LIKE 'Uptime_since_flush_status'`;
      let $new_local_flush_status= `SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.SESSION_STATUS WHERE VARIABLE_NAME LIKE 'Uptime_since_flush_status'`;
       
      --eval select $new_local_flush_status  >= $local_flush_status as "1"
      --eval select $new_global_flush_status < $new_local_flush_status as "1"
       
      --enable_query_log
      --echo #
      --echo # end of 11.5 tests
      --echo #
      

      There are several problems with it.

      Most importantly, it's non-deterministic. In just a few builds in buildbots, it has already failed at least 4 times in buildbots and once (after only a couple attempts) on my machine. It's going to be an additional problem in the future to our already bad situation with unstable tests.

      Starting from the beginning:

      • after the first two flush-es (local and global) and the first sleep 1, global Uptime_since_flush_status and session Uptime_since_flush_status (I will further call them "global uptime" and "session uptime") can be anything starting from 1 – on slow builders either both or the second one can be greater than 1.
      • after the next sleep 2 the previously saved global uptime is checked to be greater or equal than the previously saved session uptime. It's already an unreasonable assumption. The global value was stored first, and it can very well be that on a slow builder the session one is bigger than the global one. It happened in one of test runs.
      • then session status is flushed, and both the global and the session uptime are stored again. Since there was no global flush, the new global value can be anything starting from 3, while the new session one can be anything at all.
      • the next check is that the new global uptime is greater or equal than the previous global uptime, which is a safe assumption, and it hasn't failed yet.
      • the next check is that the new session uptime is less than the previous global uptime, which is again unreasonable. There is no direct relation between them. Apparently the idea is that we waited 1 second after the global flush, but we didn't wait after the second session flush. But of course it's not a guarantee, the test could very well take its time between the second session flush and the variable assignment, and then the condition will be false. It has already happened in buildbot, too.
      • then we have another sleep 2, and a global flush;
      • then we check that the new session uptime is greater or equal than the previous session uptime. This is a safe bet.
      • then we check that the new global uptime (the one after the last global flush) is strictly less than the new local uptime. It is once again unreliable, although to a lesser extent than the previous checks. Still, while the new global uptime cannot be greater than the new local uptime, in some unfortunate circumstances they can be equal, specifically if in the previous block the test took really long time (2+ seconds) between the session flush and the variable assignment. It hasn't happened yet, but it might.

      All in all, as it often happens with sleep-based tests, there are several points of failure and unnecessary additional slowness without any certainty in the outcome.

      Also, the test fragment is made so that it's very difficult to interpret a failed result.
      It selects a boolean value in the usual form of (0|1), and also names it as "1". Every time, in 5 checks in a row.
      So, when a failure happens, it looks like this:

      CURRENT_TEST: main.status
      --- /mnt8t/bld/bb-11.5-MDEV-33145-flush-global-status-valgrind/mysql-test/main/status.result	2024-04-22 23:54:08.917996732 +0300
      +++ /mnt8t/bld/bb-11.5-MDEV-33145-flush-global-status-valgrind/mysql-test/main/status.reject	2024-04-23 00:55:06.003356222 +0300
      @@ -476,7 +476,7 @@
       1
       1
       1
      -1
      +0
       1
       1
       1
      

      To understand what actually failed, one needs to find the lines in the result file by the starting number, and then count the "1" lines to match them with the checks to figure which one went wrong. There is no good reason for this, instead, each check result could have been given a distinct name – even if it's just "check 1", "check 2", etc, it would have been much more convenient.

      Examples of failures in buildbot:
      https://buildbot.mariadb.org/#/builders/551/builds/12940
      https://buildbot.mariadb.org/#/builders/619/builds/2655
      https://buildbot.mariadb.org/#/builders/644/builds/282
      https://buildbot.mariadb.net/buildbot/builders/kvm-debug-asan-ubsan/builds/2539

      Attachments

        Issue Links

          Activity

            People

              monty Michael Widenius
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.