Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8518

rpl.sec_behind_master-5114 fails sporadically in buildbot

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0(EOL), 10.1(EOL), 10.2(EOL)
    • 10.0.29, 10.1.21, 10.2.4
    • Tests
    • None
    • 10.2.1-5, 10.0.29

    Description

      http://buildbot.askmonty.org/buildbot/builders/p8-trusty-bintar/builds/494/steps/test/logs/stdio

      rpl.sec_behind_master-5114 'stmt'        w2 [ fail ]
              Test ended at 2015-07-21 06:30:48
       
      CURRENT_TEST: rpl.sec_behind_master-5114
      --- /var/lib/buildbot/maria-slave/power8-vlp04-bintar/build/mysql-test/suite/rpl/r/sec_behind_master-5114.result	2015-07-21 02:29:21.581055813 -0400
      +++ /var/lib/buildbot/maria-slave/power8-vlp04-bintar/build/mysql-test/suite/rpl/r/sec_behind_master-5114.reject	2015-07-21 06:30:47.677517323 -0400
      @@ -3,7 +3,7 @@
       call mtr.add_suppression("Unsafe statement written to the binary log");
       CREATE TABLE t1 (a int);
       INSERT INTO t1 VALUES(SLEEP(2));
      -Seconds_Behind_Master: 1
      +Seconds_Behind_Master: 2
       Warnings:
       Note	1592	Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it uses a system function that may return a different value on the slave.
       drop table t1;
       
      mysqltest: Result content mismatch

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova added a comment - Newer link: http://buildbot.askmonty.org/buildbot/builders/kvm-deb-utopic-x86/builds/762/steps/test_4/logs/stdio
            elenst Elena Stepanova added a comment - - edited

            The failure is caused by a simple race condition in the test. But it's not enough to fix it, the test itself doesn't really test the bugfix.
            The test flow, if it runs normally without any delays, is this:

            • replication is turned on;
            • master executes a fast event e1 with timestamp e1_start_time;
            • master executes a slow (2-second long) event e2 with timestamp e2_start_time;
            • as soon as the second even is started executing on the master, the test starts checking Seconds_Behind_Master on slave;
            • as soon as Seconds_Behind_Master value is above zero, the test saves it and stops checking;
            • the test expects that the saved value would be 1.

            The problem is that if the test runs without delays, since the event e1 is fast, we will normally have e2_start_time == e1_start_time.
            When the slave executes event e1, it's fast, so Seconds_Behind_Master remains 0.
            Then it starts executing e2. The bug MDEV-5114 was that while the slave executes e2, Seconds_Behind_Master was calculated as an offset from e1_start_time, while it should be from e2_start_time.
            But in this test, e1_start_time == e2_start_time, so the test would pass even without MDEV-5114 fix.

            The test is very similar to the draft suggested in MDEV-5114 for debugging purposes, but it misses the most important component – a delay between e1 and e2 which would ensure that e1_start_time != e2_start_time, and the difference which the fix makes is visible.

            Now, the race condition which makes the test fail as pasted in the description is simply that on a slow server it might happen that the first show slave status which the test uses to monitor Seconds_Behind_Master will be executed when the slave is already over 1 second within executing the slow event (sleep), the value is correct, it's just not expected.

            elenst Elena Stepanova added a comment - - edited The failure is caused by a simple race condition in the test. But it's not enough to fix it, the test itself doesn't really test the bugfix. The test flow, if it runs normally without any delays, is this: replication is turned on; master executes a fast event e1 with timestamp e1_start_time ; master executes a slow (2-second long) event e2 with timestamp e2_start_time ; as soon as the second even is started executing on the master, the test starts checking Seconds_Behind_Master on slave; as soon as Seconds_Behind_Master value is above zero, the test saves it and stops checking; the test expects that the saved value would be 1 . The problem is that if the test runs without delays, since the event e1 is fast, we will normally have e2_start_time == e1_start_time . When the slave executes event e1 , it's fast, so Seconds_Behind_Master remains 0 . Then it starts executing e2 . The bug MDEV-5114 was that while the slave executes e2 , Seconds_Behind_Master was calculated as an offset from e1_start_time , while it should be from e2_start_time . But in this test, e1_start_time == e2_start_time , so the test would pass even without MDEV-5114 fix. The test is very similar to the draft suggested in MDEV-5114 for debugging purposes, but it misses the most important component – a delay between e1 and e2 which would ensure that e1_start_time != e2_start_time , and the difference which the fix makes is visible. Now, the race condition which makes the test fail as pasted in the description is simply that on a slow server it might happen that the first show slave status which the test uses to monitor Seconds_Behind_Master will be executed when the slave is already over 1 second within executing the slow event (sleep), the value is correct, it's just not expected.
            elenst Elena Stepanova added a comment - https://github.com/MariaDB/server/commit/c4c29523bbb475d49aab9d1f5afe52a049b7b501
            elenst Elena Stepanova added a comment - - edited

            serg,
            Since it was your test initially, could you please take a quick look to see if you have any objections?

            elenst Elena Stepanova added a comment - - edited serg , Since it was your test initially, could you please take a quick look to see if you have any objections?

            As far as I can see, the test does test the bug fix. The test succeeds with a bug fix and fails without it.
            With the bug fix Seconds_Behind_Master grows steadily 0,1,2,.... Without the bug fix Seconds_Behind_Master jumps from 0 to 2, and the test verifies that Seconds_Behind_Master was 1, didn't jump over it.

            If the test misses 1 on a slow server... The simple way to fix it and preserve the logic of a test would be to add some tolerance for slow servers and expect, say, 1 or 2, in this case the event should take longer than that, like, SLEEP(3) at least. I don't like increasing the test execution time, though, so let's use your fix, it seems ok. Please verify that it fails without the bug fix, though.

            serg Sergei Golubchik added a comment - As far as I can see, the test does test the bug fix. The test succeeds with a bug fix and fails without it. With the bug fix Seconds_Behind_Master grows steadily 0,1,2,.... Without the bug fix Seconds_Behind_Master jumps from 0 to 2, and the test verifies that Seconds_Behind_Master was 1, didn't jump over it. If the test misses 1 on a slow server... The simple way to fix it and preserve the logic of a test would be to add some tolerance for slow servers and expect, say, 1 or 2, in this case the event should take longer than that, like, SLEEP(3) at least. I don't like increasing the test execution time, though, so let's use your fix, it seems ok. Please verify that it fails without the bug fix, though.

            Okay, sorry, I was wrong about the existing test not testing the fix, it is just prone to false negatives.

            Here is how the normal execution goes on 10.0.17 (before MDEV-5114 fix):

            $ perl ./mtr rpl.sec_behind_master-5114 
            Logging: ./mtr  rpl.sec_behind_master-5114
            vardir: /dev/shm/var_auto_592B
            Checking leftover processes...
            Removing old var directory...
            Creating var directory '/dev/shm/var_auto_592B'...
            Checking supported features...
            MariaDB Version 10.0.17-MariaDB-debug
            ...
            

            CREATE TABLE t1 (a int);
            INSERT INTO t1 VALUES(SLEEP(2));
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 2
            Seconds_Behind_Master: 3
            Seconds_Behind_Master: 3
            Seconds_Behind_Master: 4
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            

            That's what the test expects, and so, it would indeed fail.

            But every other time, it goes like this:

            CREATE TABLE t1 (a int);
            INSERT INTO t1 VALUES(SLEEP(2));
            Seconds_Behind_Master: 1
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 2
            Seconds_Behind_Master: 2
            Seconds_Behind_Master: 3
            Seconds_Behind_Master: 3
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            Seconds_Behind_Master: 0
            

            So, the test would pass, even though the bug is not fixed yet.

            The first raise of Seconds_Behind_Master apparently comes from CREATE TABLE which the proposed test change also addresses by synchronizing with master after CREATE TABLE – so yes, I'll stick with it, because increasing the sleep time is not very reliable anyway, some builders can be extremely slow.

            elenst Elena Stepanova added a comment - Okay, sorry, I was wrong about the existing test not testing the fix, it is just prone to false negatives. Here is how the normal execution goes on 10.0.17 (before MDEV-5114 fix): $ perl ./mtr rpl.sec_behind_master-5114 Logging: ./mtr rpl.sec_behind_master-5114 vardir: /dev/shm/var_auto_592B Checking leftover processes... Removing old var directory... Creating var directory '/dev/shm/var_auto_592B'... Checking supported features... MariaDB Version 10.0.17-MariaDB-debug ... CREATE TABLE t1 (a int); INSERT INTO t1 VALUES(SLEEP(2)); Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 2 Seconds_Behind_Master: 3 Seconds_Behind_Master: 3 Seconds_Behind_Master: 4 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 That's what the test expects, and so, it would indeed fail. But every other time, it goes like this: CREATE TABLE t1 (a int); INSERT INTO t1 VALUES(SLEEP(2)); Seconds_Behind_Master: 1 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 2 Seconds_Behind_Master: 2 Seconds_Behind_Master: 3 Seconds_Behind_Master: 3 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 So, the test would pass, even though the bug is not fixed yet. The first raise of Seconds_Behind_Master apparently comes from CREATE TABLE which the proposed test change also addresses by synchronizing with master after CREATE TABLE – so yes, I'll stick with it, because increasing the sleep time is not very reliable anyway, some builders can be extremely slow.

            The new version of the test fails before the patch like this:

            rpl.sec_behind_master-5114-new 'stmt'    [ fail ]
                    Test ended at 2017-01-04 14:38:54
             
            CURRENT_TEST: rpl.sec_behind_master-5114-new
            --- /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.result	2017-01-01 19:35:04.276244732 +0200
            +++ /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.reject	2017-01-04 14:38:53.921677316 +0200
            @@ -4,7 +4,7 @@
             CREATE TABLE t1 (a int);
             INSERT INTO t1 VALUES(SLEEP(2));
             Seconds_Behind_Master_is_less_than_100
            -1
            +0
             Warnings:
             Note	1592	Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it uses a system function that may return a different value on the slave.
             drop table t1;
             
            mysqltest: Result content mismatch
            

            elenst Elena Stepanova added a comment - The new version of the test fails before the patch like this: rpl.sec_behind_master-5114-new 'stmt' [ fail ] Test ended at 2017-01-04 14:38:54   CURRENT_TEST: rpl.sec_behind_master-5114-new --- /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.result 2017-01-01 19:35:04.276244732 +0200 +++ /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.reject 2017-01-04 14:38:53.921677316 +0200 @@ -4,7 +4,7 @@ CREATE TABLE t1 (a int); INSERT INTO t1 VALUES(SLEEP(2)); Seconds_Behind_Master_is_less_than_100 -1 +0 Warnings: Note 1592 Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it uses a system function that may return a different value on the slave. drop table t1;   mysqltest: Result content mismatch
            elenst Elena Stepanova added a comment - https://github.com/MariaDB/server/commit/9bf92706d19761722b46d66a671734466cb6e98e

            People

              elenst Elena Stepanova
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.