[MDEV-8518] rpl.sec_behind_master-5114 fails sporadically in buildbot - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0(EOL), 10.1(EOL), 10.2(EOL)
Fix Version/s: 10.0.29, 10.1.21, 10.2.4
Component/s: Tests
Labels:
None

Sprint:
10.2.1-5, 10.0.29

Description

http://buildbot.askmonty.org/buildbot/builders/p8-trusty-bintar/builds/494/steps/test/logs/stdio

rpl.sec_behind_master-5114 'stmt'        w2 [ fail ]

        Test ended at 2015-07-21 06:30:48

CURRENT_TEST: rpl.sec_behind_master-5114

--- /var/lib/buildbot/maria-slave/power8-vlp04-bintar/build/mysql-test/suite/rpl/r/sec_behind_master-5114.result	2015-07-21 02:29:21.581055813 -0400

+++ /var/lib/buildbot/maria-slave/power8-vlp04-bintar/build/mysql-test/suite/rpl/r/sec_behind_master-5114.reject	2015-07-21 06:30:47.677517323 -0400

@@ -3,7 +3,7 @@

 call mtr.add_suppression("Unsafe statement written to the binary log");

 CREATE TABLE t1 (a int);

 INSERT INTO t1 VALUES(SLEEP(2));

-Seconds_Behind_Master: 1

+Seconds_Behind_Master: 2

 Warnings:

 Note	1592	Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it uses a system function that may return a different value on the slave.

 drop table t1;

mysqltest: Result content mismatch

Attachments

Issue Links

relates to

MDEV-7069 Fix buildbot failures in main server trees

Closed

MDEV-13878 rpl.sec_behind_master-5114 failed in buildbot with wrong result

Open

Activity

Ascending order - Click to sort in descending order

Elena Stepanova added a comment - 2015-09-07 00:03

Newer link: http://buildbot.askmonty.org/buildbot/builders/kvm-deb-utopic-x86/builds/762/steps/test_4/logs/stdio

Elena Stepanova added a comment - 2015-09-07 00:03 Newer link: http://buildbot.askmonty.org/buildbot/builders/kvm-deb-utopic-x86/builds/762/steps/test_4/logs/stdio

Elena Stepanova added a comment - 2017-01-01 17:01 - edited

The failure is caused by a simple race condition in the test. But it's not enough to fix it, the test itself doesn't really test the bugfix.
The test flow, if it runs normally without any delays, is this:

replication is turned on;
master executes a fast event e1 with timestamp e1_start_time;
master executes a slow (2-second long) event e2 with timestamp e2_start_time;
as soon as the second even is started executing on the master, the test starts checking Seconds_Behind_Master on slave;
as soon as Seconds_Behind_Master value is above zero, the test saves it and stops checking;
the test expects that the saved value would be 1.

The problem is that if the test runs without delays, since the event e1 is fast, we will normally have e2_start_time == e1_start_time.
When the slave executes event e1, it's fast, so Seconds_Behind_Master remains 0.
Then it starts executing e2. The bug ~~MDEV-5114~~ was that while the slave executes e2, Seconds_Behind_Master was calculated as an offset from e1_start_time, while it should be from e2_start_time.
But in this test, e1_start_time == e2_start_time, so the test would pass even without ~~MDEV-5114~~ fix.

The test is very similar to the draft suggested in ~~MDEV-5114~~ for debugging purposes, but it misses the most important component – a delay between e1 and e2 which would ensure that e1_start_time != e2_start_time, and the difference which the fix makes is visible.

Now, the race condition which makes the test fail as pasted in the description is simply that on a slow server it might happen that the first show slave status which the test uses to monitor Seconds_Behind_Master will be executed when the slave is already over 1 second within executing the slow event (sleep), the value is correct, it's just not expected.

Elena Stepanova added a comment - 2017-01-01 17:01 - edited The failure is caused by a simple race condition in the test. But it's not enough to fix it, the test itself doesn't really test the bugfix. The test flow, if it runs normally without any delays, is this: replication is turned on; master executes a fast event e1 with timestamp e1_start_time ; master executes a slow (2-second long) event e2 with timestamp e2_start_time ; as soon as the second even is started executing on the master, the test starts checking Seconds_Behind_Master on slave; as soon as Seconds_Behind_Master value is above zero, the test saves it and stops checking; the test expects that the saved value would be 1 . The problem is that if the test runs without delays, since the event e1 is fast, we will normally have e2_start_time == e1_start_time . When the slave executes event e1 , it's fast, so Seconds_Behind_Master remains 0 . Then it starts executing e2 . The bug MDEV-5114 was that while the slave executes e2 , Seconds_Behind_Master was calculated as an offset from e1_start_time , while it should be from e2_start_time . But in this test, e1_start_time == e2_start_time , so the test would pass even without MDEV-5114 fix. The test is very similar to the draft suggested in MDEV-5114 for debugging purposes, but it misses the most important component – a delay between e1 and e2 which would ensure that e1_start_time != e2_start_time , and the difference which the fix makes is visible. Now, the race condition which makes the test fail as pasted in the description is simply that on a slow server it might happen that the first show slave status which the test uses to monitor Seconds_Behind_Master will be executed when the slave is already over 1 second within executing the slow event (sleep), the value is correct, it's just not expected.

Elena Stepanova added a comment - 2017-01-01 17:37

https://github.com/MariaDB/server/commit/c4c29523bbb475d49aab9d1f5afe52a049b7b501

Elena Stepanova added a comment - 2017-01-01 17:37 https://github.com/MariaDB/server/commit/c4c29523bbb475d49aab9d1f5afe52a049b7b501

Elena Stepanova added a comment - 2017-01-02 11:16 - edited

serg,
Since it was your test initially, could you please take a quick look to see if you have any objections?

Elena Stepanova added a comment - 2017-01-02 11:16 - edited serg , Since it was your test initially, could you please take a quick look to see if you have any objections?

Sergei Golubchik added a comment - 2017-01-04 12:03

As far as I can see, the test does test the bug fix. The test succeeds with a bug fix and fails without it.
With the bug fix Seconds_Behind_Master grows steadily 0,1,2,.... Without the bug fix Seconds_Behind_Master jumps from 0 to 2, and the test verifies that Seconds_Behind_Master was 1, didn't jump over it.

If the test misses 1 on a slow server... The simple way to fix it and preserve the logic of a test would be to add some tolerance for slow servers and expect, say, 1 or 2, in this case the event should take longer than that, like, SLEEP(3) at least. I don't like increasing the test execution time, though, so let's use your fix, it seems ok. Please verify that it fails without the bug fix, though.

Sergei Golubchik added a comment - 2017-01-04 12:03 As far as I can see, the test does test the bug fix. The test succeeds with a bug fix and fails without it. With the bug fix Seconds_Behind_Master grows steadily 0,1,2,.... Without the bug fix Seconds_Behind_Master jumps from 0 to 2, and the test verifies that Seconds_Behind_Master was 1, didn't jump over it. If the test misses 1 on a slow server... The simple way to fix it and preserve the logic of a test would be to add some tolerance for slow servers and expect, say, 1 or 2, in this case the event should take longer than that, like, SLEEP(3) at least. I don't like increasing the test execution time, though, so let's use your fix, it seems ok. Please verify that it fails without the bug fix, though.

Elena Stepanova added a comment - 2017-01-04 12:42

Okay, sorry, I was wrong about the existing test not testing the fix, it is just prone to false negatives.

Here is how the normal execution goes on 10.0.17 (before ~~MDEV-5114~~ fix):

$ perl ./mtr rpl.sec_behind_master-5114

Logging: ./mtr  rpl.sec_behind_master-5114

vardir: /dev/shm/var_auto_592B

Checking leftover processes...

Removing old var directory...

Creating var directory '/dev/shm/var_auto_592B'...

Checking supported features...

MariaDB Version 10.0.17-MariaDB-debug

...

CREATE TABLE t1 (a int);

INSERT INTO t1 VALUES(SLEEP(2));

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

Seconds_Behind_Master: 2

Seconds_Behind_Master: 3

Seconds_Behind_Master: 3

Seconds_Behind_Master: 4

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

That's what the test expects, and so, it would indeed fail.

But every other time, it goes like this:

CREATE TABLE t1 (a int);

INSERT INTO t1 VALUES(SLEEP(2));

Seconds_Behind_Master: 1

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

Seconds_Behind_Master: 2

Seconds_Behind_Master: 2

Seconds_Behind_Master: 3

Seconds_Behind_Master: 3

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

Seconds_Behind_Master: 0

So, the test would pass, even though the bug is not fixed yet.

The first raise of Seconds_Behind_Master apparently comes from CREATE TABLE which the proposed test change also addresses by synchronizing with master after CREATE TABLE – so yes, I'll stick with it, because increasing the sleep time is not very reliable anyway, some builders can be extremely slow.

Elena Stepanova added a comment - 2017-01-04 12:42 Okay, sorry, I was wrong about the existing test not testing the fix, it is just prone to false negatives. Here is how the normal execution goes on 10.0.17 (before MDEV-5114 fix): $ perl ./mtr rpl.sec_behind_master-5114 Logging: ./mtr rpl.sec_behind_master-5114 vardir: /dev/shm/var_auto_592B Checking leftover processes... Removing old var directory... Creating var directory '/dev/shm/var_auto_592B'... Checking supported features... MariaDB Version 10.0.17-MariaDB-debug ... CREATE TABLE t1 (a int); INSERT INTO t1 VALUES(SLEEP(2)); Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 2 Seconds_Behind_Master: 3 Seconds_Behind_Master: 3 Seconds_Behind_Master: 4 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 That's what the test expects, and so, it would indeed fail. But every other time, it goes like this: CREATE TABLE t1 (a int); INSERT INTO t1 VALUES(SLEEP(2)); Seconds_Behind_Master: 1 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 2 Seconds_Behind_Master: 2 Seconds_Behind_Master: 3 Seconds_Behind_Master: 3 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 Seconds_Behind_Master: 0 So, the test would pass, even though the bug is not fixed yet. The first raise of Seconds_Behind_Master apparently comes from CREATE TABLE which the proposed test change also addresses by synchronizing with master after CREATE TABLE – so yes, I'll stick with it, because increasing the sleep time is not very reliable anyway, some builders can be extremely slow.

Elena Stepanova added a comment - 2017-01-04 12:46

The new version of the test fails before the patch like this:

rpl.sec_behind_master-5114-new 'stmt'    [ fail ]

        Test ended at 2017-01-04 14:38:54

CURRENT_TEST: rpl.sec_behind_master-5114-new

--- /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.result	2017-01-01 19:35:04.276244732 +0200

+++ /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.reject	2017-01-04 14:38:53.921677316 +0200

@@ -4,7 +4,7 @@

 CREATE TABLE t1 (a int);

 INSERT INTO t1 VALUES(SLEEP(2));

 Seconds_Behind_Master_is_less_than_100

-1

+0

 Warnings:

 Note	1592	Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it uses a system function that may return a different value on the slave.

 drop table t1;

mysqltest: Result content mismatch

Elena Stepanova added a comment - 2017-01-04 12:46 The new version of the test fails before the patch like this: rpl.sec_behind_master-5114-new 'stmt' [ fail ] Test ended at 2017-01-04 14:38:54 CURRENT_TEST: rpl.sec_behind_master-5114-new --- /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.result 2017-01-01 19:35:04.276244732 +0200 +++ /data/src/10.0-bug/mysql-test/suite/rpl/r/sec_behind_master-5114-new.reject 2017-01-04 14:38:53.921677316 +0200 @@ -4,7 +4,7 @@ CREATE TABLE t1 (a int); INSERT INTO t1 VALUES(SLEEP(2)); Seconds_Behind_Master_is_less_than_100 -1 +0 Warnings: Note 1592 Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it uses a system function that may return a different value on the slave. drop table t1; mysqltest: Result content mismatch

Elena Stepanova added a comment - 2017-01-04 12:52

https://github.com/MariaDB/server/commit/9bf92706d19761722b46d66a671734466cb6e98e

Elena Stepanova added a comment - 2017-01-04 12:52 https://github.com/MariaDB/server/commit/9bf92706d19761722b46d66a671734466cb6e98e

People

Assignee:: Elena Stepanova

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2015-07-21 21:56

Updated:: 2020-11-04 12:59

Resolved:: 2017-01-04 12:52

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server