MariaDB Server / MDEV-11668

rpl.rpl_heartbeat_basic fails sporadically in buildbot

Details

    • 5.5.55

    Description

      Apparently happens on slow builders.
      http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/432/steps/test/logs/stdio

      rpl.rpl_heartbeat_basic 'mix'            w3 [ fail ]
              Test ended at 2016-12-24 19:36:33
       
      CURRENT_TEST: rpl.rpl_heartbeat_basic
      --- /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result	2016-12-24 05:00:48.630019948 -0500
      +++ /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject	2016-12-24 19:36:32.850019948 -0500
      @@ -245,7 +245,7 @@
       connection master;
       SET @@global.event_scheduler=1;
       connection slave;
      -Number of received heartbeat events: 0
      +Number of received heartbeat events: 1
       connection master;
       DELETE FROM t1;
       DROP EVENT e1;
       
      mysqltest: Result content mismatch
      

      10.1 on p8-trusty-bintar (logs not available):

      rpl.rpl_heartbeat_basic 'mix'            w4 [ fail ]
              Test ended at 2016-07-01 14:01:06
       
      CURRENT_TEST: rpl.rpl_heartbeat_basic
      --- /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result	2016-07-01 08:21:15.687777719 -0400
      +++ /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject	2016-07-01 14:01:05.642250919 -0400
      @@ -221,7 +221,7 @@
       CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=MASTER_PORT, MASTER_USER='root', MASTER_CONNECT_RETRY=20, MASTER_HEARTBEAT_PERIOD=5;
       include/start_slave.inc
       SET @@global.event_scheduler=1;
      -Number of received heartbeat events: 0
      +Number of received heartbeat events: 1
       DELETE FROM t1;
       DROP EVENT e1;
      

Activity

            elenst Elena Stepanova added a comment - edited

            This fragment of the test is reasonably accurate. The flow is this:

            • the slave is stopped;
            • a scheduled event is created on master. The event should update a table every 1 second;
            • the slave is reset and configured with heartbeat_period = 5 sec;
            • the slave is started;
            • event scheduler gets enabled on master;
            • slave stores the current value of slave_received_heartbeats;
            • slave waits for 5 updates to come from the master;
            • slave compares the current value of slave_received_heartbeats with the previously stored one; it should not have changed (although the test prints "Number of received heartbeat events", the printed value is in fact a flag showing whether the counter increased, so 0 is the expected result);
            • the table and event are dropped on master;
            • slave is synchronized with master;
            • slave is stopped.

            So, the idea is that if an event fires on the master every second, it surely shouldn't take longer than 5 seconds between binlog updates, and thus the master should not send any heartbeats.

            Apparently, on a very slow builder this assumption might be wrong.
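            For reference, the slave-side check in question looks roughly like the sketch below. It is reconstructed from the context lines of the diff further down; the wait condition itself is paraphrased and may differ from the real test:

              --connection slave
              let $rcvd_heartbeats_before= query_get_value(SHOW STATUS LIKE 'slave_received_heartbeats', Value, 1);
              # Wait until the scheduled event on the master has produced 5 updates
              # (paraphrased: wait for the replicated value of t1.a to exceed 5)
              --let $wait_condition= SELECT a > 5 FROM t1
              --source include/wait_condition.inc
              let $rcvd_heartbeats_after= query_get_value(SHOW STATUS LIKE 'slave_received_heartbeats', Value, 1);
              let $result= query_get_value(SELECT ($rcvd_heartbeats_after - $rcvd_heartbeats_before) > 0 AS Result, Result, 1);
              --echo Number of received heartbeat events: $result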

            The logs in the initial description can be confusing, as they contain a coincidence which is apparently irrelevant to the case. The relevant fragment of the slave error log is this:

            2016-12-24 19:35:13 17589762620768 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16020',replication started in log 'FIRST' at position 4
            2016-12-24 19:35:13 17589762227552 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2016-12-24 19:35:43 17589762227552 [Note] Error reading relay log event: slave SQL thread was killed
            2016-12-24 19:35:43 17589762227552 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422
            

            That is, it took exactly 30 seconds for the slave to receive the 5 updates from the master. 30 seconds coincides with the wait_condition timeout, but there was no timeout failure, so the slave did get its updates in the end. Naturally, if it took 30 seconds to receive 5 updates, there was bound to be a gap of more than 5 seconds between at least two of them, hence a heartbeat could be sent.

            Here is another log with the same failure:
            http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/537

            It shows that 30 seconds is irrelevant. In this case it's only 8 seconds:

            2017-02-03  8:27:32 17590239460704 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2017-02-03  8:27:32 17590239853920 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
            2017-02-03  8:27:40 17590239460704 [Note] Error reading relay log event: slave SQL thread was killed
            2017-02-03  8:27:40 17590239460704 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422
            

            8 seconds seems to be a more common case:

            2017-01-29 23:02:31 70366991580080 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2017-01-29 23:02:31 70366831279024 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
            2017-01-29 23:02:39 70366991580080 [Note] Error reading relay log event: slave SQL thread was killed
            2017-01-29 23:02:39 70366991580080 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645
            2017-01-29 23:02:39 70366831279024 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 2645
            

            2017-02-09  0:58:40 70366552489280 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2017-02-09  0:58:40 70366552882496 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16040',replication started in log 'FIRST' at position 4
            2017-02-09  0:58:48 70366552489280 [Note] Error reading relay log event: slave SQL thread was killed
            2017-02-09  0:58:48 70366552489280 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645
            

            Master logs don't show any errors or warnings after the event scheduler is started, so there is no serious problem with the scheduling or execution; apparently it's just timing.

            elenst Elena Stepanova added a comment

            8 seconds is actually understandable: it is the minimal duration at which the problem can occur.

            • 0 sec: when the scheduler is enabled, it performs the 1st update, a=1 => a=2
            • 1 sec: a=2 => a=3
            • 2 sec: a=3 => a=4
            • 3 sec: a=4 => a=5
            • 4 sec: a=5 => a=6

            The slave is waiting for a > 5, so the wait should take at least 4 seconds. It cannot be less, because the event fires once a second, so binlog events are also written no faster than once a second; there are 4 intervals to wait, thus 1 + 1 + 1 + 1 = 4 seconds.
            But to trigger the heartbeat, there must be at least one gap of 5+ seconds between two consecutive binlog events. So we get at least 5 + 1 + 1 + 1 = 8 seconds.
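            For illustration, the master-side objects assumed by this timeline presumably look something like the sketch below; the exact definitions in rpl_heartbeat_basic.test may differ in detail:

              # Hypothetical sketch of the master-side setup implied by the timeline above
              CREATE TABLE t1 (a INT);
              INSERT INTO t1 VALUES (1);
              # the scheduled event increments the counter once a second
              CREATE EVENT e1 ON SCHEDULE EVERY 1 SECOND DO UPDATE t1 SET a = a + 1;
              # nothing is written to the binlog until the scheduler is switched on
              SET @@global.event_scheduler=1;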

            elenst Elena Stepanova added a comment

            The failure is easily reproducible on 10.x with this addition to the test:

            --- /data/src/10.0/mysql-test/suite/rpl/t/rpl_heartbeat_basic.test	2016-06-14 19:03:43.615987384 +0300
            +++ /data/bld/10.0/mysql-test/suite/rpl/t/rpl_heartbeat_basic.test	2017-02-17 00:20:47.142558293 +0200
            @@ -351,6 +351,10 @@
             --source include/start_slave.inc
             --connection master
             # Enable scheduler
            +--disable_query_log
            +set global binlog_commit_wait_count=2;
            +set global binlog_commit_wait_usec=5100000;
            +--enable_query_log
             SET @@global.event_scheduler=1;
             --sync_slave_with_master
             let $rcvd_heartbeats_before= query_get_value(SHOW STATUS LIKE 'slave_received_heartbeats', Value, 1);
            @@ -361,6 +365,10 @@
             let $result= query_get_value(SELECT ($rcvd_heartbeats_after - $rcvd_heartbeats_before) > 0 AS Result, Result, 1);
             --echo Number of received heartbeat events: $result
             --connection master
            +--disable_query_log
            +set global binlog_commit_wait_count=0;
            +set global binlog_commit_wait_usec=100000;
            +--enable_query_log
             DELETE FROM t1;
             DROP EVENT e1;
             --sync_slave_with_master
            

            The addition does not change the logic of the test; it simply delays binary logging long enough for heartbeat events to be triggered.
            On a slow builder such a delay can occur naturally, so it's just a test problem.

            The diff above is not applicable to 5.5, which does not have the binlog_commit_wait* options, but the idea is the same: the delay can be imitated by adding a sleep to the binary logging code.

            elenst Elena Stepanova added a comment

            We cannot make the master write binary logs (or execute SQL) in a timely manner. What we can do is check that they were executed in a timely manner, and if they weren't, neutralize the check on the slave.

            So, instead of waiting for the updates on the slave, we'll be checking 'Binlog_commits' on the master. We'll wait for the same 5 updates, so it won't take longer than it does now, but we will also check how much time passed between the updates. We can only be sure that no heartbeats should have been sent if every interval was less than 5 seconds (the configured HEARTBEAT_PERIOD). If at least one update took longer, we will tell the slave to ignore the heartbeats (we cannot make it check that a heartbeat was received, because we would run into the opposite race condition, where we think the interval was more than 5 seconds while it was in fact still slightly less, and thus no heartbeat was issued).

            Since such a big delay is a relatively rare event, there is no danger of losing the test coverage in buildbot.
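            A minimal sketch of the approach just described is below. It assumes Binlog_commits can be polled via SHOW GLOBAL STATUS / INFORMATION_SCHEMA.GLOBAL_STATUS; the variable $skip_heartbeat_check is made up for illustration, and the actual fix is the commit linked in the next comment:

              --connection master
              let $commits_seen= query_get_value(SHOW GLOBAL STATUS LIKE 'Binlog_commits', Value, 1);
              let $updates_left= 5;
              let $skip_heartbeat_check= 0;
              while ($updates_left)
              {
                let $ts_before= `SELECT UNIX_TIMESTAMP()`;
                # wait for the next binlog commit on the master
                let $wait_condition= SELECT VARIABLE_VALUE > $commits_seen FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'BINLOG_COMMITS';
                --source include/wait_condition.inc
                let $commits_seen= query_get_value(SHOW GLOBAL STATUS LIKE 'Binlog_commits', Value, 1);
                # if the gap reached HEARTBEAT_PERIOD (5 sec), a heartbeat may
                # legitimately have been sent, so the slave-side check must be neutralized
                if (`SELECT UNIX_TIMESTAMP() - $ts_before >= 5`)
                {
                  let $skip_heartbeat_check= 1;
                }
                dec $updates_left;
              }

            On the slave side, $skip_heartbeat_check would then be used to print the expected value instead of performing the real comparison; the actual fix in the linked commit may be structured differently.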

            elenst Elena Stepanova added a comment - https://github.com/MariaDB/server/commit/b70cd26d73d727ad871c109b47a8a2645c553fd8
