[MDEV-11668] rpl.rpl_heartbeat_basic fails sporadically in buildbot Created: 2016-12-27  Updated: 2017-02-17  Resolved: 2017-02-17

Status: Closed
Project: MariaDB Server
Component/s: Tests
Affects Version/s: 5.5, 10.0, 10.1, 10.2
Fix Version/s: 5.5.55, 10.0.30, 10.1.22, 10.2.5

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Elena Stepanova
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-7069 Fix buildbot failures in main server ... Stalled
Sprint: 5.5.55

 Description   

Apparently happens on slow builders.
http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/432/steps/test/logs/stdio

rpl.rpl_heartbeat_basic 'mix'            w3 [ fail ]
        Test ended at 2016-12-24 19:36:33
 
CURRENT_TEST: rpl.rpl_heartbeat_basic
--- /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result	2016-12-24 05:00:48.630019948 -0500
+++ /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject	2016-12-24 19:36:32.850019948 -0500
@@ -245,7 +245,7 @@
 connection master;
 SET @@global.event_scheduler=1;
 connection slave;
-Number of received heartbeat events: 0
+Number of received heartbeat events: 1
 connection master;
 DELETE FROM t1;
 DROP EVENT e1;
 
mysqltest: Result content mismatch

10.1 on p8-trusty-bintar (logs not available):

rpl.rpl_heartbeat_basic 'mix'            w4 [ fail ]
        Test ended at 2016-07-01 14:01:06
 
CURRENT_TEST: rpl.rpl_heartbeat_basic
--- /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result	2016-07-01 08:21:15.687777719 -0400
+++ /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject	2016-07-01 14:01:05.642250919 -0400
@@ -221,7 +221,7 @@
 CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=MASTER_PORT, MASTER_USER='root', MASTER_CONNECT_RETRY=20, MASTER_HEARTBEAT_PERIOD=5;
 include/start_slave.inc
 SET @@global.event_scheduler=1;
-Number of received heartbeat events: 0
+Number of received heartbeat events: 1
 DELETE FROM t1;
 DROP EVENT e1;



 Comments   
Comment by Elena Stepanova [ 2017-02-16 ]

This fragment of the test is reasonably straightforward. The flow is as follows:

  • the slave is stopped;
  • a scheduled event is created on the master; the event should update a table every second;
  • the slave is reset and configured with heartbeat_period = 5 sec;
  • the slave is started;
  • event scheduler gets enabled on master;
  • slave stores the current value of slave_received_heartbeats;
  • slave waits for 5 updates to come from the master;
  • slave compares the current value of slave_received_heartbeats with the previously stored one; it should not have changed (although the test prints "Number of received heartbeat events", the printed value is actually whether the difference between the new and old values is greater than zero, so 0 is the expected result);
  • the table and event are dropped on master;
  • slave is synchronized with master;
  • slave is stopped;

So, the idea is that if an event fires on the master every second, it surely shouldn't take longer than 5 seconds between binlog updates, and thus the master should not send any heartbeats.
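The test's assumption can be sketched as a simplified model (not the actual server code; a master sends a heartbeat only when no binlog event has been written for the whole heartbeat period):

```python
def count_heartbeats(event_times, heartbeat_period):
    """Count heartbeats a master would send between binlog events.

    Simplified model: one heartbeat is sent for every full
    heartbeat_period that elapses with no new binlog event.
    """
    heartbeats = 0
    for prev, curr in zip(event_times, event_times[1:]):
        heartbeats += int((curr - prev) // heartbeat_period)
    return heartbeats

# Events fire every second, as the test assumes: no heartbeats sent.
print(count_heartbeats([0, 1, 2, 3, 4], 5))   # 0
# On a slow builder one gap stretches to 6 seconds: a heartbeat slips in.
print(count_heartbeats([0, 1, 7, 8, 9], 5))   # 1
```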

Apparently, on a very slow builder this assumption might be wrong.

The logs in the initial description can be confusing, because the slave log contains a coincidence which is apparently irrelevant to the case. The relevant fragment of the slave error log there is this:

2016-12-24 19:35:13 17589762620768 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16020',replication started in log 'FIRST' at position 4
2016-12-24 19:35:13 17589762227552 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2016-12-24 19:35:43 17589762227552 [Note] Error reading relay log event: slave SQL thread was killed
2016-12-24 19:35:43 17589762227552 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422

That is, it took the slave exactly 30 seconds to receive 5 updates from the master. 30 seconds coincides with the wait_condition timeout; but no timeout actually occurred there, so the slave did get its updates in the end. Naturally, if it took 30 seconds to get 5 updates, there was bound to be a delay of more than 5 seconds between at least two of them, hence a heartbeat could be sent.
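A quick pigeonhole check (using the 30 seconds from this log as an illustrative total) confirms that some gap had to exceed the 5-second heartbeat period:

```python
# 5 updates span 4 intervals.  If the whole wait took 30 seconds,
# the longest interval is at least the average, 30 / 4 = 7.5 seconds,
# which is already above the 5-second heartbeat period.
total_seconds = 30
intervals = 5 - 1
print(total_seconds / intervals)      # 7.5
print(total_seconds / intervals > 5)  # True
```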

Here is another log with the same failure:
http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/537

It shows that 30 seconds is irrelevant. In this case it's only 8 seconds:

2017-02-03  8:27:32 17590239460704 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2017-02-03  8:27:32 17590239853920 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
2017-02-03  8:27:40 17590239460704 [Note] Error reading relay log event: slave SQL thread was killed
2017-02-03  8:27:40 17590239460704 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422

8 seconds seems to be a more common case:

2017-01-29 23:02:31 70366991580080 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2017-01-29 23:02:31 70366831279024 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
2017-01-29 23:02:39 70366991580080 [Note] Error reading relay log event: slave SQL thread was killed
2017-01-29 23:02:39 70366991580080 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645
2017-01-29 23:02:39 70366831279024 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 2645

2017-02-09  0:58:40 70366552489280 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2017-02-09  0:58:40 70366552882496 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16040',replication started in log 'FIRST' at position 4
2017-02-09  0:58:48 70366552489280 [Note] Error reading relay log event: slave SQL thread was killed
2017-02-09  0:58:48 70366552489280 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645

Master logs don't show any errors or warnings after event scheduler gets started, so there is no serious problem with the scheduling or execution, apparently just timing.

Comment by Elena Stepanova [ 2017-02-16 ]

8 seconds is actually understandable: it is the minimal value at which the problem can occur.

  • 0 sec: when scheduler is enabled, it performs 1st update, a=1 => a=2
  • 1 sec: a=2 => a=3
  • 2 sec: a=3 => a=4
  • 3 sec: a=4 => a=5
  • 4 sec: a=5 => a=6

The slave is waiting for a>5, so it should take at least 4 seconds. It cannot take less, because the events are triggered once a second, so binlog events are also written no faster than once per second. There are 4 intervals to wait, thus 1 + 1 + 1 + 1 = 4 seconds.
But to trigger a heartbeat, at least one gap between two consecutive binlog events must be 5+ seconds. So, in total we will have at least 5 + 1 + 1 + 1 = 8 seconds.
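The arithmetic above can be checked directly (a toy calculation mirroring the reasoning, not anything from the test itself):

```python
# Normal case: 4 one-second intervals between the 5 updates.
normal_total = 1 + 1 + 1 + 1
# Minimal failure case: at least one of the 4 intervals must reach the
# 5-second heartbeat period for a heartbeat to be sent at all.
minimal_failure_total = 5 + 1 + 1 + 1
print(normal_total)           # 4
print(minimal_failure_total)  # 8
```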

Comment by Elena Stepanova [ 2017-02-16 ]

The failure is easily reproducible on 10.x with this addition to the test:

--- /data/src/10.0/mysql-test/suite/rpl/t/rpl_heartbeat_basic.test	2016-06-14 19:03:43.615987384 +0300
+++ /data/bld/10.0/mysql-test/suite/rpl/t/rpl_heartbeat_basic.test	2017-02-17 00:20:47.142558293 +0200
@@ -351,6 +351,10 @@
 --source include/start_slave.inc
 --connection master
 # Enable scheduler
+--disable_query_log
+set global binlog_commit_wait_count=2;
+set global binlog_commit_wait_usec=5100000;
+--enable_query_log
 SET @@global.event_scheduler=1;
 --sync_slave_with_master
 let $rcvd_heartbeats_before= query_get_value(SHOW STATUS LIKE 'slave_received_heartbeats', Value, 1);
@@ -361,6 +365,10 @@
 let $result= query_get_value(SELECT ($rcvd_heartbeats_after - $rcvd_heartbeats_before) > 0 AS Result, Result, 1);
 --echo Number of received heartbeat events: $result
 --connection master
+--disable_query_log
+set global binlog_commit_wait_count=0;
+set global binlog_commit_wait_usec=100000;
+--enable_query_log
 DELETE FROM t1;
 DROP EVENT e1;
 --sync_slave_with_master

The addition does not change the logic of the test; it simply delays binary logging long enough for the heartbeat events to be triggered.
On a slow builder, such a delay can occur naturally, so it's just a test problem.

The diff above is not applicable to 5.5, which does not have the binlog_commit_wait* options, but the idea is the same: the delay can be imitated by adding a sleep to the binary logging code.

Comment by Elena Stepanova [ 2017-02-16 ]

We cannot make the master write binary logs (or execute SQL) in a timely manner. What we can do is check whether they were written in a timely manner, and if they weren't, neutralize the check on the slave.

So, instead of waiting for the updates on the slave, we will check 'Binlog_commits' on the master. We will wait for the same 5 updates, so the test won't take longer than it does now, but we will also check how much time passed between the updates. We can only be sure that no heartbeats should have been sent if every interval was less than 5 seconds (the configured MASTER_HEARTBEAT_PERIOD). If at least one update took longer, we will tell the slave to ignore the heartbeats. (We cannot make it check that a heartbeat was received, because we would run into the opposite race condition: we might decide the interval was more than 5 seconds while in fact it was still slightly less, and thus no heartbeat was issued.)
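The decision rule can be sketched like this (a hypothetical model of the fix, not the actual patch; the real test is written in mysqltest and reads 'Binlog_commits' on the master):

```python
def slave_check_is_valid(commit_times, heartbeat_period=5):
    """Decide whether the slave-side heartbeat check is meaningful.

    If every interval between consecutive master binlog commits was
    shorter than the heartbeat period, no heartbeat should have been
    sent, and asserting '0 heartbeats received' on the slave is valid.
    Otherwise the check must be neutralized: a heartbeat MAY have been
    sent, but we cannot assert that it WAS (the opposite race).
    """
    return all(later - earlier < heartbeat_period
               for earlier, later in zip(commit_times, commit_times[1:]))

print(slave_check_is_valid([0, 1, 2, 3, 4]))   # True: enforce the check
print(slave_check_is_valid([0, 1, 7, 8, 9]))   # False: ignore heartbeats
```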

Since such a big delay is a relatively rare event, there is no danger of losing the test coverage in buildbot.

Comment by Elena Stepanova [ 2017-02-16 ]

https://github.com/MariaDB/server/commit/b70cd26d73d727ad871c109b47a8a2645c553fd8

Comment by Elena Stepanova [ 2017-02-17 ]

https://github.com/MariaDB/server/commit/b70cd26d73d727ad871c109b47a8a2645c553fd8

Generated at Thu Feb 08 07:51:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.