Details
Description
Apparently happens on slow builders.
http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/432/steps/test/logs/stdio
rpl.rpl_heartbeat_basic 'mix' w3 [ fail ]
Test ended at 2016-12-24 19:36:33

CURRENT_TEST: rpl.rpl_heartbeat_basic
--- /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result 2016-12-24 05:00:48.630019948 -0500
+++ /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject 2016-12-24 19:36:32.850019948 -0500
@@ -245,7 +245,7 @@
 connection master;
 SET @@global.event_scheduler=1;
 connection slave;
-Number of received heartbeat events: 0
+Number of received heartbeat events: 1
 connection master;
 DELETE FROM t1;
 DROP EVENT e1;

mysqltest: Result content mismatch
|
10.1 on p8-trusty-bintar (logs not available):
rpl.rpl_heartbeat_basic 'mix' w4 [ fail ]
Test ended at 2016-07-01 14:01:06

CURRENT_TEST: rpl.rpl_heartbeat_basic
--- /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result 2016-07-01 08:21:15.687777719 -0400
+++ /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject 2016-07-01 14:01:05.642250919 -0400
@@ -221,7 +221,7 @@
 CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=MASTER_PORT, MASTER_USER='root', MASTER_CONNECT_RETRY=20, MASTER_HEARTBEAT_PERIOD=5;
 include/start_slave.inc
 SET @@global.event_scheduler=1;
-Number of received heartbeat events: 0
+Number of received heartbeat events: 1
 DELETE FROM t1;
 DROP EVENT e1;
|
Attachments
Issue Links
relates to MDEV-7069 Fix buildbot failures in main server trees (Closed)
This fragment of the test is reasonably accurate. The flow is this: an event fires on the master every 1 second, so the gap between binlog updates should never exceed 5 seconds, and thus the master should not send any heartbeats. Apparently, on a very slow builder this assumption does not hold.
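The master-side rule described above can be sketched in a few lines of Python. This is a simplified model, not the actual server code: it assumes the master sends one heartbeat whenever the gap between consecutive binlog events exceeds MASTER_HEARTBEAT_PERIOD (in reality a long idle gap can produce several heartbeats, one per elapsed period).

```python
def count_heartbeats(event_times, heartbeat_period):
    """Count gaps between consecutive binlog events that exceed the
    heartbeat period; each such gap triggers (at least) one heartbeat.
    Simplified model: one heartbeat per oversized gap."""
    gaps = [b - a for a, b in zip(event_times, event_times[1:])]
    return sum(1 for g in gaps if g > heartbeat_period)

# Events fired every 1 second: no gap exceeds the 5-second period,
# so the expected heartbeat count is 0.
assert count_heartbeats([0, 1, 2, 3, 4, 5], 5) == 0

# A single 6-second stall on a slow builder: one heartbeat is sent,
# which is exactly the 0 -> 1 mismatch seen in the .reject files.
assert count_heartbeats([0, 1, 7, 8], 5) == 1
```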
The logs in the initial description can be confusing, as the slave log contains a coincidence which is apparently irrelevant to the case. The relevant fragment of the slave error log is this:
2016-12-24 19:35:13 17589762620768 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16020',replication started in log 'FIRST' at position 4
2016-12-24 19:35:13 17589762227552 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2016-12-24 19:35:43 17589762227552 [Note] Error reading relay log event: slave SQL thread was killed
2016-12-24 19:35:43 17589762227552 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422
That is, it took exactly 30 seconds for the slave to receive 5 updates from the master. 30 seconds coincides with the wait_condition timeout; but there was no timeout here, so the slave did get its updates in the end. Naturally, if it took 30 seconds to receive 5 updates, there was bound to be a gap of more than 5 seconds between at least two of them, hence a heartbeat could be sent.
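The pigeonhole argument above is easy to check numerically: 5 updates spread over 30 seconds give an average gap of 6 seconds, which already exceeds the 5-second heartbeat period, so at least one individual gap must have exceeded it. A quick sketch (the numbers come from the error log and the test above):

```python
total_wait = 30.0       # seconds the slave waited (from the error log)
updates = 5             # updates the test waits for
heartbeat_period = 5.0  # MASTER_HEARTBEAT_PERIOD used by the test

# Average gap between consecutive updates; if even the average exceeds
# the heartbeat period, some individual gap certainly did.
avg_gap = total_wait / updates
assert avg_gap > heartbeat_period  # 6.0 > 5.0: a heartbeat was guaranteed
```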
Here is another log with the same failure:
http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/537
It shows that the 30-second figure is irrelevant; in this case it is only 8 seconds:
2017-02-03 8:27:32 17590239460704 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2017-02-03 8:27:32 17590239853920 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
2017-02-03 8:27:40 17590239460704 [Note] Error reading relay log event: slave SQL thread was killed
2017-02-03 8:27:40 17590239460704 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422
8 seconds seems to be a more common case:
2017-01-29 23:02:31 70366991580080 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2017-01-29 23:02:31 70366831279024 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
2017-01-29 23:02:39 70366991580080 [Note] Error reading relay log event: slave SQL thread was killed
2017-01-29 23:02:39 70366991580080 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645
2017-01-29 23:02:39 70366831279024 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 2645
2017-02-09 0:58:40 70366552489280 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2017-02-09 0:58:40 70366552882496 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16040',replication started in log 'FIRST' at position 4
2017-02-09 0:58:48 70366552489280 [Note] Error reading relay log event: slave SQL thread was killed
2017-02-09 0:58:48 70366552489280 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645
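In the 8-second runs the pigeonhole argument no longer applies: the average gap is only 8 / 5 = 1.6 seconds, well under the 5-second period. A spurious heartbeat then requires an actual stall of more than 5 seconds somewhere in the run, which still fits comfortably inside 8 seconds on a slow builder:

```python
total_wait = 8.0        # seconds from the 8-second error logs
updates = 5             # updates the test waits for
heartbeat_period = 5.0  # MASTER_HEARTBEAT_PERIOD used by the test

avg_gap = total_wait / updates
# The average gap is under the period, so no heartbeat is *guaranteed*...
assert avg_gap < heartbeat_period
# ...but a single 5+ second scheduling stall still fits inside the run,
# so a heartbeat remains possible.
assert total_wait > heartbeat_period
```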
The master logs show no errors or warnings after the event scheduler is started, so there is no serious problem with event scheduling or execution; apparently it is just timing.