MariaDB Server / MDEV-11668

rpl.rpl_heartbeat_basic fails sporadically in buildbot

Details

    • 5.5.55

    Description

      Apparently happens on slow builders.
      http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/432/steps/test/logs/stdio

      rpl.rpl_heartbeat_basic 'mix'            w3 [ fail ]
              Test ended at 2016-12-24 19:36:33
       
      CURRENT_TEST: rpl.rpl_heartbeat_basic
      --- /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result	2016-12-24 05:00:48.630019948 -0500
      +++ /home/buildbot/maria-slave/p8-rhel6-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject	2016-12-24 19:36:32.850019948 -0500
      @@ -245,7 +245,7 @@
       connection master;
       SET @@global.event_scheduler=1;
       connection slave;
      -Number of received heartbeat events: 0
      +Number of received heartbeat events: 1
       connection master;
       DELETE FROM t1;
       DROP EVENT e1;
       
      mysqltest: Result content mismatch
      

      10.1 on p8-trusty-bintar (logs not available):

      rpl.rpl_heartbeat_basic 'mix'            w4 [ fail ]
              Test ended at 2016-07-01 14:01:06
       
      CURRENT_TEST: rpl.rpl_heartbeat_basic
      --- /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.result	2016-07-01 08:21:15.687777719 -0400
      +++ /var/lib/buildbot/maria-slave/p8-trusty-bintar/build/mysql-test/suite/rpl/r/rpl_heartbeat_basic.reject	2016-07-01 14:01:05.642250919 -0400
      @@ -221,7 +221,7 @@
       CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=MASTER_PORT, MASTER_USER='root', MASTER_CONNECT_RETRY=20, MASTER_HEARTBEAT_PERIOD=5;
       include/start_slave.inc
       SET @@global.event_scheduler=1;
      -Number of received heartbeat events: 0
      +Number of received heartbeat events: 1
       DELETE FROM t1;
       DROP EVENT e1;
      

Activity

            elenst Elena Stepanova added a comment - edited

            This fragment of the test is reasonably accurate. The flow is this:

            • the slave is stopped;
            • a scheduled event is created on master. The event should update a table every 1 second;
            • the slave is reset and configured with heartbeat_period = 5 sec;
            • the slave is started;
            • event scheduler gets enabled on master;
            • slave stores the current value of slave_received_heartbeats;
            • slave waits for 5 updates to come from the master;
            • slave compares the current value of slave_received_heartbeats with the previously stored one; it should not have changed (although the test prints "Number of received heartbeat events", the printed value is in fact a flag showing whether the counter increased, so 0 is the expected result);
            • the table and event are dropped on master;
            • slave is synchronized with master;
            • slave is stopped.

            So, the idea is that if an event fires on the master every second, it surely shouldn't take longer than 5 seconds between binlog updates, and thus the master should not send any heartbeats.

            Apparently, on a very slow builder this assumption might be wrong.
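            For reference, the slave-side check in question looks roughly like the sketch below. It is reconstructed from the context lines of the diff further down; the wait condition itself is paraphrased and may differ from the real test:

              --connection slave
              let $rcvd_heartbeats_before= query_get_value(SHOW STATUS LIKE 'slave_received_heartbeats', Value, 1);
              # Wait until the scheduled event on the master has produced 5 updates
              # (paraphrased: wait for the replicated value of t1.a to exceed 5)
              --let $wait_condition= SELECT a > 5 FROM t1
              --source include/wait_condition.inc
              let $rcvd_heartbeats_after= query_get_value(SHOW STATUS LIKE 'slave_received_heartbeats', Value, 1);
              let $result= query_get_value(SELECT ($rcvd_heartbeats_after - $rcvd_heartbeats_before) > 0 AS Result, Result, 1);
              --echo Number of received heartbeat events: $result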

            The logs in the initial description can be confusing, as they contain a coincidence which is apparently irrelevant to the case. The relevant fragment of the slave error log is this:

            2016-12-24 19:35:13 17589762620768 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16020',replication started in log 'FIRST' at position 4
            2016-12-24 19:35:13 17589762227552 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2016-12-24 19:35:43 17589762227552 [Note] Error reading relay log event: slave SQL thread was killed
            2016-12-24 19:35:43 17589762227552 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422
            

            That is, it took exactly 30 seconds for the slave to receive the 5 updates from the master. 30 seconds coincides with the wait_condition timeout, but there was no timeout failure, so the slave did get its updates in the end. Naturally, if it took 30 seconds to receive 5 updates, there was bound to be a gap of more than 5 seconds between at least two of them, hence a heartbeat could be sent.

            Here is another log with the same failure:
            http://buildbot.askmonty.org/buildbot/builders/p8-rhel6-bintar/builds/537

            It shows that 30 seconds is irrelevant. In this case it's only 8 seconds:

            2017-02-03  8:27:32 17590239460704 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2017-02-03  8:27:32 17590239853920 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
            2017-02-03  8:27:40 17590239460704 [Note] Error reading relay log event: slave SQL thread was killed
            2017-02-03  8:27:40 17590239460704 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2422
            

            8 seconds seems to be a more common case:

            2017-01-29 23:02:31 70366991580080 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2017-01-29 23:02:31 70366831279024 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16000',replication started in log 'FIRST' at position 4
            2017-01-29 23:02:39 70366991580080 [Note] Error reading relay log event: slave SQL thread was killed
            2017-01-29 23:02:39 70366991580080 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645
            2017-01-29 23:02:39 70366831279024 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 2645
            

            2017-02-09  0:58:40 70366552489280 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
            2017-02-09  0:58:40 70366552882496 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:16040',replication started in log 'FIRST' at position 4
            2017-02-09  0:58:48 70366552489280 [Note] Error reading relay log event: slave SQL thread was killed
            2017-02-09  0:58:48 70366552489280 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 2645
            

            Master logs don't show any errors or warnings after the event scheduler is started, so there is no serious problem with the scheduling or execution; apparently it's just timing.

            elenst Elena Stepanova added a comment

            8 seconds is actually understandable: it is the minimal duration at which the problem can occur.

            • 0 sec: when the scheduler is enabled, it performs the 1st update, a=1 => a=2
            • 1 sec: a=2 => a=3
            • 2 sec: a=3 => a=4
            • 3 sec: a=4 => a=5
            • 4 sec: a=5 => a=6

            The slave is waiting for a > 5, so the wait should take at least 4 seconds. It cannot be less, because the event fires once a second, so binlog events are also written no faster than once a second; there are 4 intervals to wait, thus 1 + 1 + 1 + 1 = 4 seconds.
            But to trigger the heartbeat, there must be at least one gap of 5+ seconds between two consecutive binlog events. So we get at least 5 + 1 + 1 + 1 = 8 seconds.
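            For illustration, the master-side objects assumed by this timeline presumably look something like the sketch below; the exact definitions in rpl_heartbeat_basic.test may differ in detail:

              # Hypothetical sketch of the master-side setup implied by the timeline above
              CREATE TABLE t1 (a INT);
              INSERT INTO t1 VALUES (1);
              # the scheduled event increments the counter once a second
              CREATE EVENT e1 ON SCHEDULE EVERY 1 SECOND DO UPDATE t1 SET a = a + 1;
              # nothing is written to the binlog until the scheduler is switched on
              SET @@global.event_scheduler=1;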

            elenst Elena Stepanova added a comment

            The failure is easily reproducible on 10.x with this addition to the test:

            --- /data/src/10.0/mysql-test/suite/rpl/t/rpl_heartbeat_basic.test	2016-06-14 19:03:43.615987384 +0300
            +++ /data/bld/10.0/mysql-test/suite/rpl/t/rpl_heartbeat_basic.test	2017-02-17 00:20:47.142558293 +0200
            @@ -351,6 +351,10 @@
             --source include/start_slave.inc
             --connection master
             # Enable scheduler
            +--disable_query_log
            +set global binlog_commit_wait_count=2;
            +set global binlog_commit_wait_usec=5100000;
            +--enable_query_log
             SET @@global.event_scheduler=1;
             --sync_slave_with_master
             let $rcvd_heartbeats_before= query_get_value(SHOW STATUS LIKE 'slave_received_heartbeats', Value, 1);
            @@ -361,6 +365,10 @@
             let $result= query_get_value(SELECT ($rcvd_heartbeats_after - $rcvd_heartbeats_before) > 0 AS Result, Result, 1);
             --echo Number of received heartbeat events: $result
             --connection master
            +--disable_query_log
            +set global binlog_commit_wait_count=0;
            +set global binlog_commit_wait_usec=100000;
            +--enable_query_log
             DELETE FROM t1;
             DROP EVENT e1;
             --sync_slave_with_master
            

            The addition does not change the logic of the test; it simply delays binary logging long enough for heartbeat events to be triggered.
            On a slow builder such a delay can occur naturally, so it's just a test problem.

            The diff above is not applicable to 5.5, which does not have the binlog_commit_wait* options, but the idea is the same: the delay can be imitated by adding a sleep to the binary logging code.

            elenst Elena Stepanova added a comment

            We cannot make the master write binary logs (or execute SQL) in a timely manner. What we can do is check that they were executed in a timely manner, and if they weren't, neutralize the check on the slave.

            So, instead of waiting for the updates on the slave, we'll be checking 'Binlog_commits' on the master. We'll wait for the same 5 updates, so it won't take longer than it does now, but we will also check how much time passed between the updates. We can only be sure that no heartbeats should have been sent if every interval was less than 5 seconds (the configured HEARTBEAT_PERIOD). If at least one update took longer, we will tell the slave to ignore the heartbeats (we cannot make it check that a heartbeat was received, because we would run into the opposite race condition, where we think the interval was more than 5 seconds while it was in fact still slightly less, and thus no heartbeat was issued).

            Since such a big delay is a relatively rare event, there is no danger of losing the test coverage in buildbot.
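            A minimal sketch of the approach just described is below. It assumes Binlog_commits can be polled via SHOW GLOBAL STATUS / INFORMATION_SCHEMA.GLOBAL_STATUS; the variable $skip_heartbeat_check is made up for illustration, and the actual fix is the commit linked in the next comment:

              --connection master
              let $commits_seen= query_get_value(SHOW GLOBAL STATUS LIKE 'Binlog_commits', Value, 1);
              let $updates_left= 5;
              let $skip_heartbeat_check= 0;
              while ($updates_left)
              {
                let $ts_before= `SELECT UNIX_TIMESTAMP()`;
                # wait for the next binlog commit on the master
                let $wait_condition= SELECT VARIABLE_VALUE > $commits_seen FROM INFORMATION_SCHEMA.GLOBAL_STATUS WHERE VARIABLE_NAME = 'BINLOG_COMMITS';
                --source include/wait_condition.inc
                let $commits_seen= query_get_value(SHOW GLOBAL STATUS LIKE 'Binlog_commits', Value, 1);
                # if the gap reached HEARTBEAT_PERIOD (5 sec), a heartbeat may
                # legitimately have been sent, so the slave-side check must be neutralized
                if (`SELECT UNIX_TIMESTAMP() - $ts_before >= 5`)
                {
                  let $skip_heartbeat_check= 1;
                }
                dec $updates_left;
              }

            On the slave side, $skip_heartbeat_check would then be used to print the expected value instead of performing the real comparison; the actual fix in the linked commit may be structured differently.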

            elenst Elena Stepanova added a comment - https://github.com/MariaDB/server/commit/b70cd26d73d727ad871c109b47a8a2645c553fd8
