[MDEV-11668] rpl.rpl_heartbeat_basic fails sporadically in buildbot Created: 2016-12-27 Updated: 2017-02-17 Resolved: 2017-02-17 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Tests |
| Affects Version/s: | 5.5, 10.0, 10.1, 10.2 |
| Fix Version/s: | 5.5.55, 10.0.30, 10.1.22, 10.2.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Elena Stepanova | Assignee: | Elena Stepanova |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Sprint: | 5.5.55 | ||||||||
| Description |
|
Apparently happens on slow builders.
10.1 on p8-trusty-bintar (logs not available):
|
| Comments |
| Comment by Elena Stepanova [ 2017-02-16 ] | ||||||||||||||||||||||||
|
This fragment of the test is reasonably accurate. The flow is this:
So, the idea is that if an event is fired on master every 1 second, it surely shouldn't take longer than 5 seconds between binlog updates, and thus master should not send any heartbeats. Apparently, on a very slow builder this assumption might be wrong. The logs in the initial description can be confusing, as the slave had a coincidence which is apparently irrelevant to the case. The relevant fragment of the slave error log there is this:
That is, it took exactly 30 seconds for the slave to wait for 5 updates from the master. 30 seconds coincide with the wait_condition timeout; but there is no timeout there, so the slave got its updates at the end. Naturally, if it took 30 seconds to get 5 updates, there was bound to be a more than 5 seconds delay between at least two of them, hence the heartbeat could be sent. Here is another log with the same failure: It shows that 30 seconds is irrelevant. In this case it's only 8 seconds:
8 seconds seems to be a more common case:
Master logs don't show any errors or warnings after event scheduler gets started, so there is no serious problem with the scheduling or execution, apparently just timing. | ||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2017-02-16 ] | ||||||||||||||||||||||||
|
8 seconds is actually understandable, it's the minimal value at which the problem might occur.
Slave is waiting for a>5, so it should take at least 4 seconds. It cannot be less because the events are triggered once a second, so binlog events are also written not faster than once a second. There are 4 intervals to wait, thus 1 + 1 + 1 + 1 = 4 seconds. | ||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2017-02-16 ] | ||||||||||||||||||||||||
|
The failure is reproducible on 10.x easily with this addition to the test:
The addition does not change the logic of the test, it simply delays binary logging for long enough so that the heartbeat events are triggered. The diff above is not applicable to 5.5 because it does not have binlog_commit_wait* options, but the idea is the same, it can be imitated by adding delay to binary logging code. | ||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2017-02-16 ] | ||||||||||||||||||||||||
|
We cannot make the master to write binary logs (or execute SQL) in a timely manner. What we can do is to check that they were executed in a timely manner, and if they weren't, neutralize the check on the slave. So, instead of waiting for the updates on the slave, we'll be checking 'Binlog_commits' on master. We'll wait for the same 5 updates, so it won't be longer than it is now, but we will also check how much time has passed between the updates. We can only be sure that no heartbeats should be sent if the interval was less than 5 seconds (the configured HEARTBEAT_PERIOD). If at least one update took longer, we will tell the slave to ignore the heartbeats (we cannot make it check that the heartbeat was received, because we'll run into an opposite race condition, when we think the interval was more than 5 seconds while in fact it was still slightly less, and thus no heartbeat was issued). Since such a big delay is a relatively rare event, there is no danger of losing the test coverage in buildbot. | ||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2017-02-16 ] | ||||||||||||||||||||||||
|
https://github.com/MariaDB/server/commit/b70cd26d73d727ad871c109b47a8a2645c553fd8 | ||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2017-02-17 ] | ||||||||||||||||||||||||
|
https://github.com/MariaDB/server/commit/b70cd26d73d727ad871c109b47a8a2645c553fd8 |