A default run of the mysql-test-run script takes a long time (tens of minutes to many hours, depending on the build and the computer configuration).
But most of that time is wasted on running the wrong tests. Some tests are related: if one of them fails, the others will fail too. Some tests almost never fail, while others fail much more often. Some tests exercise a specific part of the source code, and if that part isn't changed in a particular revision, these tests will almost certainly not fail.
I would like to know which tests are most useful to run for each revision on each test platform. In my experiments, one can catch 90% of the problems by running only 10% of the tests.
This doesn't mean we will always run only 10% of the tests. It would still make sense to run the complete test suite before releases or on specific builders. But many builders can test much faster with only a small reduction in test coverage.
There is no need to run tests on many platforms for this. We have many years of historical data from the buildbot. It records which revisions were tested on which builders, which files were modified in each revision, which tests failed where, and so on. One can use this data to analyze and select the best test-running strategy.
The goal is to run as few tests as possible while still detecting as many test failures as possible.
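To compare strategies against this goal, one needs some way to score a selection on the historical data. The following is only a minimal sketch under assumptions: the `RunRecord` fields, the `evaluate` function and the test names are all invented for illustration and do not reflect the actual buildbot schema. It measures which fraction of recorded failures a selection would have caught, and which fraction of the total test time it would have cost.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    # Hypothetical record layout; the real buildbot data will differ.
    revision: str
    builder: str
    test: str
    failed: bool
    seconds: float

def evaluate(history, selected):
    """Return (recall, time_fraction) if only `selected` tests had been run."""
    total_failures = sum(1 for r in history if r.failed)
    caught = sum(1 for r in history if r.failed and r.test in selected)
    total_time = sum(r.seconds for r in history)
    spent = sum(r.seconds for r in history if r.test in selected)
    recall = caught / total_failures if total_failures else 1.0
    time_fraction = spent / total_time if total_time else 0.0
    return recall, time_fraction

# Made-up history, for illustration only.
history = [
    RunRecord("r1", "b1", "main.select", False, 2.0),
    RunRecord("r1", "b1", "innodb.big", True, 6.0),
    RunRecord("r2", "b1", "main.select", False, 2.0),
    RunRecord("r2", "b1", "innodb.big", False, 6.0),
    RunRecord("r2", "b1", "rpl.sync", True, 4.0),
]
recall, time_fraction = evaluate(history, {"innodb.big"})
```

A good strategy pushes the first number up while keeping the second one down.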
- probability of a test to fail
- depending on the builder, on the combination
- depending on the changed files, changed lines/functions/etc
- inter-test correlations
- individual tests within a big test file
- what to do when a new builder/test/combination is added? we don't have prior probabilities yet
- don't use all the data, instead use a sliding window — the failure rates may change over time
- average over different combinations or builders
- or don't average and treat triplets (test,combination,builder) as individual "tests"
- optimize for time, not for the number of tests: different builders run at different speeds, and different tests take different amounts of time too
- emulate the filter bubble (ignore failures that were not predicted), and have a solution to break it
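
A few of the ideas above (a sliding window, optimizing for time rather than test count, and breaking the filter bubble) can be combined in one rough sketch. This is not a proposed implementation: `select_tests`, its parameters, the tuple layout and the test names are assumptions made up for illustration. It ranks tests by failures per second inside the window, fills most of a time budget greedily, and then spends what is left on randomly chosen other tests so that tests with no recent failures still get run occasionally. It does not handle new tests with no history, and it ignores builders, combinations and changed files.

```python
import random
from collections import defaultdict

def select_tests(history, window, budget_seconds, explore_fraction=0.1, seed=0):
    """history: iterable of (run_index, test, failed, seconds) tuples."""
    latest = max(run for run, _, _, _ in history)
    runs = defaultdict(int)
    failures = defaultdict(int)
    seconds = defaultdict(float)
    for run, test, failed, secs in history:
        if run > latest - window:  # sliding window: skip older records
            runs[test] += 1
            failures[test] += int(failed)
            seconds[test] += secs

    mean_time = {t: seconds[t] / runs[t] for t in runs}
    # Observed failures per second of test time within the window.
    score = {t: (failures[t] / runs[t]) / max(mean_time[t], 1e-9) for t in runs}

    # Greedy part: best-scoring tests that have failed at least once.
    exploit_budget = budget_seconds * (1 - explore_fraction)
    chosen, spent = [], 0.0
    for test in sorted(score, key=lambda t: (-score[t], t)):
        if score[test] > 0 and spent + mean_time[test] <= exploit_budget:
            chosen.append(test)
            spent += mean_time[test]

    # Exploration part: random other tests, to break the filter bubble.
    rest = [t for t in sorted(runs) if t not in chosen]
    random.Random(seed).shuffle(rest)
    for test in rest:
        if spent + mean_time[test] <= budget_seconds:
            chosen.append(test)
            spent += mean_time[test]
    return chosen

# Made-up history, for illustration only.
history = [
    (1, "a", True, 1.0),   # falls outside a window of 2 runs
    (2, "a", False, 1.0), (2, "b", True, 4.0), (2, "c", False, 1.0),
    (3, "a", False, 1.0), (3, "b", True, 4.0), (3, "c", True, 2.0),
    (3, "d", False, 1.0),
]
chosen = select_tests(history, window=2, budget_seconds=5.0, explore_fraction=0.2)
```

With this data the test `c` is picked first, because it has the highest failures-per-second score that fits the budget; `b` fails more often but is too slow to fit, and the remaining time goes to `a` and `d`.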