[MDEV-31790] Extremely slow tests rpl.rpl_non_direct_mixed_mixing_engines and rpl.rpl_stm_mixing_engines
Created: 2023-07-28  Updated: 2023-11-28

| Status:            | Open |
| Project:           | MariaDB Server |
| Component/s:       | Replication, Tests |
| Affects Version/s: | 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2 |
| Fix Version/s:     | 10.4, 10.5, 10.6, 10.11, 11.0, 11.1, 11.2 |
| Type:              | Bug |
| Priority:          | Major |
| Reporter:          | Marko Mäkelä |
| Assignee:          | Andrei Elkin |
| Resolution:        | Unresolved |
| Votes:             | 0 |
| Labels:            | None |
| Description |
The replication tests rpl.rpl_non_direct_mixed_mixing_engines and rpl.rpl_stm_mixing_engines execute very slowly: over four minutes (240 seconds) on my system with an optimized debug build, and almost eight minutes (480 seconds) on the mandatory staging builder that runs MemorySanitizer with --big-test. I think that the tests must be split or cleaned up so that they spend less time waiting for something and more time executing interesting steps. Because the tests are marked as big, they run very seldom, and despite that, they fail fairly often. As a temporary workaround, I will disable the tests on the MSAN builder, hoping to reduce the per-push test run time by a couple of minutes.
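For reference, a minimal sketch of how a test is typically disabled in mysql-test-run (mtr): an entry in the suite's disabled.def file. The file path and reason text below are assumptions for illustration, not the actual change; note also that a plain disabled.def entry disables the test on every builder, so restricting the disable to the MSAN builder would normally be done in the builder configuration instead.

```
# suite/rpl/disabled.def (hypothetical entries; the format is "testname : reason")
rpl_non_direct_mixed_mixing_engines : MDEV-31790 extremely slow, disabled pending split/cleanup
rpl_stm_mixing_engines              : MDEV-31790 extremely slow, disabled pending split/cleanup
```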
| Comments |
| Comment by Marko Mäkelä [ 2023-07-28 ] |
Actually, the test run-time varies a lot. Two recent examples:
I tried running these locally on the same commit, in a debug build without MSAN instrumentation:
During the execution, I see that 6 of the 12 mariadbd processes are consuming some 20% to 30% of a single CPU core, while the other 6 are almost idle (a few per cent of a single CPU core). Each mariadb-test client is also spending about 20% to 30% of a single CPU core. The (user+sys)/real ratio is 6.23, so it appears that each of the 6 concurrent test instances is effectively running on a single CPU core (instead of 3 cores: the client and the 2 servers). It may be that there is not that much excessive sleeping or waiting for timeouts going on, but simply context switching and synchronous communication.

I repeated the exercise with a local clang-15 MSAN build of the same commit:
Again, the (user+sys)/real ratio is about 6. The MSAN overhead for these 6 tests appears to be 238 s/42 s = 5⅔, much more than the typical overhead. For the test innodb.innodb on my system, the execution times are 10942 ms and 6217 ms, that is, MSAN takes only 176% of the time of a non-MSAN debug build. These tests were run on a RAM disk, occupying on average only 6 of the 40 available hardware threads. In the buildbot environment, or when running the whole test suite, the times will be considerably longer.
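As a sanity check, the ratios quoted above can be reproduced from the raw figures in this comment (nothing below is newly measured; it only restates the arithmetic):

```python
# Reproduce the slowdown ratios quoted in the comment above.
# All input figures come from the comment itself.

def overhead(instrumented: float, baseline: float) -> float:
    """Slowdown factor of an instrumented build relative to a baseline build."""
    return instrumented / baseline

# MSAN vs. plain debug build for the 6 rpl tests: 238 s vs. 42 s
msan_rpl = overhead(238, 42)          # ~5.67, i.e. the quoted 5 2/3

# Typical MSAN overhead, using innodb.innodb: 10942 ms vs. 6217 ms
msan_innodb = overhead(10942, 6217)   # ~1.76, i.e. the quoted 176%

print(f"rpl tests under MSAN: {msan_rpl:.2f}x slower")
print(f"innodb.innodb under MSAN: {msan_innodb:.2f}x slower")
```

The point of the comparison is that these two tests are roughly three times more sensitive to MSAN instrumentation than a typical test, which suggests the bottleneck is instrumented work (or instrumented synchronization), not plain sleeping.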
| Comment by Marko Mäkelä [ 2023-07-28 ] |
What motivated me to file this was this 11.0 run:
The last two tests were running for a couple of minutes after the previous tests completed. More specifically, the w5 server error log ends in:
The w1 server error log for the last test spans a whopping 7 minutes and 25 seconds (actually 3 seconds more than the 442716 ms reported in the test client output):
This test ended almost 3 minutes after the previous test (the 160-second stress.ddl_myisam). In other words, not running these extremely slow tests could help the mandatory MSAN staging builds finish 3 minutes faster.