[MDEV-29710] Valgrind tests massively fail due to silently killing server on shutdown timeout Created: 2022-10-05 Updated: 2022-11-10 Resolved: 2022-10-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Tests |
| Affects Version/s: | 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11 |
| Fix Version/s: | 10.3.37, 10.4.27, 10.5.18, 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Marko Mäkelä | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: | Valgrind |
| Description |
|
A number of tests fail very often on the Valgrind builder in our CI system. A prime example is the test innodb.table_flags, which performs a slow shutdown (innodb_fast_shutdown=0) before injecting some corruption using Perl code. On Valgrind, the SELECT statements after server restart would fail to return an error, because the slow shutdown did not run to completion as intended; instead, the server was silently and forcibly killed after an apparent 2-minute timeout was exceeded. As a result, crash recovery would apparently "heal" the corruption that was injected. Some tests, such as main.log_slow, fail because under Valgrind, some steps would not complete in a few tenths of a second as expected. Some replication tests would occasionally fail due to something related to the STOP SLAVE statement. In my experience, whatever Valgrind tests, AddressSanitizer (ASAN) and MemorySanitizer (MSAN, I think that we should simply skip these tests on Valgrind (mostly via no_valgrind_unless_big.inc) to avoid lots of bogus failures. |
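To illustrate the failure mode described above, here is a minimal sketch of a shutdown wait that reports a forced kill instead of hiding it. This is not the test framework's actual code; the function name `shutdown_with_timeout` and the "clean"/"killed" return values are assumptions made for illustration only.

```python
import subprocess
import sys


def shutdown_with_timeout(proc, timeout_s):
    """Wait up to timeout_s seconds for proc to exit on its own.

    Returns "clean" if the process exited within the timeout, or
    "killed" if it had to be forcibly terminated. (Hypothetical helper,
    not part of any real test framework.)
    """
    try:
        proc.wait(timeout=timeout_s)
        return "clean"
    except subprocess.TimeoutExpired:
        # The process did not finish in time: kill it, reap it, and
        # tell the caller so the kill is not silent.
        proc.kill()
        proc.wait()
        return "killed"
```

A caller that treats "killed" as a test failure would surface the timeout directly, instead of letting a later step (such as the SELECT after restart) fail for a reason that is hard to trace back to the shutdown.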
| Comments |
| Comment by Marko Mäkelä [ 2022-10-06 ] |
|
I applied some changes up to 10.6. Later major version branches may require further tests to be disabled. |
| Comment by Michael Widenius [ 2022-10-18 ] |
|
I would like to know why there is an extra 2-minute timeout for Valgrind during kill. I have not noticed that on my system, and I don't understand why that would happen in buildbot either (for short-running tests). In my experience, it is easier to use Valgrind than ASAN and MSAN, and it does uncover issues that neither of the above does by itself. The tests disabled by this MDEV are:
It is of course correct to disable slow-running tests, but we need to What I dislike is that we disabled all master-slave tests with Valgrind. This is not right, as there are many issues in the slave code that |
| Comment by Marko Mäkelä [ 2022-10-18 ] |
|
The tests that restart InnoDB and that I disabled under Valgrind would typically fail either due to a forcibly killed shutdown, or a timeout during recovery. InnoDB recovery is multi-threaded, because data pages will be read asynchronously and the log will be applied as part of invoking the read completion callback function. This may have become much slower under Valgrind due to I observed sporadic failures of a large number of replication tests, apparently related to server shutdown timing out. Some trouble was around the STOP SLAVE command. I did not diagnose the cause of those failures in detail, but I could believe that it was related to unfair scheduling. What I did not disable were tests that use DEBUG_SYNC. As I noted in I do not see any value in a builder that is constantly failing. Any new failures would typically be ignored by all developers, because the builder is reliably failing all the time. Before this effort by me, this actually was the case with the Valgrind builder. I think that it is better to remove extremely frequent failures, because currently the cross-reference does not allow filtering out a particular builder. |