[MDEV-29710] Valgrind tests massively fail due to silently killing server on shutdown timeout Created: 2022-10-05  Updated: 2022-11-10  Resolved: 2022-10-06

Status: Closed
Project: MariaDB Server
Component/s: Tests
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.3.37, 10.4.27, 10.5.18, 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Marko Mäkelä
Resolution: Fixed Votes: 0
Labels: None
Environment:

Valgrind


Issue Links:
Relates
relates to MDBF-406 Disable valgrind for 10.5+ branches Closed

 Description   

A number of tests very often fail on the Valgrind builder in our CI system.

A prime example is the test innodb.table_flags, which performs a slow shutdown (innodb_fast_shutdown=0) before injecting some corruption using Perl code. On Valgrind, the SELECT statements after server restart would fail to return an error, because the slow shutdown did not proceed to the end as intended, but the server was silently and forcibly killed after an apparent 2-minute timeout was exceeded. As a result, crash recovery would apparently "heal" the corruption that was injected.
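
The failure mode described above can be illustrated with a minimal sketch. It is not the test framework's actual code. `wait_for_shutdown` and `ShutdownTimeout` are hypothetical names introduced here, and the timeout value is arbitrary. The sketch assumes that a shutdown which exceeds its time limit should still be killed, but that the kill should be reported as an error rather than hidden from the test.

```python
import subprocess


class ShutdownTimeout(Exception):
    """Raised when a process does not exit within the allowed time."""


def wait_for_shutdown(proc, timeout_seconds):
    """Wait for proc to exit and return its exit code.

    If the timeout expires, the process is killed, but the kill is
    reported by raising ShutdownTimeout instead of being hidden.
    """
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the killed process
        raise ShutdownTimeout(
            f"process did not shut down within {timeout_seconds} s and was killed"
        )
```

With this shape, a test that depends on a completed slow shutdown would stop with an explicit timeout error instead of continuing against a server that was killed midway.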

Some tests, such as main.log_slow, fail because under Valgrind some steps would not complete within a few tenths of a second as expected.

Some replication tests would occasionally fail due to something related to the STOP SLAVE statement.

In my experience, whatever Valgrind tests, AddressSanitizer (ASAN) and MemorySanitizer (MSAN, MDEV-20377) cover better. Because Valgrind employs a single-threaded JIT-based CPU emulator, ASAN and MSAN are much more likely to find bugs related to race conditions, such as MDEV-23097 or MDEV-25064.

I think that we should simply skip these tests on Valgrind (mostly via no_valgrind_unless_big.inc) to avoid lots of bogus failures.



 Comments   
Comment by Marko Mäkelä [ 2022-10-06 ]

I applied some changes up to 10.6. Later major version branches may require further tests to be disabled.

Comment by Michael Widenius [ 2022-10-18 ]

I would like to know why there is an extra 2-minute timeout for valgrind during kill. I have not noticed that on my system, and I don't understand why that would happen in buildbot either (for short running tests).

In my experience, it is easier to use valgrind than ASAN and MSAN, and it does uncover issues that neither of the above does by itself.

The tests disabled by this MDEV are:

  • innodb.innodb-index-online
  • innodb.innodb_buffer_pool_dump_abort_loads
  • innodb.innodb_bug53290
  • innodb.alter_copy
  • innodb.autoinc_persist
  • innodb.innodb-stats-initialize-failure
  • innodb.innodb-stats-sample
  • innodb.innodb-trim
  • innodb.innodb_bug30423
  • innodb.log_corruption
  • innodb.log_file
  • innodb.log_file_name
  • innodb.read_only_recover_committed
  • innodb.recovery_shutdown
  • innodb.undo_truncate
  • innodb.alter_missing_tablespace
  • innodb_fts.innodb_fts_plugin
  • innodb.ibuf_not_empty
  • innodb.innodb-get-fk
  • innodb.missing_tablespaces
  • innodb.table_flags
  • innodb.xa_recovery
  • mariabackup.log_page_corruption
  • rpl.rpl_semi_sync_shutdown_await_ack
  • rpl.rpl_gtid_stop_start
  • rpl.rpl_mdev12179
  • federated.federatedx_create_handlers
  • sys_vars.innodb_flush_method_func
  • sys_vars.innodb_buffer_pool_dump_at_shutdown_basic
  • main.update_use_source
  • main.func_json_notembedded
  • main.rowid_filter_innodb_debug
  • main.mysql_upgrade
  • main.plugin_auth
  • main.log_slow
  • All tests that include master-slave.inc !!!!

It is of course correct to disable slow running tests, but we need to
ensure that tests that are running differently under valgrind are not
just disabled as the timing differences that valgrind causes can show
bugs we don't know of.

What I dislike is that we disabled all master-slave tests with valgrind
with the change to master-slave.inc! This means that nothing in the 'rpl'
suite is tested with valgrind.
What is worse is that we now get a warning for each test in rpl when
we try to run them with the --valgrind flag!

This is not right, as there are many issues in the slave code that
should also be tested with valgrind!

Comment by Marko Mäkelä [ 2022-10-18 ]

The tests that restart InnoDB and that I disabled under Valgrind would typically fail due to either a forcibly killed shutdown or a timeout during recovery. InnoDB recovery is multi-threaded, because data pages will be read asynchronously and log will be applied as part of invoking the read completion callback function. This may have become much slower under Valgrind due to MDEV-21351 and other performance improvements in MariaDB Server 10.5.

I observed sporadic failures of a large number of replication tests, apparently related to server shutdown timing out. Some trouble was around the STOP SLAVE command. I did not diagnose the cause of those failures in detail, but I could believe that it was related to unfair scheduling.

What I did not disable were the tests that use DEBUG_SYNC. As I noted in MDBF-406, those can fail sporadically too, but the failure probability is considerably lower than with the tests that I disabled for Valgrind.

I do not see any value in a builder that is constantly failing. Any new failures would typically be ignored by all developers, because the builder is reliably failing all the time. Before this effort by me, this actually was the case with the Valgrind builder. I think that it is better to remove extremely frequent failures, because currently the cross-reference does not allow filtering out a particular builder.

Generated at Thu Feb 08 10:10:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.