[MDBF-406] Disable valgrind for 10.5+ branches - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Won't Fix
Affects Version/s: N/A
Fix Version/s: N/A
Component/s: Buildbot
Labels:
None

Description

marko suggested disabling the Valgrind builder for 10.5+ branches because we have working ASAN and MSAN, and (on the old buildbot only) UBSAN on these branches.

Attachments

Issue Links

is part of

MDBF-41 Milestone 5: Desirable fixes

Closed

relates to

MDEV-29710 Valgrind tests massively fail due to silently killing server on shutdown timeout

Closed

Activity

Ascending order - Click to sort in descending order

Marko Mäkelä added a comment - 2022-05-16 10:21

ASAN shadow bytes are roughly equivalent to the A bits of Valgrind memcheck. One ASAN shadow byte covers 64 bytes, while the Valgrind A bits have 1-byte granularity.
MSAN shadow bytes should be strictly equivalent to the V bits of Valgrind memcheck.

Before 10.5, the code is too broken for MSAN, mainly because before ~~MDEV-14024~~ updated the Perl Compatible Regular Expression library, it was massively broken (also for ASAN if you let it instrument that code).

For branches up to 10.4, a Valgrind builder might still be useful.

Marko Mäkelä added a comment - 2022-05-16 10:21 ASAN shadow bytes are roughly equivalent to the A bits of Valgrind memcheck . One ASAN shadow byte covers 64 bytes, while the Valgrind A bits have 1-byte granularity. MSAN shadow bytes should be strictly equivalent to the V bits of Valgrind memcheck . Before 10.5, the code is too broken for MSAN, mainly because before MDEV-14024 updated the Perl Compatible Regular Expression library, it was massively broken (also for ASAN if you let it instrument that code). For branches up to 10.4, a Valgrind builder might still be useful.

Marko Mäkelä added a comment - 2022-10-06 05:25 - edited

I made some observations while addressing what I consider bogus failures with Valgrind (~~MDEV-29710~~).

When run under Valgrind, the server can be a lot slower, because Valgrind is a single-threaded JIT based CPU that emulates multiple threads by interleaving them. InnoDB is a multi-threaded storage engine. Shutdown or crash recovery can take very long under Valgrind when it chooses an unfortunate scheduling. For example, the mtr framework would silently and forcibly kill the server in the middle of STOP SLAVE or while InnoDB is shutting down. The test could subsequently fail in surprising ways.

Moreover, as lock-free or std::atomic based performance improvements have been added to InnoDB, the Valgrind tests could be taking longer to run. Also new tests are being added to later branches. 10.3 typically completes tests in less than 1 hour, while later branches require 1 to 2 hours, depending on the assigned worker. ~~MDEV-29508~~ was causing additional 2-hour timeouts in 10.5 and later until I disabled that test under Valgrind.

I think that running Valgrind for 10.5 or later is a serious waste of resources. On the currently latest available 10.5 push, the ASAN build+test would complete in 36 minutes, and the MSAN build+test in 22 minutes.

Unlike Valgrind, ASAN and MSAN cover all code (not just the server) and they are much more likely to catch errors that involve race conditions. And unlike Valgrind, the overhead of ASAN or MSAN is fairly low for multi-threaded programs (I think about 250% or 350%). Valgrind can make many things run hundreds or thousands of times slower.

Marko Mäkelä added a comment - 2022-10-06 05:25 - edited I made some observations while addressing what I consider bogus failures with Valgrind ( MDEV-29710 ). When run under Valgrind, the server can be a lot slower, because Valgrind is a single-threaded JIT based CPU that emulates multiple threads by interleaving them. InnoDB is a multi-threaded storage engine. Shutdown or crash recovery can take very long under Valgrind when it chooses an unfortunate scheduling. For example, the mtr framework would silently and forcibly kill the server in the middle of STOP SLAVE or while InnoDB is shutting down. The test could subsequently fail in surprising ways. Moreover, as lock-free or std::atomic based performance improvements have been added to InnoDB, the Valgrind tests could be taking longer to run. Also new tests are being added to later branches. 10.3 typically completes tests in less than 1 hour, while later branches require 1 to 2 hours, depending on the assigned worker. MDEV-29508 was causing additional 2-hour timeouts in 10.5 and later until I disabled that test under Valgrind. I think that running Valgrind for 10.5 or later is a serious waste of resources. On the currently latest available 10.5 push, the ASAN build+test would complete in 36 minutes, and the MSAN build+test in 22 minutes. Unlike Valgrind, ASAN and MSAN cover all code (not just the server) and they are much more likely to catch errors that involve race conditions. And unlike Valgrind, the overhead of ASAN or MSAN is fairly low for multi-threaded programs (I think about 250% or 350%). Valgrind can make many things run hundreds or thousands of times slower.

Marko Mäkelä added a comment - 2022-10-06 11:26 - edited

Here is an example of a test that would run in 4 seconds without Valgrind:

10.6 8872d2ee71cc65826909f6a64413ee46d1c5218d
innodb.innodb-index-online '16k,crypt,innodb' w18 [ fail ]
Test ended at 2022-10-06 10:47:49
CURRENT_TEST: innodb.innodb-index-online
--- /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.result~ 2022-10-06 10:33:27.155808599 +0000
+++ /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.reject 2022-10-06 10:47:42.955415916 +0000
@@ -445,11 +445,12 @@
ALTER TABLE t1 ADD UNIQUE INDEX c3p5(c3(5));
connection default;
SET DEBUG_SYNC = 'now WAIT_FOR c3p5_created';
+Warnings:
+Warning 1639 debug sync point wait timed out
SELECT sf.name, sf.pos FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES si
INNER JOIN INFORMATION_SCHEMA.INNODB_SYS_FIELDS sf
ON si.index_id = sf.index_id WHERE si.name = '?c3p5';
name pos
-c3 0
SET DEBUG_SYNC = 'ib_after_row_insert SIGNAL ins_done WAIT_FOR ddl_timed_out';
INSERT INTO t1 VALUES(347,33101,NULL);
connection con1;
…

This suggests that any DEBUG_SYNC test could fail in the same way when running under Valgrind, from time to time. I believe that the reason is that Valgrind’s built-in scheduler that emulates multiple threads may choose very unfair schedulings, analogous to rr record --chaos of the rr tool, which uses ptrace() to run the tracee one thread at a time.

Marko Mäkelä added a comment - 2022-10-06 11:26 - edited Here is an example of a test that would run in 4 seconds without Valgrind: 10.6 8872d2ee71cc65826909f6a64413ee46d1c5218d innodb.innodb-index-online '16k,crypt,innodb' w18 [ fail ] Test ended at 2022-10-06 10:47:49 CURRENT_TEST: innodb.innodb-index-online --- /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.result~ 2022-10-06 10:33:27.155808599 +0000 +++ /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.reject 2022-10-06 10:47:42.955415916 +0000 @@ -445,11 +445,12 @@ ALTER TABLE t1 ADD UNIQUE INDEX c3p5(c3(5)); connection default; SET DEBUG_SYNC = 'now WAIT_FOR c3p5_created'; +Warnings: +Warning 1639 debug sync point wait timed out SELECT sf.name, sf.pos FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES si INNER JOIN INFORMATION_SCHEMA.INNODB_SYS_FIELDS sf ON si.index_id = sf.index_id WHERE si.name = '?c3p5'; name pos -c3 0 SET DEBUG_SYNC = 'ib_after_row_insert SIGNAL ins_done WAIT_FOR ddl_timed_out'; INSERT INTO t1 VALUES(347,33101,NULL); connection con1; … This suggests that any DEBUG_SYNC test could fail in the same way when running under Valgrind, from time to time. I believe that the reason is that Valgrind’s built-in scheduler that emulates multiple threads may choose very unfair schedulings, analogous to rr record --chaos of the rr tool, which uses ptrace() to run the tracee one thread at a time.

Michael Widenius added a comment - 2022-10-18 09:00 - edited

I personally find valgrind easier and better to use at home than using ASAN and MSAN (that require specific builds).
I have also found that valgrind finds issues that neither ASAN and MSAN finds.
On reason is the 1 bit compared to 64 bit checking, but also because valgrind builds has a different MEM_ROOT code that makes it easier to find writing into wrong memory (as memory when compiled with valgrind is not allocated in big blocks).
In some sense, the different timings that valgrind causes is good as it exposes bugs that we would not found otherwise.
The increased run time with valgrind is of NO IMPORTANCE WHATSOEVER if it can find hard-to-find-bugs that we would not otherwise find. One does not have to use valgrind if one doesn't want to!
Using valgrind in buildbot is still important, both to find bugs that ASAN and MSAN does not find (see above) and to ensure that valgrind can be used by developers.

conclusion: There is no reason to disable valgrind support or valgrind builds in buildbot

Michael Widenius added a comment - 2022-10-18 09:00 - edited I personally find valgrind easier and better to use at home than using ASAN and MSAN (that require specific builds). I have also found that valgrind finds issues that neither ASAN and MSAN finds. On reason is the 1 bit compared to 64 bit checking, but also because valgrind builds has a different MEM_ROOT code that makes it easier to find writing into wrong memory (as memory when compiled with valgrind is not allocated in big blocks). In some sense, the different timings that valgrind causes is good as it exposes bugs that we would not found otherwise. The increased run time with valgrind is of NO IMPORTANCE WHATSOEVER if it can find hard-to-find-bugs that we would not otherwise find. One does not have to use valgrind if one doesn't want to! Using valgrind in buildbot is still important, both to find bugs that ASAN and MSAN does not find (see above) and to ensure that valgrind can be used by developers. conclusion: There is no reason to disable valgrind support or valgrind builds in buildbot

Marko Mäkelä added a comment - 2022-10-18 12:43

I would be curious to see one example of a genuine issue that Valgrind finds and ASAN and MSAN do not.

Marko Mäkelä added a comment - 2022-10-18 12:43 I would be curious to see one example of a genuine issue that Valgrind finds and ASAN and MSAN do not.

Marko Mäkelä added a comment - 2023-03-28 06:44 - edited

I just found an example of something that should be unlikely be implemented in Valgrind: MSAN_OPTIONS=poison_in_dtor=1, which was recently enabled by default, is flagging some code in ~~MDEV-30936~~. It looks like until MDEV-30942 has been fixed, we must disable that setting.

Marko Mäkelä added a comment - 2023-03-28 06:44 - edited I just found an example of something that should be unlikely be implemented in Valgrind: MSAN_OPTIONS=poison_in_dtor=1 , which was recently enabled by default , is flagging some code in MDEV-30936 . It looks like until MDEV-30942 has been fixed, we must disable that setting.

Marko Mäkelä added a comment - 2023-11-07 10:25

For what it is worth, Valgrind used to issue bogus warnings for Clang-generated code. Now it has been "improved". I get (among others) the following error when trying to invoke Valgrind on a WITH_VALGRIND=ON executable that was built with Clang 16.0.6.

### unhandled dwarf2 abbrev form code 0x23

==822403== Valgrind: debuginfo reader: ensure_valid failed:

==822403== Valgrind:   during call to ML_(img_get)

==822403== Valgrind:   request for range [2553503080, +4) exceeds

==822403== Valgrind:   valid image size of 271464928 for image:

==822403== Valgrind:   "/dev/shm/10.4m/sql/mysqld"

When I omit --valgrind from the mtr options, the tests will pass.

Marko Mäkelä added a comment - 2023-11-07 10:25 For what it is worth, Valgrind used to issue bogus warnings for Clang-generated code. Now it has been "improved". I get (among others) the following error when trying to invoke Valgrind on a WITH_VALGRIND=ON executable that was built with Clang 16.0.6. ### unhandled dwarf2 abbrev form code 0x23 ==822403== Valgrind: debuginfo reader: ensure_valid failed: ==822403== Valgrind: during call to ML_(img_get) ==822403== Valgrind: request for range [2553503080, +4) exceeds ==822403== Valgrind: valid image size of 271464928 for image: ==822403== Valgrind: "/dev/shm/10.4m/sql/mysqld" When I omit --valgrind from the mtr options, the tests will pass.

People

Assignee:: Michael Widenius

Reporter:: Vlad Bogolin

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2022-05-04 12:47

Updated:: 2023-11-07 10:25

Resolved:: 2022-10-18 09:01

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

0.25h

MariaDB Foundation Development