Details

    • Task
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Won't Fix
    • N/A
    • N/A
    • Buildbot
    • None

    Description

      marko suggested disabling the Valgrind builder for 10.5+ branches because we have working ASAN and MSAN, and (on the old buildbot only) UBSAN on these branches.

      Attachments

        Issue Links

          Activity

            ASAN shadow bytes are roughly equivalent to the A bits of Valgrind memcheck. One ASAN shadow byte covers 64 bytes, while the Valgrind A bits have 1-byte granularity.
            MSAN shadow bytes should be strictly equivalent to the V bits of Valgrind memcheck.

            Before 10.5, the code is too broken for MSAN, mainly because before MDEV-14024 updated the Perl Compatible Regular Expression library, it was massively broken (also for ASAN if you let it instrument that code).

            For branches up to 10.4, a Valgrind builder might still be useful.

            marko Marko Mäkelä added a comment - ASAN shadow bytes are roughly equivalent to the A bits of Valgrind memcheck . One ASAN shadow byte covers 64 bytes, while the Valgrind A bits have 1-byte granularity. MSAN shadow bytes should be strictly equivalent to the V bits of Valgrind memcheck . Before 10.5, the code is too broken for MSAN, mainly because before MDEV-14024 updated the Perl Compatible Regular Expression library, it was massively broken (also for ASAN if you let it instrument that code). For branches up to 10.4, a Valgrind builder might still be useful.
            marko Marko Mäkelä added a comment - - edited

            I made some observations while addressing what I consider bogus failures with Valgrind (MDEV-29710).

            When run under Valgrind, the server can be a lot slower, because Valgrind is a single-threaded JIT based CPU that emulates multiple threads by interleaving them. InnoDB is a multi-threaded storage engine. Shutdown or crash recovery can take very long under Valgrind when it chooses an unfortunate scheduling. For example, the mtr framework would silently and forcibly kill the server in the middle of STOP SLAVE or while InnoDB is shutting down. The test could subsequently fail in surprising ways.

            Moreover, as lock-free or std::atomic based performance improvements have been added to InnoDB, the Valgrind tests could be taking longer to run. Also new tests are being added to later branches. 10.3 typically completes tests in less than 1 hour, while later branches require 1 to 2 hours, depending on the assigned worker. MDEV-29508 was causing additional 2-hour timeouts in 10.5 and later until I disabled that test under Valgrind.

            I think that running Valgrind for 10.5 or later is a serious waste of resources. On the currently latest available 10.5 push, the ASAN build+test would complete in 36 minutes, and the MSAN build+test in 22 minutes.

            Unlike Valgrind, ASAN and MSAN cover all code (not just the server) and they are much more likely to catch errors that involve race conditions. And unlike Valgrind, the overhead of ASAN or MSAN is fairly low for multi-threaded programs (I think about 250% or 350%). Valgrind can make many things run hundreds or thousands of times slower.

            marko Marko Mäkelä added a comment - - edited I made some observations while addressing what I consider bogus failures with Valgrind ( MDEV-29710 ). When run under Valgrind, the server can be a lot slower, because Valgrind is a single-threaded JIT based CPU that emulates multiple threads by interleaving them. InnoDB is a multi-threaded storage engine. Shutdown or crash recovery can take very long under Valgrind when it chooses an unfortunate scheduling. For example, the mtr framework would silently and forcibly kill the server in the middle of STOP SLAVE or while InnoDB is shutting down. The test could subsequently fail in surprising ways. Moreover, as lock-free or std::atomic based performance improvements have been added to InnoDB, the Valgrind tests could be taking longer to run. Also new tests are being added to later branches. 10.3 typically completes tests in less than 1 hour, while later branches require 1 to 2 hours, depending on the assigned worker. MDEV-29508 was causing additional 2-hour timeouts in 10.5 and later until I disabled that test under Valgrind. I think that running Valgrind for 10.5 or later is a serious waste of resources. On the currently latest available 10.5 push, the ASAN build+test would complete in 36 minutes, and the MSAN build+test in 22 minutes. Unlike Valgrind, ASAN and MSAN cover all code (not just the server) and they are much more likely to catch errors that involve race conditions. And unlike Valgrind, the overhead of ASAN or MSAN is fairly low for multi-threaded programs (I think about 250% or 350%). Valgrind can make many things run hundreds or thousands of times slower.
            marko Marko Mäkelä added a comment - - edited

            Here is an example of a test that would run in 4 seconds without Valgrind:

            10.6 8872d2ee71cc65826909f6a64413ee46d1c5218d

            innodb.innodb-index-online '16k,crypt,innodb' w18 [ fail ]
                    Test ended at 2022-10-06 10:47:49
            CURRENT_TEST: innodb.innodb-index-online
            --- /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.result~	2022-10-06 10:33:27.155808599 +0000
            +++ /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.reject	2022-10-06 10:47:42.955415916 +0000
            @@ -445,11 +445,12 @@
             ALTER TABLE t1 ADD UNIQUE INDEX c3p5(c3(5));
             connection default;
             SET DEBUG_SYNC = 'now WAIT_FOR c3p5_created';
            +Warnings:
            +Warning	1639	debug sync point wait timed out
             SELECT sf.name, sf.pos FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES si
             INNER JOIN INFORMATION_SCHEMA.INNODB_SYS_FIELDS sf
             ON si.index_id = sf.index_id WHERE si.name = '?c3p5';
             name	pos
            -c3	0
             SET DEBUG_SYNC = 'ib_after_row_insert SIGNAL ins_done WAIT_FOR ddl_timed_out';
             INSERT INTO t1 VALUES(347,33101,NULL);
             connection con1;

            This suggests that any DEBUG_SYNC test could fail in the same way when running under Valgrind, from time to time. I believe that the reason is that Valgrind’s built-in scheduler that emulates multiple threads may choose very unfair schedulings, analogous to rr record --chaos of the rr tool, which uses ptrace() to run the tracee one thread at a time.

            marko Marko Mäkelä added a comment - - edited Here is an example of a test that would run in 4 seconds without Valgrind: 10.6 8872d2ee71cc65826909f6a64413ee46d1c5218d innodb.innodb-index-online '16k,crypt,innodb' w18 [ fail ] Test ended at 2022-10-06 10:47:49 CURRENT_TEST: innodb.innodb-index-online --- /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.result~ 2022-10-06 10:33:27.155808599 +0000 +++ /buildbot/amd64-ubuntu-1804-valgrind/build/mysql-test/suite/innodb/r/innodb-index-online,crypt.reject 2022-10-06 10:47:42.955415916 +0000 @@ -445,11 +445,12 @@ ALTER TABLE t1 ADD UNIQUE INDEX c3p5(c3(5)); connection default; SET DEBUG_SYNC = 'now WAIT_FOR c3p5_created'; +Warnings: +Warning 1639 debug sync point wait timed out SELECT sf.name, sf.pos FROM INFORMATION_SCHEMA.INNODB_SYS_INDEXES si INNER JOIN INFORMATION_SCHEMA.INNODB_SYS_FIELDS sf ON si.index_id = sf.index_id WHERE si.name = '?c3p5'; name pos -c3 0 SET DEBUG_SYNC = 'ib_after_row_insert SIGNAL ins_done WAIT_FOR ddl_timed_out'; INSERT INTO t1 VALUES(347,33101,NULL); connection con1; … This suggests that any DEBUG_SYNC test could fail in the same way when running under Valgrind, from time to time. I believe that the reason is that Valgrind’s built-in scheduler that emulates multiple threads may choose very unfair schedulings, analogous to rr record --chaos of the rr tool, which uses ptrace() to run the tracee one thread at a time.
            monty Michael Widenius added a comment - - edited

            I personally find valgrind easier and better to use at home than using ASAN and MSAN (that require specific builds).
            I have also found that valgrind finds issues that neither ASAN and MSAN finds.
            On reason is the 1 bit compared to 64 bit checking, but also because valgrind builds has a different MEM_ROOT code that makes it easier to find writing into wrong memory (as memory when compiled with valgrind is not allocated in big blocks).
            In some sense, the different timings that valgrind causes is good as it exposes bugs that we would not found otherwise.
            The increased run time with valgrind is of NO IMPORTANCE WHATSOEVER if it can find hard-to-find-bugs that we would not otherwise find. One does not have to use valgrind if one doesn't want to!
            Using valgrind in buildbot is still important, both to find bugs that ASAN and MSAN does not find (see above) and to ensure that valgrind can be used by developers.

            conclusion: There is no reason to disable valgrind support or valgrind builds in buildbot

            monty Michael Widenius added a comment - - edited I personally find valgrind easier and better to use at home than using ASAN and MSAN (that require specific builds). I have also found that valgrind finds issues that neither ASAN and MSAN finds. On reason is the 1 bit compared to 64 bit checking, but also because valgrind builds has a different MEM_ROOT code that makes it easier to find writing into wrong memory (as memory when compiled with valgrind is not allocated in big blocks). In some sense, the different timings that valgrind causes is good as it exposes bugs that we would not found otherwise. The increased run time with valgrind is of NO IMPORTANCE WHATSOEVER if it can find hard-to-find-bugs that we would not otherwise find. One does not have to use valgrind if one doesn't want to! Using valgrind in buildbot is still important, both to find bugs that ASAN and MSAN does not find (see above) and to ensure that valgrind can be used by developers. conclusion: There is no reason to disable valgrind support or valgrind builds in buildbot

            I would be curious to see one example of a genuine issue that Valgrind finds and ASAN and MSAN do not.

            marko Marko Mäkelä added a comment - I would be curious to see one example of a genuine issue that Valgrind finds and ASAN and MSAN do not.
            marko Marko Mäkelä added a comment - - edited

            I just found an example of something that should be unlikely be implemented in Valgrind: MSAN_OPTIONS=poison_in_dtor=1, which was recently enabled by default, is flagging some code in MDEV-30936. It looks like until MDEV-30942 has been fixed, we must disable that setting.

            marko Marko Mäkelä added a comment - - edited I just found an example of something that should be unlikely be implemented in Valgrind: MSAN_OPTIONS=poison_in_dtor=1 , which was recently enabled by default , is flagging some code in MDEV-30936 . It looks like until MDEV-30942 has been fixed, we must disable that setting.

            For what it is worth, Valgrind used to issue bogus warnings for Clang-generated code. Now it has been "improved". I get (among others) the following error when trying to invoke Valgrind on a WITH_VALGRIND=ON executable that was built with Clang 16.0.6.

            ### unhandled dwarf2 abbrev form code 0x23
            ==822403== Valgrind: debuginfo reader: ensure_valid failed:
            ==822403== Valgrind:   during call to ML_(img_get)
            ==822403== Valgrind:   request for range [2553503080, +4) exceeds
            ==822403== Valgrind:   valid image size of 271464928 for image:
            ==822403== Valgrind:   "/dev/shm/10.4m/sql/mysqld"
            

            When I omit --valgrind from the mtr options, the tests will pass.

            marko Marko Mäkelä added a comment - For what it is worth, Valgrind used to issue bogus warnings for Clang-generated code. Now it has been "improved". I get (among others) the following error when trying to invoke Valgrind on a WITH_VALGRIND=ON executable that was built with Clang 16.0.6. ### unhandled dwarf2 abbrev form code 0x23 ==822403== Valgrind: debuginfo reader: ensure_valid failed: ==822403== Valgrind: during call to ML_(img_get) ==822403== Valgrind: request for range [2553503080, +4) exceeds ==822403== Valgrind: valid image size of 271464928 for image: ==822403== Valgrind: "/dev/shm/10.4m/sql/mysqld" When I omit --valgrind from the mtr options, the tests will pass.

            People

              monty Michael Widenius
              vladbogo Vlad Bogolin
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0d
                  0d
                  Logged:
                  Time Spent - 0.25h
                  0.25h