Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31790

Extremely slow tests rpl.rpl_non_direct_mixed_mixing_engines and rpl.rpl_stm_mixing_engines

Details

    • Bug
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL), 11.1(EOL), 11.2(EOL)
    • 10.5, 10.6, 10.11
    • Replication, Tests
    • None

    Description

      The replication tests rpl.rpl_non_direct_mixed_mixing_engines and rpl.rpl_stm_mixing_engines execute very slowly: over 6 minutes (240 seconds) on my system with an optimized debug build, and almost 8 minutes (480 seconds) on the mandatory staging builder that runs MemorySanitizer with --big-test.

      I think that the tests must be split or cleaned up so that they spend less time waiting for something and more time executing interesting steps. Because the tests are marked as big, they run very seldomly, and despite that, they fail pretty often.

      As a temporary work around, I will disable the tests on the MSAN builder, hoping to reduce the per-push test run time by a couple of minutes.

      Attachments

        Issue Links

          Activity

            Actually, the test run-time varies a lot. Two recent examples:
            10.5 (this is skipping the related test rpl.rpl_non_direct_row_mixing_engines, but I did not find the corresponding definition in the configuration):

            10.5 35533dc0b381ddbc665531d39ce3d3e9a8af2e88

            rpl.rpl_stm_mixing_engines 'innodb,stmt' w29 [ pass ]  144565
            rpl.rpl_row_mixing_engines 'innodb,row'  w30 [ pass ]  229229
            rpl.rpl_mixed_mixing_engines 'innodb,mix' w1 [ pass ]  249206
            rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w1 [ pass ]  185839
            rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w29 [ pass ]  247684
            

            I tried running these locally on the same commit, in a debug build without MSAN instrumentation:

            time ./mtr --parallel=6 rpl.rpl_{non_direct_,}{stm,row,mixed}_mixing_engines
            

            The servers were restarted 0 times
            Spent 230.703 of 42 seconds executing testcases
             
            Completed: All 6 tests were successful.
             
             
            real	0m41.903s
            user	2m54.432s
            sys	1m26.707s
            

            During the execution, I see that 6 of the 12 mariadbd processes are consuming some 20% to 30% of a single CPU core, while the other 6 are almost idle (a few per cent of a single CPU core). About 20% to 30% of a single CPU core is being spent by each mariadb-test client. The (user+sys)/real ratio is 6.23, so it appears that each of the 6 concurrent test instances is effectively running on a single CPU core (instead of 3: client and 2 servers). It may be that there is not that much excessive sleeping or waiting for timeouts going on, but simply context switching and synchronous communication.

            I repeated the exercise with a local clang-15 MSAN build of the same commit:

            10.5 35533dc0b381ddbc665531d39ce3d3e9a8af2e88

            rpl.rpl_row_mixing_engines 'innodb,row'  w1 [ pass ]  110355
            rpl.rpl_mixed_mixing_engines 'innodb,mix' w5 [ pass ]  110683
            rpl.rpl_non_direct_row_mixing_engines 'innodb,row' w6 [ pass ]  111352
            rpl.rpl_stm_mixing_engines 'innodb,stmt' w2 [ pass ]  111362
            rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w3 [ pass ]  111361
            rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w4 [ pass ]  111732
            --------------------------------------------------------------------------
            The servers were restarted 0 times
            Spent 666.845 of 117 seconds executing testcases
             
            Completed: All 6 tests were successful.
             
             
            real	1m57.685s
            user	8m11.913s
            sys	4m24.366s
            

            Again, the (user+sys)/real ratio is about 6. The MSAN overhead for these 6 tests appears to be (238s/42s)=5⅔, much more than the typical overhead. For the test innodb.innodb on my system, the execution times are 10942ms and 6217ms, that is, MSAN only takes 176% the time of a non-MSAN debug build.

            These tests were run on a RAM disk, occupying in average only 6 of the 40 available hardware threads. In the buildbot environment or when running the whole test suite, the times will be considerably longer.

            marko Marko Mäkelä added a comment - Actually, the test run-time varies a lot. Two recent examples: 10.5 (this is skipping the related test rpl.rpl_non_direct_row_mixing_engines , but I did not find the corresponding definition in the configuration ): 10.5 35533dc0b381ddbc665531d39ce3d3e9a8af2e88 rpl.rpl_stm_mixing_engines 'innodb,stmt' w29 [ pass ] 144565 rpl.rpl_row_mixing_engines 'innodb,row' w30 [ pass ] 229229 rpl.rpl_mixed_mixing_engines 'innodb,mix' w1 [ pass ] 249206 rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w1 [ pass ] 185839 rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w29 [ pass ] 247684 I tried running these locally on the same commit, in a debug build without MSAN instrumentation: time . /mtr --parallel=6 rpl.rpl_{non_direct_,}{stm,row,mixed}_mixing_engines The servers were restarted 0 times Spent 230.703 of 42 seconds executing testcases   Completed: All 6 tests were successful.     real 0m41.903s user 2m54.432s sys 1m26.707s During the execution, I see that 6 of the 12 mariadbd processes are consuming some 20% to 30% of a single CPU core, while the other 6 are almost idle (a few per cent of a single CPU core). About 20% to 30% of a single CPU core is being spent by each mariadb-test client. The (user+sys)/real ratio is 6.23, so it appears that each of the 6 concurrent test instances is effectively running on a single CPU core (instead of 3: client and 2 servers). It may be that there is not that much excessive sleeping or waiting for timeouts going on, but simply context switching and synchronous communication. I repeated the exercise with a local clang-15 MSAN build of the same commit: 10.5 35533dc0b381ddbc665531d39ce3d3e9a8af2e88 rpl.rpl_row_mixing_engines 'innodb,row' w1 [ pass ] 110355 rpl.rpl_mixed_mixing_engines 'innodb,mix' w5 [ pass ] 110683 rpl.rpl_non_direct_row_mixing_engines 'innodb,row' w6 [ pass ] 111352 rpl.rpl_stm_mixing_engines 'innodb,stmt' w2 [ pass ] 111362 rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w3 [ pass ] 111361 rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w4 [ pass ] 111732 -------------------------------------------------------------------------- The servers were restarted 0 times Spent 666.845 of 117 seconds executing testcases   Completed: All 6 tests were successful.     real 1m57.685s user 8m11.913s sys 4m24.366s Again, the (user+sys)/real ratio is about 6. The MSAN overhead for these 6 tests appears to be (238s/42s)=5⅔, much more than the typical overhead. For the test innodb.innodb on my system, the execution times are 10942ms and 6217ms, that is, MSAN only takes 176% the time of a non-MSAN debug build. These tests were run on a RAM disk, occupying in average only 6 of the 40 available hardware threads. In the buildbot environment or when running the whole test suite, the times will be considerably longer.

            What motivated me to file this was this 11.0 run:

            11.0 f2b4972bd4f9d3f0131f156a6cbc3e0317571944

            rpl.rpl_non_direct_row_mixing_engines    [ skipped ]
            rpl.rpl_row_mixing_engines 'innodb,row'  w26 [ pass ]  494740
            rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w1 [ pass ]  502572
            rpl.rpl_mixed_mixing_engines 'innodb,mix' w15 [ pass ]  508157
            parts.partition_alter4_innodb 'innodb'   w12 [ pass ]  92192
            rpl.rpl_semi_sync_after_sync_row 'innodb,row' w26 [ pass ]  72836
            atomic.alter_table_aria 'innodb'         w10 [ pass ]  125183
            atomic.rename_table 'innodb'             w6 [ pass ]  89567
            rpl.rpl_semi_sync_after_sync 'innodb,stmt' w32 [ pass ]  72416
            stress.ddl_memory                        w18 [ pass ]  159190
            stress.ddl_myisam                        w5 [ pass ]  160570
            rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w15 [ pass ]  460095
            rpl.rpl_stm_mixing_engines 'innodb,stmt' w1 [ pass ]  442716
            

            The last two tests were running for a couple of minutes after the previous tests completed. More specifically, the w5 server error log ends in:

            CURRENT_TEST: stress.ddl_myisam
            2023-07-26 14:23:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
            2023-07-26 14:23:12 0 [Note] Debug sync points hit:                   6813106
            2023-07-26 14:23:12 0 [Note] Debug sync points executed:              0
            2023-07-26 14:23:12 0 [Note] Debug sync points max active per thread: 0
            2023-07-26 14:23:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: Shutdown complete
            

            The w1 server error log for the last test spans whopping 7 minutes and 25 seconds (actually 3 seconds more than the 442716ms reported in the test client output):

            CURRENT_TEST: rpl.rpl_stm_mixing_engines
            2023-07-26 14:18:38 1887 [Note] Deleted Master_info file '/dev/shm/var_auto_8m2g/1/mysqld.1/data/master.info'.
            2023-07-26 14:18:38 1887 [Note] Deleted Master_info file '/dev/shm/var_auto_8m2g/1/mysqld.1/data/relay-log.info'.
            2023-07-26 14:18:38 1889 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('')
            2023-07-26 14:26:03 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
            2023-07-26 14:26:03 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: Shutdown complete
            

            This test ended almost 3 minutes after the previous test (the 160-second stress.ddl_myisam). In other words, not running these extremely slow tests could help the mandatory MSAN staging builds finish 3 minutes faster.

            marko Marko Mäkelä added a comment - What motivated me to file this was this 11.0 run : 11.0 f2b4972bd4f9d3f0131f156a6cbc3e0317571944 rpl.rpl_non_direct_row_mixing_engines [ skipped ] rpl.rpl_row_mixing_engines 'innodb,row' w26 [ pass ] 494740 rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w1 [ pass ] 502572 rpl.rpl_mixed_mixing_engines 'innodb,mix' w15 [ pass ] 508157 … parts.partition_alter4_innodb 'innodb' w12 [ pass ] 92192 rpl.rpl_semi_sync_after_sync_row 'innodb,row' w26 [ pass ] 72836 atomic.alter_table_aria 'innodb' w10 [ pass ] 125183 atomic.rename_table 'innodb' w6 [ pass ] 89567 rpl.rpl_semi_sync_after_sync 'innodb,stmt' w32 [ pass ] 72416 stress.ddl_memory w18 [ pass ] 159190 stress.ddl_myisam w5 [ pass ] 160570 rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w15 [ pass ] 460095 rpl.rpl_stm_mixing_engines 'innodb,stmt' w1 [ pass ] 442716 The last two tests were running for a couple of minutes after the previous tests completed. More specifically, the w5 server error log ends in: CURRENT_TEST: stress.ddl_myisam 2023-07-26 14:23:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown 2023-07-26 14:23:12 0 [Note] Debug sync points hit: 6813106 2023-07-26 14:23:12 0 [Note] Debug sync points executed: 0 2023-07-26 14:23:12 0 [Note] Debug sync points max active per thread: 0 2023-07-26 14:23:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: Shutdown complete The w1 server error log for the last test spans whopping 7 minutes and 25 seconds (actually 3 seconds more than the 442716ms reported in the test client output): CURRENT_TEST: rpl.rpl_stm_mixing_engines 2023-07-26 14:18:38 1887 [Note] Deleted Master_info file '/dev/shm/var_auto_8m2g/1/mysqld.1/data/master.info'. 2023-07-26 14:18:38 1887 [Note] Deleted Master_info file '/dev/shm/var_auto_8m2g/1/mysqld.1/data/relay-log.info'. 2023-07-26 14:18:38 1889 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('') … 2023-07-26 14:26:03 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown … 2023-07-26 14:26:03 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: Shutdown complete This test ended almost 3 minutes after the previous test (the 160-second stress.ddl_myisam ). In other words, not running these extremely slow tests could help the mandatory MSAN staging builds finish 3 minutes faster.

            People

              Elkin Andrei Elkin
              marko Marko Mäkelä
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.