[MDEV-31790] Extremely slow tests rpl.rpl_non_direct_mixed_mixing_engines and rpl.rpl_stm_mixing_engines Created: 2023-07-28  Updated: 2023-11-28

Status: Open
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0, 11.1, 11.2
Fix Version/s: 10.4, 10.5, 10.6, 10.11, 11.0, 11.1, 11.2

Type: Bug Priority: Major
Reporter: Marko Mäkelä Assignee: Andrei Elkin
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Duplicate
is duplicated by MDEV-29979 rpl tests with mixing_engines sometim... Closed
Relates
relates to MDEV-20377 Make WITH_MSAN more usable Closed

 Description   

The replication tests rpl.rpl_non_direct_mixed_mixing_engines and rpl.rpl_stm_mixing_engines execute very slowly: over 6 minutes (240 seconds) on my system with an optimized debug build, and almost 8 minutes (480 seconds) on the mandatory staging builder that runs MemorySanitizer with --big-test.

I think that the tests must be split or cleaned up so that they spend less time waiting for something and more time executing interesting steps. Because the tests are marked as big, they run very seldomly, and despite that, they fail pretty often.

As a temporary work around, I will disable the tests on the MSAN builder, hoping to reduce the per-push test run time by a couple of minutes.



 Comments   
Comment by Marko Mäkelä [ 2023-07-28 ]

Actually, the test run-time varies a lot. Two recent examples:
10.5 (this is skipping the related test rpl.rpl_non_direct_row_mixing_engines, but I did not find the corresponding definition in the configuration):

10.5 35533dc0b381ddbc665531d39ce3d3e9a8af2e88

rpl.rpl_stm_mixing_engines 'innodb,stmt' w29 [ pass ]  144565
rpl.rpl_row_mixing_engines 'innodb,row'  w30 [ pass ]  229229
rpl.rpl_mixed_mixing_engines 'innodb,mix' w1 [ pass ]  249206
rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w1 [ pass ]  185839
rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w29 [ pass ]  247684

I tried running these locally on the same commit, in a debug build without MSAN instrumentation:

time ./mtr --parallel=6 rpl.rpl_{non_direct_,}{stm,row,mixed}_mixing_engines

The servers were restarted 0 times
Spent 230.703 of 42 seconds executing testcases
 
Completed: All 6 tests were successful.
 
 
real	0m41.903s
user	2m54.432s
sys	1m26.707s

During the execution, I see that 6 of the 12 mariadbd processes are consuming some 20% to 30% of a single CPU core, while the other 6 are almost idle (a few per cent of a single CPU core). About 20% to 30% of a single CPU core is being spent by each mariadb-test client. The (user+sys)/real ratio is 6.23, so it appears that each of the 6 concurrent test instances is effectively running on a single CPU core (instead of 3: client and 2 servers). It may be that there is not that much excessive sleeping or waiting for timeouts going on, but simply context switching and synchronous communication.

I repeated the exercise with a local clang-15 MSAN build of the same commit:

10.5 35533dc0b381ddbc665531d39ce3d3e9a8af2e88

rpl.rpl_row_mixing_engines 'innodb,row'  w1 [ pass ]  110355
rpl.rpl_mixed_mixing_engines 'innodb,mix' w5 [ pass ]  110683
rpl.rpl_non_direct_row_mixing_engines 'innodb,row' w6 [ pass ]  111352
rpl.rpl_stm_mixing_engines 'innodb,stmt' w2 [ pass ]  111362
rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w3 [ pass ]  111361
rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w4 [ pass ]  111732
--------------------------------------------------------------------------
The servers were restarted 0 times
Spent 666.845 of 117 seconds executing testcases
 
Completed: All 6 tests were successful.
 
 
real	1m57.685s
user	8m11.913s
sys	4m24.366s

Again, the (user+sys)/real ratio is about 6. The MSAN overhead for these 6 tests appears to be (238s/42s)=5⅔, much more than the typical overhead. For the test innodb.innodb on my system, the execution times are 10942ms and 6217ms, that is, MSAN only takes 176% the time of a non-MSAN debug build.

These tests were run on a RAM disk, occupying in average only 6 of the 40 available hardware threads. In the buildbot environment or when running the whole test suite, the times will be considerably longer.

Comment by Marko Mäkelä [ 2023-07-28 ]

What motivated me to file this was this 11.0 run:

11.0 f2b4972bd4f9d3f0131f156a6cbc3e0317571944

rpl.rpl_non_direct_row_mixing_engines    [ skipped ]
rpl.rpl_row_mixing_engines 'innodb,row'  w26 [ pass ]  494740
rpl.rpl_non_direct_stm_mixing_engines 'innodb,stmt' w1 [ pass ]  502572
rpl.rpl_mixed_mixing_engines 'innodb,mix' w15 [ pass ]  508157
parts.partition_alter4_innodb 'innodb'   w12 [ pass ]  92192
rpl.rpl_semi_sync_after_sync_row 'innodb,row' w26 [ pass ]  72836
atomic.alter_table_aria 'innodb'         w10 [ pass ]  125183
atomic.rename_table 'innodb'             w6 [ pass ]  89567
rpl.rpl_semi_sync_after_sync 'innodb,stmt' w32 [ pass ]  72416
stress.ddl_memory                        w18 [ pass ]  159190
stress.ddl_myisam                        w5 [ pass ]  160570
rpl.rpl_non_direct_mixed_mixing_engines 'innodb,mix' w15 [ pass ]  460095
rpl.rpl_stm_mixing_engines 'innodb,stmt' w1 [ pass ]  442716

The last two tests were running for a couple of minutes after the previous tests completed. More specifically, the w5 server error log ends in:

CURRENT_TEST: stress.ddl_myisam
2023-07-26 14:23:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2023-07-26 14:23:12 0 [Note] Debug sync points hit:                   6813106
2023-07-26 14:23:12 0 [Note] Debug sync points executed:              0
2023-07-26 14:23:12 0 [Note] Debug sync points max active per thread: 0
2023-07-26 14:23:12 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: Shutdown complete

The w1 server error log for the last test spans whopping 7 minutes and 25 seconds (actually 3 seconds more than the 442716ms reported in the test client output):

CURRENT_TEST: rpl.rpl_stm_mixing_engines
2023-07-26 14:18:38 1887 [Note] Deleted Master_info file '/dev/shm/var_auto_8m2g/1/mysqld.1/data/master.info'.
2023-07-26 14:18:38 1887 [Note] Deleted Master_info file '/dev/shm/var_auto_8m2g/1/mysqld.1/data/relay-log.info'.
2023-07-26 14:18:38 1889 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('')
2023-07-26 14:26:03 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd (initiated by: root[root] @ localhost [127.0.0.1]): Normal shutdown
2023-07-26 14:26:03 0 [Note] /home/buildbot/amd64-debian-11-msan/build/sql/mariadbd: Shutdown complete

This test ended almost 3 minutes after the previous test (the 160-second stress.ddl_myisam). In other words, not running these extremely slow tests could help the mandatory MSAN staging builds finish 3 minutes faster.

Generated at Thu Feb 08 10:26:29 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.