[MDEV-29979] rpl tests with mixing_engines sometimes time out in MSAN Created: 2022-11-08  Updated: 2023-08-01  Resolved: 2023-07-31

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.10, 10.11
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Angelique Sklavounos (Inactive) Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates MDEV-31790 Extremely slow tests rpl.rpl_non_dire... Open

 Description   

Happens with rpl.rpl_mixed_mixing_engines, rpl.rpl_non_direct_row_mixing_engines, rpl.rpl_non_direct_stm_mixing_engines.
A failure is shown below, but the output appears to vary randomly.

10.11 d186cb180

rpl.rpl_mixed_mixing_engines 'innodb,mix' w15 [ fail ]  timeout after 480 seconds
        Test ended at 2022-11-08 15:18:25
 
Test case timeout after 480 seconds
 
== /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl_mixed_mixing_engines.log == 
BEGIN;
include/show_binlog_events.inc
-e-e-e-e-e-e-e-e-e-e-e- >> B << -e-e-e-e-e-e-e-e-e-e-e-
-b-b-b-b-b-b-b-b-b-b-b- >> set-T << -b-b-b-b-b-b-b-b-b-b-b-
SET @var= fc_i_tt_5_suc(371, 2);
include/show_binlog_events.inc
-e-e-e-e-e-e-e-e-e-e-e- >> set-T << -e-e-e-e-e-e-e-e-e-e-e-
-b-b-b-b-b-b-b-b-b-b-b- >> set-NT << -b-b-b-b-b-b-b-b-b-b-b-
SET @var= fc_i_nt_3_tt_3_suc(371, 4);
include/show_binlog_events.inc
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
master-bin.000001	#	Gtid	#	#	BEGIN GTID #-#-#
master-bin.000001	#	Table_map	#	#	table_id: # (test.nt_3)
master-bin.000001	#	Table_map	#	#	table_id: # (test.nt_3)
master-bin.000001	#	Write_rows_v1	#	#	table_id: # flags: STMT_END_F
master-bin.000001	#	Query	#	#	COMMIT
-e-e-e-e-e-e-e-e-e-e-e- >> set-NT << -e-e-e-e-e-e-e-e-e-e-e-
-b-b-b-b-b-b-b-b-b-b-b- >> C << -b-b-b-b-b-b-b-b-b-b-b-
COMMIT;
include/show_binlog_events.inc
 
 == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/tmp/analyze-timeout-mysqld.1.err ==
mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/tmp/15/mys' (111)
 
 == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/tmp/analyze-timeout-mysqld.2.err ==
mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/tmp/15/mys' (111)
 
 
 - found 'core' (4/5)
gdb not found, cannot get the stack trace
 
Trying 'dbx' to get a backtrace
 
Trying 'lldb' to get a backtrace from coredump /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl.rpl_mixed_mixing_engines-innodb,mix/mysqld.2/data/core
Compressed file /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl.rpl_mixed_mixing_engines-innodb,mix/mysqld.2/data/core
 - found 'core' (5/5)
gdb not found, cannot get the stack trace
 
Trying 'dbx' to get a backtrace
 
Trying 'lldb' to get a backtrace from coredump /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl.rpl_mixed_mixing_engines-innodb,mix/mysqld.1/data/core
 - deleting it, already saved 5



 Comments   
Comment by Andrei Elkin [ 2023-04-17 ]

angelique.sklavounos, rpl.rpl_mixed_mixing_engines and all tests that invoke a huge rpl/inlculde/rpl_mixing_engines.test may need splitting into smaller ones. The latter invokes invokes rpl_mixing_engines.inc 370 times.

Could you please consider partition the affected tests into smaller ones?

Comment by Angelique Sklavounos (Inactive) [ 2023-05-01 ]

The tests have been passing for the past month, so there might not be a need anymore to split the tests up.

Comment by Roel Van de Paar [ 2023-07-31 ]

This ticket duplicates MDEV-31790, a more recent and fuller report, as far as test slowness in the MSAN builder goes. However, that item is not otherwise specifically discussed in the report, the title (and MTR) just use the term "time out", but that was the result of a crash.

There is no real timeout here; the issue is a crashed server. Note "found 'core' (5/5)", "mysqld.2/data/core" and "2002 Can't connect to local server through socket". However, as GDB was not available at the time, no reliable stack is present. Additionally the description says "A failure is shown below, but the output appears to vary random."

There is nothing further to analyse or fix here (except to ensure GDB is now available on the builder in question vladbogo fyi, to check); closing. If other crashes are observed on any builder, then they should be debugged individually.

Comment by Marko Mäkelä [ 2023-07-31 ]

Roel, I believe that as a result of a non-default 480-second (6-minute) timeout, the processes were killed externally by sending SIGABRT to them. Stack traces or core dumps produced from such a timeout-related killing could be useful for analyzing the root cause, in case it is a deadlock of threads. In this particular case, based on the findings in MDEV-31790, it might be that these tests may run extremely slowly, with no actual hang.

Comment by Vlad Bogolin [ 2023-07-31 ]

Roel hopefully this https://github.com/MariaDB/buildbot/pull/148 would solve the gdb issue

Generated at Thu Feb 08 10:12:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.