Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-29979

rpl tests with mixing_engines sometimes time out in MSAN

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Duplicate
    • 10.10(EOL), 10.11
    • N/A
    • Replication, Tests
    • None

    Description

      Happens with rpl.rpl_mixed_mixing_engines, rpl.rpl_non_direct_row_mixing_engines, rpl.rpl_non_direct_stm_mixing_engines.
      A failure is shown below, but the output appears to vary randomly.

      10.11 d186cb180

      rpl.rpl_mixed_mixing_engines 'innodb,mix' w15 [ fail ]  timeout after 480 seconds
              Test ended at 2022-11-08 15:18:25
       
      Test case timeout after 480 seconds
       
      == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl_mixed_mixing_engines.log == 
      BEGIN;
      include/show_binlog_events.inc
      -e-e-e-e-e-e-e-e-e-e-e- >> B << -e-e-e-e-e-e-e-e-e-e-e-
      -b-b-b-b-b-b-b-b-b-b-b- >> set-T << -b-b-b-b-b-b-b-b-b-b-b-
      SET @var= fc_i_tt_5_suc(371, 2);
      include/show_binlog_events.inc
      -e-e-e-e-e-e-e-e-e-e-e- >> set-T << -e-e-e-e-e-e-e-e-e-e-e-
      -b-b-b-b-b-b-b-b-b-b-b- >> set-NT << -b-b-b-b-b-b-b-b-b-b-b-
      SET @var= fc_i_nt_3_tt_3_suc(371, 4);
      include/show_binlog_events.inc
      Log_name	Pos	Event_type	Server_id	End_log_pos	Info
      master-bin.000001	#	Gtid	#	#	BEGIN GTID #-#-#
      master-bin.000001	#	Table_map	#	#	table_id: # (test.nt_3)
      master-bin.000001	#	Table_map	#	#	table_id: # (test.nt_3)
      master-bin.000001	#	Write_rows_v1	#	#	table_id: # flags: STMT_END_F
      master-bin.000001	#	Query	#	#	COMMIT
      -e-e-e-e-e-e-e-e-e-e-e- >> set-NT << -e-e-e-e-e-e-e-e-e-e-e-
      -b-b-b-b-b-b-b-b-b-b-b- >> C << -b-b-b-b-b-b-b-b-b-b-b-
      COMMIT;
      include/show_binlog_events.inc
       
       == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/tmp/analyze-timeout-mysqld.1.err ==
      mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/tmp/15/mys' (111)
       
       == /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/tmp/analyze-timeout-mysqld.2.err ==
      mysqltest: Could not open connection 'default' after 500 attempts: 2002 Can't connect to local server through socket '/buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/tmp/15/mys' (111)
       
       
       - found 'core' (4/5)
      gdb not found, cannot get the stack trace
       
      Trying 'dbx' to get a backtrace
       
      Trying 'lldb' to get a backtrace from coredump /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl.rpl_mixed_mixing_engines-innodb,mix/mysqld.2/data/core
      Compressed file /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl.rpl_mixed_mixing_engines-innodb,mix/mysqld.2/data/core
       - found 'core' (5/5)
      gdb not found, cannot get the stack trace
       
      Trying 'dbx' to get a backtrace
       
      Trying 'lldb' to get a backtrace from coredump /buildbot/amd64-ubuntu-2004-msan/build/mysql-test/var/15/log/rpl.rpl_mixed_mixing_engines-innodb,mix/mysqld.1/data/core
       - deleting it, already saved 5
      

      Attachments

        Issue Links

          Activity

            Elkin Andrei Elkin added a comment -

            angelique.sklavounos, rpl.rpl_mixed_mixing_engines and all tests that invoke a huge rpl/inlculde/rpl_mixing_engines.test may need splitting into smaller ones. The latter invokes invokes rpl_mixing_engines.inc 370 times.

            Could you please consider partition the affected tests into smaller ones?

            Elkin Andrei Elkin added a comment - angelique.sklavounos , rpl.rpl_mixed_mixing_engines and all tests that invoke a huge rpl/inlculde/rpl_mixing_engines.test may need splitting into smaller ones. The latter invokes invokes rpl_mixing_engines.inc 370 times. Could you please consider partition the affected tests into smaller ones?

            The tests have been passing for the past month, so there might not be a need anymore to split the tests up.

            angelique.sklavounos Angelique Sklavounos (Inactive) added a comment - The tests have been passing for the past month, so there might not be a need anymore to split the tests up.

            This ticket duplicates MDEV-31790, a more recent and fuller report, as far as test slowness in the MSAN builder goes. However, that item is not otherwise specifically discussed in the report, the title (and MTR) just use the term "time out", but that was the result of a crash.

            There is no real timeout here; the issue is a crashed server. Note "found 'core' (5/5)", "mysqld.2/data/core" and "2002 Can't connect to local server through socket". However, as GDB was not available at the time, no reliable stack is present. Additionally the description says "A failure is shown below, but the output appears to vary random."

            There is nothing further to analyse or fix here (except to ensure GDB is now available on the builder in question vladbogo fyi, to check); closing. If other crashes are observed on any builder, then they should be debugged individually.

            Roel Roel Van de Paar added a comment - This ticket duplicates MDEV-31790 , a more recent and fuller report, as far as test slowness in the MSAN builder goes. However, that item is not otherwise specifically discussed in the report, the title (and MTR) just use the term "time out", but that was the result of a crash. There is no real timeout here; the issue is a crashed server. Note "found 'core' (5/5)", "mysqld.2/data/core" and "2002 Can't connect to local server through socket". However, as GDB was not available at the time, no reliable stack is present. Additionally the description says "A failure is shown below, but the output appears to vary random." There is nothing further to analyse or fix here (except to ensure GDB is now available on the builder in question vladbogo fyi, to check); closing. If other crashes are observed on any builder, then they should be debugged individually.

            Roel, I believe that as a result of a non-default 480-second (6-minute) timeout, the processes were killed externally by sending SIGABRT to them. Stack traces or core dumps produced from such a timeout-related killing could be useful for analyzing the root cause, in case it is a deadlock of threads. In this particular case, based on the findings in MDEV-31790, it might be that these tests may run extremely slowly, with no actual hang.

            marko Marko Mäkelä added a comment - Roel , I believe that as a result of a non-default 480-second (6-minute) timeout, the processes were killed externally by sending SIGABRT to them. Stack traces or core dumps produced from such a timeout-related killing could be useful for analyzing the root cause, in case it is a deadlock of threads. In this particular case, based on the findings in MDEV-31790 , it might be that these tests may run extremely slowly, with no actual hang.
            vladbogo Vlad Bogolin added a comment -

            Roel hopefully this https://github.com/MariaDB/buildbot/pull/148 would solve the gdb issue

            vladbogo Vlad Bogolin added a comment - Roel hopefully this https://github.com/MariaDB/buildbot/pull/148 would solve the gdb issue

            People

              Unassigned Unassigned
              angelique.sklavounos Angelique Sklavounos (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.