[MDEV-30836] MTR hangs after tests have completed Created: 2023-03-13  Updated: 2023-09-11  Resolved: 2023-09-05

Status: Closed
Project: MariaDB Server
Component/s: Scripts & Clients
Affects Version/s: 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0
Fix Version/s: 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Daniel Black
Resolution: Fixed Votes: 0
Labels: affects-tests, hang

Issue Links:
Relates
relates to MDEV-29023 MTR hangs after multiple failures Closed

 Description   

Below is a copy of a comment of mine from MDEV-29023. I did not test anything in the 10.4 series.

Unfortunately, while a test of 10.3 1848804840f5595f982c4cd502ba2112f6dd7911 did not hang, anything else where this was supposedly fixed (including 10.5, 10.8, 10.9) would still hang.

Here is an example of 10.9 f53f64b7b9edaef8e413add322225dc33ebc8131 (the first 10.9 revision that includes the fix):

diff --git a/sql/sql_insert.cc b/sql/sql_insert.cc
index 283dd463fa8..872a1643e5c 100644
--- a/sql/sql_insert.cc
+++ b/sql/sql_insert.cc
@@ -713,6 +713,7 @@ bool mysql_insert(THD *thd, TABLE_LIST *table_list,
   SELECT_LEX *returning= thd->lex->has_returning() ? thd->lex->returning() : 0;
   unsigned char *readbuff= NULL;
 
+  DBUG_ASSERT(opt_bootstrap);
 #ifndef EMBEDDED_LIBRARY
   char *query= thd->query();
   /*

With this patch applied and an invocation of

nice ./mtr --parallel=60 --big-test --force --max-test-fail=0 --skip-core-file

the test would hang, with a few mariadbd processes still active but sleeping. After

killall -QUIT mariadbd

the server processes would terminate, but the mtr test driver would not react to them going away.

I got interested in this because there was a bogus failure on buildbot.mariadb.org that looked like a hang:

bb-10.6-MDEV-26055-MDEV-26827 632e745f906630fc5d2682d837a87275d5ce2868

sysschema.v_innodb_lock_waits 'innodb'   w3 [ pass ]     23
unit.conc_cursor                         w1 [ pass ]      2
innodb.skip_locked_nowait 'innodb'       w11 [ pass ]   3014
unit.conc_performance                    w8 [ pass ]      9
sys_vars.auto_increment_offset_basic     w7 [ pass ]      4
unit.conc_ps                             w12 [ pass ]    604
unit.conc_ps_bugs                        w9 [ pass ]      5
innodb.instant_alter_bugs 'innodb'       w4 [ pass ]  11250
unit.conc_misc                           w6 [ pass ]      7
innodb.log_file_name 'innodb'            w13 [ pass ]  11419
unit.conc_connection                     w5 [ pass ]      3
stress.ddl_myisam                        w10 [ pass ]  31795
innodb.log_file 'innodb'                 w14 [ pass ]  23531
innodb.log_corruption 'innodb'           w2 [ pass ]  39422
 
command timed out: 600 seconds without output running [b'sh', b'-c', b'cd mysql-test && exec perl mysql-test-run.pl --verbose-restart --force --retry=3 --max-save-core=1 --max-save-datadir=1 --max-test-fail=20 --mem --parallel=$(expr 7 \\* 2) '], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1214.928726

In the file mysql-test/var/3/log/mysqld.1.err we do have a shutdown message as well as a message about a test that is not listed in the mtr output log above:

CURRENT_TEST: sysschema.v_innodb_lock_waits
CURRENT_TEST: sysschema.v_schema_table_lock_waits
2023-03-07 16:32:47 0 [Note] /home/buildbot/amd64-rhel-7/build/sql/mariadbd (initiated by: unknown): Normal shutdown
2023-03-07 16:32:47 0 [Note] InnoDB: FTS optimize thread exiting.
2023-03-07 16:32:47 0 [Note] InnoDB: Starting shutdown...
2023-03-07 16:32:47 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/var_auto_zzmE/3/mysqld.1/data/ib_buffer_pool
2023-03-07 16:32:47 0 [Note] InnoDB: Restricted to 126 pages due to innodb_buf_pool_dump_pct=25
2023-03-07 16:32:47 0 [Note] InnoDB: Buffer pool(s) dump completed at 230307 16:32:47
2023-03-07 16:32:48 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-03-07 16:32:48 0 [Note] InnoDB: Shutdown completed; log sequence number 21530679; transaction id 116
2023-03-07 16:32:48 0 [Note] /home/buildbot/amd64-rhel-7/build/sql/mariadbd: Shutdown complete

I do not know if this anomaly is related to the hang, but I am willing to believe so.



 Comments   
Comment by Marko Mäkelä [ 2023-04-03 ]

Another example:
https://buildbot.mariadb.org/#/builders/561/builds/119/steps/8/logs/stdio

10.8 dd2fe81122224c07a65e89b409dc28d70802fe99

innodb.innodb_page_compressed 'innodb,lzma' w8 [ fail ]
        Test ended at 2023-03-30 16:15:53
… skip lines 4239‥8535
innodb_gis.rtree_create_inplace 'innodb' w10 [ pass ]   1739
innodb_gis.rtree_drop_index 'innodb'     w4 [ pass ]   1682
stress.wrapper                           w4 [ skipped ]  Only to be run with mtr --stress
stress.deadlock_drop_table               w9 [ pass ]     64
sys_vars.alter_algorithm_basic           w4 [ pass ]     18
stress.misc                              w10 [ pass ]   3252
innodb.innodb_bug52663 'innodb'          w5 [ pass ]  50025
innodb_gis.rtree_search 'innodb'         w3 [ pass ]   7166
innodb_fts.sync_ddl 'innodb,orig'        w2 [ pass ]  15247
innodb.log_corruption 'innodb'           w1 [ pass ]  44124
stress.ddl_memory                        w6 [ pass ]  159732
stress.ddl_myisam                        w7 [ pass ]  162212
command timed out: 600 seconds without output running [b'bash', b'-xc', b'cd mysql-test && LD_LIBRARY_PATH=/msan-libs MSAN_OPTIONS=abort_on_error=1:poison_in_dtor=0 ./mtr --mem --big-test --force --retry=0 --skip-test=.*compression.* --max-test-fail=100 --parallel=$(expr 5 \\* 2)'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=2771.840262

After the failed test run (it failed because an uninstrumented LZMA library was available during a MemorySanitizer run), you will not see any output of w8 between the failure and the end.

According to the server error log https://ci.mariadb.org/33865/logs/amd64-ubuntu-2204-msan/mysqld.1.err.8 the test versioning.alter was started on that worker, but there was no attempt to shut down that server.

Comment by Marko Mäkelä [ 2023-06-08 ]

I found a reliable way to reproduce this on a Microsoft Windows builder on 10.6 and 10.9. To be exact:

git checkout 223c2c5b9d49e97addd173d7e9cf1e1874150d8c~ # or without the ~
git revert d3eefbaa55edb585e4fbf8f09ad4141c3be900e5

In https://buildbot.mariadb.org/#/builders/234/builds/19006 this would cause the test gcol.gcol_purge to hang. The test name is not displayed by the failure report:

main.status2                             w3 [ pass ]     25
innodb.xa_recovery 'innodb'              w8 [ pass ]   3998
mariabackup.rename_during_backup         w26 [ pass ]   4720
innodb.innodb_bug30423 'innodb'          w2 [ pass ]  26964
main.analyze_stmt_privileges2            w28 [ pass ]    775
command timed out: 600 seconds without output running [b'dojob', b'"', b'"C:\\Program Files (x86)\\Microsoft Visual Studio\\2022\\BuildTools\\Common7\\Tools\\VsDevCmd.bat" -arch=x64 && cd mysql-test && ..\\mysql-test\\collections\\buildbot_suites.bat && cd ..', b'"'], attempting to kill
program finished with exit code 1
elapsedTime=802.108496
main.alter_table_mdev539_my

The last line of this output shows one more problem: it is a truncated line that attempts to mention one more test name (main.alter_table_mdev539_myisam).

Comment by Marko Mäkelä [ 2023-06-08 ]

Here is another example on 11.0, which does not seem to involve any test failure or hang:

wget https://ci.mariadb.org/35706/logs/amd64-ubuntu-2004-debug/var.tar.gz
tar xzf var.tar.gz
for i in mysql-test/var/*/log/mysqld*err; do tail -2 $i|grep -q 'Shutdown complete'||echo $i; done

Because there is no output, all server instances should have terminated by an orderly shutdown. Yet, the stdio output of the tests in https://buildbot.mariadb.org/#/builders/369/builds/10205 end in the following:

innodb.read_only_recover_committed 'innodb' w5 [ pass ]  12557
unit.conc_connection                     w19 [ pass ]     17
innodb.log_file_name 'innodb'            w14 [ pass ]  16052
stress.ddl_myisam                        w10 [ pass ]  31636
innodb.log_file 'innodb'                 w4 [ pass ]  26478
innodb.log_corruption 'innodb'           w9 [ pass ]  47065
command timed out: 600 seconds without output running [b'sh', b'-c', b'\n            cd mysql-test &&\n            exec perl mysql-test-run.pl --verbose-restart --force --retry=3 --max-save-core=1 --max-save-datadir=1 --max-test-fail=20 --mem --parallel=$(expr 14 \\* 2) \n            '], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1407.556692

Comment by Marko Mäkelä [ 2023-06-28 ]

One more occurrence: https://buildbot.mariadb.org/#/builders/221/builds/23335

10.10 135e9766963ab55e9a1716e98f8b9960759049af

main.check                               w10 [ pass ]   2768
main.mrr_icp_extra                       w5 [ pass ]   2454
main.rpl_mysqldump_slave 'stmt'          w9 [ pass ]   1624
main.change_user_notembedded             w11 [ pass ]   5030
command timed out: 600 seconds without output running [b'sh', b'-c', b'\n            cd mysql-test &&\n            exec perl mysql-test-run.pl --verbose-restart --force --retry=3 --max-save-core=1 --max-save-datadir=1 --max-test-fail=20 --mem --parallel=$(expr 7 \\* 2) --suite=main --skip-test="^stack_crash$|^float$|^derived_split_innodb$|^mysql_client_test$|^kill$|^processlist_not_embedded$|^sp-big$"\n            '], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=700.677739

Comment by Marko Mäkelä [ 2023-06-28 ]

Edit: The following is due to an InnoDB test bug MDEV-31570 and likely CI misconfiguration (the "without output" timeout is shorter than the mtr internal timeout). It is unrelated to the problem in the Description.

Here is one failure that actually is due to a hanging test:
https://buildbot.mariadb.org/#/builders/534/builds/7256

unit.conc_connection                     w17 [ pass ]     11
innodb.log_file 'innodb'                 w19 [ pass ]  22324
innodb.log_corruption 'innodb'           w2 [ pass ]  39793
command timed out: 600 seconds without output running [b'sh', b'-c', b'\n            cd mysql-test &&\n            exec perl mysql-test-run.pl --verbose-restart --force --retry=3 --max-save-core=1 --max-save-datadir=1 --max-test-fail=20 --mem --parallel=$(expr 14 \\* 2) --ps-protocol\n            '], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1288.258424

Because vladbogo was quick enough to log in to the worker before it finally timed out in Buildbot, we found out that there was actually one stuck worker, but mtr failed to report it. The worker log file https://ci.mariadb.org/36162/logs/amd64-ubuntu-2204-debug-ps/mysqld.1.err.28 ends in the following:

2023-06-28  8:33:31 0 [Note] /home/buildbot/amd64-ubuntu-2204-debug-ps/build/sql/mariadbd: ready for connections.
Version: '11.0.3-MariaDB-debug-log'  socket: '/home/buildbot/amd64-ubuntu-2204-debug-ps/build/mysql-test/var/tmp/28/mysqld.1.sock'  port: 16420  Source distribution
CURRENT_TEST: innodb.alter_table
CURRENT_TEST: innodb.analyze_table
CURRENT_TEST: innodb.create_select
CURRENT_TEST: innodb.defrag_mdl-9155
CURRENT_TEST: innodb.gap_lock_split
2023-06-28  8:47:34 26 [Warning] Aborted connection 26 to db: 'test' user: 'root' host: 'localhost' (Got an error reading communication packets)
2023-06-28  8:47:34 27 [Warning] Aborted connection 27 to db: 'test' user: 'root' host: 'localhost' (Got an error writing communication packets)

I believe that the default test case timeout is 900 seconds (15 minutes). It might be the case that if a test starts near the end of the test run, the 600-second (10-minute) Buildbot timeout could kick in before mtr would get a chance to kill the process. Could the Buildbot timeout handling be fixed so that it would do multiple things:

  1. killall -ABRT mariadbd
  2. wait a few seconds for any more mtr output, or mtr termination
  3. finally, kill mtr like it currently does
Comment by Aleksey Midenkov [ 2023-08-01 ]

Commenting this out helped me to reproduce the hang:

        # Wait a bit and see if a server died, if so report that instead
        mtr_milli_sleep(100);
        my $srvproc= My::SafeProcess::check_any();
        if ($srvproc && grep($srvproc eq $_, started(all_servers()))) {
          $proc= $srvproc;
          goto SRVDIED;
        }

There is mixed sematics as well:

sub exit_status {
  my $self= shift;
  my $raw= $self->{EXIT_STATUS};
 
  croak("Can't call exit_status before process has died")
    unless defined $raw;
 
  if ($raw & 127)
  {
    # Killed by signal
    my $signal_num=  $raw & 127;
    my $dumped_core= $raw & 128;
    return 1; # Return error code
  }
  else
  {
    # Normal process exit
    return $raw >> 8;
  };
}

exit_status() returns 1 for signal, but normal exit also returns 1. That way MTR doesn't do anything special about signal-killed child.

Comment by Aleksey Midenkov [ 2023-08-01 ]

MTR workers are hanging in run_worker() at 944:

  while (my $line= <$server>){

Comment by Aleksey Midenkov [ 2023-08-10 ]

Please review bb-10.4-midenok

Comment by Marko Mäkelä [ 2023-08-10 ]

Thank you for the fix! For the record, I successfully tested this on 10.6 as follows:

git checkout 0be4781428a4044b13b085965820a995652bb0e9
git merge --no-commit 4ada8f384b8a54b1a37262a5a821ccdc4474cf55
git checkout HEAD mysql-test/suite/versioning storage/innobase
# patch sql/sql_insert.cc to inject the fault
# compile, and test as
cd mysql-test
./mtr --parallel=60 --big-test --force --max-test-fail=0 --skip-core-file --skip-stack-trace

Spent 1335.877 of 234 seconds executing testcases
 
Completed: Failed 3942/5939 tests, 33.63% were successful.

There are some additional (uncommitted) tests in my working directory. The main point is that mtr no longer hangs at the end, despite the massive amount of test failures.

Comment by Daniel Black [ 2023-09-05 ]

Thank you very much!

Did one minor change in Cygwin for the perl version on centos 7

Generated at Thu Feb 08 10:19:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.