Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-8859

rpl.rpl_mdev382 sporadically fails to finish due to disappeared expect file

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 5.5(EOL)
    • 5.5.50
    • Tests
    • None
    • 5.5.50

    Description

      http://buildbot.askmonty.org/buildbot/builders/kvm-deb-precise-x86/builds/2982/steps/test_4/logs/mysqld.1.err.3
      http://buildbot.askmonty.org/buildbot/builders/kvm-deb-precise-x86/builds/2982/steps/test_4/logs/stdio

      In the test output, it looks like this:

      rpl.rpl_loaddata 'stmt,xtradb'           w3 [ pass ]    901
      worker[3] mysql-test-run: *** ERROR: can't open file "/run/shm/var/3/tmp/mysqld.1.expect": No such file or directory

      After that, the worker does not do anything anymore, and at the end MTR complains

      mysql-test-run: *** ERROR: Not all tests completed (only 3093 of 3094)

      We know it's rpl.rpl_mdev382 because it's the last test in the corresponding server error log:

      Version: '5.5.46-MariaDB-1~precise-log'  socket: '/run/shm/var/tmp/3/mysqld.1.sock'  port: 16040  mariadb.org binary distribution
      CURRENT_TEST: rpl.rpl_mdev382
      150927 19:39:14 [Note] /usr/sbin/mysqld: Normal shutdown
       
      150927 19:39:14 [Note] Event Scheduler: Purging the queue. 0 events
      150927 19:39:14  InnoDB: Starting shutdown...
      150927 19:39:15  InnoDB: Shutdown completed; log sequence number 1621017
      150927 19:39:15 [Note] /usr/sbin/mysqld: Shutdown complete

      The test only does restart once, and does it seemingly accurate: it writes "wait", then shuts down the server, then removes the file, then writes "restart".
      And removing the file fails claiming that it does not exist.
      I could not reproduce it, but it's not the first time I see it (not so easy to find since it won't show up in the cross-reference report).

      For now, I don't see why it's happening.

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova added a comment - - edited Relatively fresh occasion on 5.5-galera (Feb 17, 2016): http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-amd64/builds/2356/steps/test/logs/mysqld.1.err.1 http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-centos5-amd64/builds/2356/steps/test/logs/stdio So, it's not specific either to 32-bit platform or the particular builder.

            There are 5 occasions on labrador, and it does not even run MTR with parallel workers. So, the theory that a parallel test somehow interferes can be safely abandoned.

            http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5554 (bb-10.0-serg)
            http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5566
            http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5567
            http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5582
            http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5818

            All but one are on 5.5 trees. It starts getting strange – given that 10.x have many more pushes and test runs, it should hit the problem more frequently.

            elenst Elena Stepanova added a comment - There are 5 occasions on labrador, and it does not even run MTR with parallel workers. So, the theory that a parallel test somehow interferes can be safely abandoned. http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5554 (bb-10.0-serg) http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5566 http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5567 http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5582 http://buildbot.askmonty.org/buildbot/builders/labrador/builds/5818 All but one are on 5.5 trees. It starts getting strange – given that 10.x have many more pushes and test runs, it should hit the problem more frequently.
            elenst Elena Stepanova added a comment - Occasions on p8: http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar/builds/304 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar/builds/586 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/1036 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/109 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/1164 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/140 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/196 http://buildbot.askmonty.org/buildbot/builders/p8-rhel7-bintar-debug/builds/820 (maybe there are more, didn't check vlp04-06).

            Reproduced locally on 5.5.44 release build by running perl ./mtr rpl.rpl_mdev382 --repeat=100

            elenst Elena Stepanova added a comment - Reproduced locally on 5.5.44 release build by running perl ./mtr rpl.rpl_mdev382 --repeat=100

            The failure is very easily and reliably reproducible with this change:

            diff --git a/mysql-test/suite/rpl/t/rpl_mdev382.test b/mysql-test/suite/rpl/t/rpl_mdev382.test
            index 21925e0..c26e9e8 100644
            --- a/mysql-test/suite/rpl/t/rpl_mdev382.test
            +++ b/mysql-test/suite/rpl/t/rpl_mdev382.test
            @@ -210,6 +210,7 @@ EOF
             --shutdown_server 30
             
             --remove_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
            +--sleep 1
             --write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
             restart-rpl_mdev382.test
             EOF
            

            The problem is that after finding mysqld.1.expect file containing wait, mysql-test-run.pl keeps reading it in a loop. So, there is a short window between remove and write when the file does not exist, if MTR happens to attempt to read at this moment, the failure occurs.

            The fix is to use append_file instead.
            It has already been fixed in 10.x, hidden among other fixes in this commit:

            commit 6b97512b2170fc6a77288cef6e5a85cad773ae4e
            Author: unknown <knielsen@knielsen-hq.org>
            Date:   Wed Apr 24 13:05:40 2013 +0200
             
                Add missing check for thd->killed in mysql_binlog_send().
                
                The slave dump thread running on the master only checked thd->killed whenever
                it reached the end of a binlog file, not between events. This could
                unnecessarily delay server shutdown.
                
                This was found by code inspection while tracking down some occasional "forcing
                close of thread..." errors in Buildbot. Hopefully this will fix the failures,
                but the fix is correct in any case.
                
                Also increase the wait during server shutdown, 2 seconds is a bit tight in
                case of heavy I/O stall, and it seems better to delay shutdown a bit than
                force-kill threads unnecessarily.
                
                Also fix some races in test cases that restart the mysqld server. The .expect
                file should be changed with --append_file, --remove_file + --write_file
                creates a short window where mysqld can error out due to .expect file missing.
            

            So it's a 5.5-only fix.

            elenst Elena Stepanova added a comment - The failure is very easily and reliably reproducible with this change: diff --git a/mysql-test/suite/rpl/t/rpl_mdev382.test b/mysql-test/suite/rpl/t/rpl_mdev382.test index 21925e0..c26e9e8 100644 --- a/mysql-test/suite/rpl/t/rpl_mdev382.test +++ b/mysql-test/suite/rpl/t/rpl_mdev382.test @@ -210,6 +210,7 @@ EOF --shutdown_server 30 --remove_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect +--sleep 1 --write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect restart-rpl_mdev382.test EOF The problem is that after finding mysqld.1.expect file containing wait , mysql-test-run.pl keeps reading it in a loop. So, there is a short window between remove and write when the file does not exist, if MTR happens to attempt to read at this moment, the failure occurs. The fix is to use append_file instead. It has already been fixed in 10.x, hidden among other fixes in this commit: commit 6b97512b2170fc6a77288cef6e5a85cad773ae4e Author: unknown <knielsen@knielsen-hq.org> Date: Wed Apr 24 13:05:40 2013 +0200   Add missing check for thd->killed in mysql_binlog_send(). The slave dump thread running on the master only checked thd->killed whenever it reached the end of a binlog file, not between events. This could unnecessarily delay server shutdown. This was found by code inspection while tracking down some occasional "forcing close of thread..." errors in Buildbot. Hopefully this will fix the failures, but the fix is correct in any case. Also increase the wait during server shutdown, 2 seconds is a bit tight in case of heavy I/O stall, and it seems better to delay shutdown a bit than force-kill threads unnecessarily. Also fix some races in test cases that restart the mysqld server. The .expect file should be changed with --append_file, --remove_file + --write_file creates a short window where mysqld can error out due to .expect file missing. So it's a 5.5-only fix.
            elenst Elena Stepanova added a comment - https://github.com/MariaDB/server/commit/67b4a6f576db102d652b401c3d0150556826f641

            People

              elenst Elena Stepanova
              elenst Elena Stepanova
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.