[MDEV-30836] MTR hangs after tests have completed Created: 2023-03-13 Updated: 2023-09-11 Resolved: 2023-09-05 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Scripts & Clients |
| Affects Version/s: | 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0 |
| Fix Version/s: | 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4, 11.1.3, 11.2.2 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Marko Mäkelä | Assignee: | Daniel Black |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | affects-tests, hang | ||
| Description |
|
Below is a copy of a comment of mine from Unfortunately, while a test of 10.3 1848804840f5595f982c4cd502ba2112f6dd7911 did not hang, anything else where this was supposedly fixed (including 10.5, 10.8, 10.9) would still hang. Here is an example of 10.9 f53f64b7b9edaef8e413add322225dc33ebc8131 (the first 10.9 revision that includes the fix):
With this patch applied and an invocation of
the test would hang, with a few mariadbd processes still active but sleeping. After
the server processes would terminate, but the mtr test driver would not react to them going away. I got interested in this because there was a bogus failure on buildbot.mariadb.org that looked like a hang:
In the file mysql-test/var/3/log/mysqld.1.err we do have a shutdown message as well as a message about a test that is not listed in the mtr output log above:
I do not know if this anomaly is related to the hang, but I am willing to believe so. |
| Comments |
| Comment by Marko Mäkelä [ 2023-04-03 ] | |||||||||||||||||||||||||||
|
Another example:
After the failed test run (it failed because an uninstrumented LZMA library was available during a MemorySanitizer run), you will not see any output of w8 between the failure and the end. According to the server error log https://ci.mariadb.org/33865/logs/amd64-ubuntu-2204-msan/mysqld.1.err.8 the test versioning.alter was started on that worker, but there was no attempt to shut down that server. | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-06-08 ] | |||||||||||||||||||||||||||
|
I found a reliable way to reproduce this on a Microsoft Windows builder on 10.6 and 10.9. To be exact:
In https://buildbot.mariadb.org/#/builders/234/builds/19006 this would cause the test gcol.gcol_purge to hang. The test name is not displayed by the failure report:
The last line of this output shows one more problem: it is a truncated line that attempts to mention one more test name (main.alter_table_mdev539_myisam). | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-06-08 ] | |||||||||||||||||||||||||||
|
Here is another example on 11.0, which does not seem to involve any test failure or hang:
Because there is no output, all server instances should have terminated by an orderly shutdown. Yet, the stdio output of the tests in https://buildbot.mariadb.org/#/builders/369/builds/10205 ends in the following:
| |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-06-28 ] | |||||||||||||||||||||||||||
|
One more occurrence: https://buildbot.mariadb.org/#/builders/221/builds/23335
| |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-06-28 ] | |||||||||||||||||||||||||||
|
Edit: The following is due to an InnoDB test bug. Here is one failure that actually is due to a hanging test:
Because vladbogo was quick enough to log in to the worker before it finally timed out in Buildbot, we found out that there was actually one stuck worker, but mtr failed to report it. The worker log file https://ci.mariadb.org/36162/logs/amd64-ubuntu-2204-debug-ps/mysqld.1.err.28 ends in the following:
I believe that the default test case timeout is 900 seconds (15 minutes). It might be the case that if a test starts near the end of the test run, the 600-second (10-minute) Buildbot timeout could kick in before mtr would get a chance to kill the process. Could the Buildbot timeout handling be fixed so that it would do multiple things:
| |||||||||||||||||||||||||||
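The timing concern raised in the comment above can be made concrete with a small calculation. This is only a sketch under stated assumptions: the 900-second and 600-second values are the ones quoted in the comment, the idea that Buildbot counts its timeout from the last output line it saw is an assumption of this model, and the function name `buildbot_fires_first` is hypothetical.

```python
TESTCASE_TIMEOUT = 900  # seconds; default mtr test case timeout as quoted in the comment
BUILDBOT_TIMEOUT = 600  # seconds; Buildbot timeout as quoted in the comment


def buildbot_fires_first(hang_start: float, last_output: float) -> bool:
    """Return True if Buildbot would kill the run before mtr reports the hang.

    hang_start:  time (s) at which the hanging test started
    last_output: time (s) of the last output line from any other worker
    Assumption: Buildbot counts its timeout from the last output it saw.
    """
    mtr_kill_time = hang_start + TESTCASE_TIMEOUT
    buildbot_kill_time = max(hang_start, last_output) + BUILDBOT_TIMEOUT
    return buildbot_kill_time < mtr_kill_time


if __name__ == "__main__":
    # The hanging test starts at t=1000 s; the other workers finish 100 s later.
    print(buildbot_fires_first(1000, 1100))
    # The other workers keep printing for 400 s after the hanging test started.
    print(buildbot_fires_first(1000, 1400))
```

Under this model, Buildbot wins the race whenever the other workers stop printing less than 300 seconds after the hanging test started, which matches the scenario of a test that starts near the end of the run.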
| Comment by Aleksey Midenkov [ 2023-08-01 ] | |||||||||||||||||||||||||||
|
Commenting this out helped me to reproduce the hang:
There are mixed semantics as well:
exit_status() returns 1 when the child is killed by a signal, but a normal exit can also return 1. As a result, MTR does not do anything special about a signal-killed child. | |||||||||||||||||||||||||||
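The ambiguity described in this comment can be illustrated outside of MTR. The following is a minimal Python sketch, not MTR's Perl code; the helper names `describe_returncode` and `collapsed_status` are hypothetical. It shows that a child killed by a signal and a child that exits normally with status 1 are distinguishable at the process level, and that collapsing both cases to 1 loses the distinction.

```python
import subprocess
import sys


def describe_returncode(returncode: int) -> str:
    """Classify a subprocess return code (hypothetical helper, not part of MTR).

    Python's subprocess module reports a child killed by signal N as -N,
    so a normal exit with status 1 and death by a signal stay distinguishable.
    """
    if returncode < 0:
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"


def collapsed_status(returncode: int) -> int:
    """Mimic the mixed semantics described above: any signal is reported as 1."""
    return 1 if returncode < 0 else returncode


if __name__ == "__main__":
    # A child that exits normally with status 1.
    normal = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])
    # A child that kills itself with SIGKILL (meaningful on POSIX systems only).
    killed = subprocess.run(
        [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"]
    )
    for proc in (normal, killed):
        print(describe_returncode(proc.returncode), "->", collapsed_status(proc.returncode))
```

Once both outcomes are collapsed to the same value, a caller has no way to react differently to a signal-killed child, which is the situation the comment describes.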
| Comment by Aleksey Midenkov [ 2023-08-01 ] | |||||||||||||||||||||||||||
|
MTR workers are hanging in run_worker() at line 944:
| |||||||||||||||||||||||||||
| Comment by Aleksey Midenkov [ 2023-08-10 ] | |||||||||||||||||||||||||||
|
Please review bb-10.4-midenok | |||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2023-08-10 ] | |||||||||||||||||||||||||||
|
Thank you for the fix! For the record, I successfully tested this on 10.6 as follows:
There are some additional (uncommitted) tests in my working directory. The main point is that mtr no longer hangs at the end, despite the massive number of test failures. | |||||||||||||||||||||||||||
| Comment by Daniel Black [ 2023-09-05 ] | |||||||||||||||||||||||||||
|
Thank you very much! I made one minor change in Cygwin for the Perl version on CentOS 7. |