[MDEV-11499] mysqltest, Windows : improve diagnostics if server fails to shutdown Created: 2016-12-07  Updated: 2021-09-24  Resolved: 2021-09-24

Status: Closed
Project: MariaDB Server
Component/s: Platform Windows, Tests
Fix Version/s: 10.3.32, 10.4.22, 10.5.13, 10.6.5

Type: Task Priority: Major
Reporter: Vladislav Vaintroub Assignee: Vladislav Vaintroub
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-26325 mysqld process hangs when stopped aft... Closed
relates to MDEV-4784 merge test cases from 5.6 Stalled
relates to MDEV-9843 InnoDB hangs on startup between "Inno... Closed
relates to MDEV-11370 Improvents in MTR crash or deadlock r... Open
relates to MDEV-14080 InnoDB shutdown sometimes hangs Closed

 Description   

MySQL commit 74726c59b7a650b948929b2839e83e26f9853d4d
b3e655fc0c6ab75e4037409e1783122eddcc736e
and this one: b3198c1c9ab62ea78a95909082b8cb32ba9d5f28



 Comments   
Comment by Marko Mäkelä [ 2017-01-27 ]

Normally, a forced kill of InnoDB should be harmless, because crash recovery will take care of it. However, innodb_read_only (introduced in MySQL 5.6 and MariaDB 10.0) prevents crash recovery, and the forced kills on shutdown_server timeout will be visible as failures of tests that use the innodb_read_only option. Now that MDEV-11814 is fixed, the symptoms should be more uniform: access to any InnoDB tables will be denied after the forced kill + read-only-restart.

Before MDEV-11623 introduced innodb.101_compatibility into 10.1.21, there was only one test that used the innodb_read_only option: innodb.innodb-get-fk. That test has been failing in the past due to this problem.

To be able to find the root cause of the shutdown hangs, we must first fix the shutdown_server statement in mtr by applying the MySQL commits mentioned in the description. We should also make sure that a core dump is produced when the shutdown hangs, and that mtr dumps the stack traces from the core dump, so that the buildbot logs give some hints as to why the shutdown is hanging.

I think that this work should initially target 10.1 and ultimately be ported to 10.0 if possible.

Comment by Elena Stepanova [ 2017-01-27 ]

I have some scattered notes from my previous short brush with this problem. There aren't really any useful suggestions in there, so you can safely ignore them, but maybe they'll make you consider some things that you'd otherwise miss at first.

Initially the main goal was to catch problems on server startup and shutdown which happen outside a test, that is, when MTR tries to start the server before a test, or when it discards a server which is no longer needed.

The problem that marko mentions in the previous comment is a bit different. The mysqltest command shutdown_server only affects server restarts which happen inside a test, when the test uses it either directly or through a chain of include files. This should be easy to fix with a one-liner; I'll give it a try. You can later redo it by merging the MySQL 5.7 changes in mysqltest.cc; that merge seemed very intrusive when I looked at it.

But it's not a solution to the general problem. For the usual startup and shutdown, MTR doesn't use the shutdown_server command; it's MTR's own doing, in safe_process.cc and such. In a short attempt, I couldn't get it done reliably: it would kill either too little or too much. It must be doable; I guess I just didn't dig up the right place.

Please remember that we can't rely on SIGABRT always terminating the server; the server is known to hang even then. There must be a SIGKILL after another timeout, either unconditionally or upon a check that the process still exists.
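The escalation described above can be sketched as follows. This is a minimal POSIX-only illustration in Python, not the code in safe_process.cc or mysql-test-run.pl; the function name `stop_with_escalation` and the `abort_timeout` parameter are hypothetical.

```python
import signal
import subprocess


def stop_with_escalation(proc, abort_timeout=5.0):
    """Send SIGABRT, wait up to abort_timeout seconds, then fall back to SIGKILL.

    `proc` is a subprocess.Popen object. The return value tells the caller
    which step actually ended the process, so that it can be reported.
    """
    if proc.poll() is not None:
        return "already_exited"

    # First attempt: SIGABRT, which normally yields a core dump.
    proc.send_signal(signal.SIGABRT)
    try:
        proc.wait(timeout=abort_timeout)
        return "aborted"
    except subprocess.TimeoutExpired:
        # The process ignored or got stuck handling SIGABRT: force it down.
        proc.kill()  # SIGKILL on POSIX
        proc.wait()
        return "killed"
```

Returning the outcome rather than just killing silently keeps the "SIGABRT did not work" case visible to whatever does the reporting.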

The actual challenge is reporting, as just SIGABRT-ing the server doesn't help if we don't get a stack trace out of it.

When it happens on server startup, it should be somewhat easier, and possibly it will even work automatically if you solve the problem of killing the hanging server (and abort it instead). When a crash happens on startup, we are about to enter a test, so MTR is somewhere in check-testcase and is already able to report a failure and produce a stack trace if there is a coredump. Maybe it's not reliable and needs to be fixed, but at least there is a mechanism for that.

But the shutdown (server restart between tests) is a problem. There is no mechanism for processing it properly; everything needs to be added. MTR discards the server, so it doesn't care what's happening to it. It only checks for "warnings generated in error logs during shutdown" (it's the exact line from mysql-test-run.pl if you need to find the place). Somewhere around it, but not in the elsif itself of course, a search for coredumps and calls to My::CoreDump->show probably need to be added. It's also important to honor opt_max_save_core there if it's set, because otherwise a bad build can cause a big problem with disk space.
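A rough sketch of that idea follows, in Python rather than MTR's Perl. It assumes that core files match `core*` under the server's var directory and that a limit of 0 means "no limit"; the function `process_shutdown_cores` and the `show_trace` callback (standing in for My::CoreDump->show) are hypothetical names.

```python
from pathlib import Path


def process_shutdown_cores(vardir, saved_so_far, max_save_core, show_trace):
    """Report every core found after shutdown, but keep at most max_save_core.

    Returns the updated number of saved cores. A stack trace is requested
    for each core before any decision to delete it, so no information is
    lost even when the file itself is removed to save disk space.
    """
    kept = saved_so_far
    for core in sorted(Path(vardir).rglob("core*")):
        if not core.is_file():
            continue
        show_trace(core)
        if max_save_core > 0 and kept >= max_save_core:
            core.unlink()  # limit reached: trace is printed, file is dropped
        else:
            kept += 1
    return kept
```

The point of the ordering is that the limit only affects what stays on disk, not whether the stack trace gets reported.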

There is no mechanism for attaching a hang/crash on shutdown to any specific test; they can only be reported the same way as "warnings during shutdown" are reported now, after a chain of tests. The problem is that they are very often overlooked: when there are any test failures in the output, people usually look only at those. It would be nice to make them somewhat more visible; ideally, maybe, to add a separate "server_startup_shutdown_report" pseudo-test, much like we have valgrind_report when valgrind is enabled. It might be complicated though, and can definitely wait.

On a somewhat different (yet related) note, I might have spotted another reason why a stack trace isn't printed even when it should be. The theory remained unchecked, so unless I get to it first, you might want to look at it.
It appears that coredumps are searched for, and stack traces taken, only if this condition is not met:

            if ($opt_max_save_datadir > 0 &&
                $num_saved_datadir >= $opt_max_save_datadir)

It shouldn't be so: it's one thing not to save a datadir and quite another not to check for useful stuff before removing it. And we do run with max-save-datadir=1 on some builders, so it might be a problem.
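The suggested decoupling might look roughly like this. It is a Python sketch of the control flow only, not of mysql-test-run.pl itself; `after_failure` and all four callbacks are hypothetical stand-ins for whatever MTR actually does at that point.

```python
def after_failure(num_saved_datadir, max_save_datadir,
                  find_cores, show_trace, save_datadir, remove_datadir):
    """Always look for coredumps first, then decide what to do with the datadir.

    Returns the updated count of saved datadirs.
    """
    # Step 1: unconditional. The save limit must not suppress stack traces.
    for core in find_cores():
        show_trace(core)

    # Step 2: the limit only controls whether the datadir is kept.
    if max_save_datadir > 0 and num_saved_datadir >= max_save_datadir:
        remove_datadir()
        return num_saved_datadir
    save_datadir()
    return num_saved_datadir + 1
```

With the limit already reached (for example max-save-datadir=1 and one datadir saved), the traces are still shown and only the datadir is discarded.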

Generated at Thu Feb 08 07:50:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.