[MDEV-11370] Improvents in MTR crash or deadlock reporting Created: 2016-11-28  Updated: 2019-11-12

Status: Open
Project: MariaDB Server
Component/s: Tests
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Elena Stepanova Assignee: Vicențiu Ciorbaru
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-11499 mysqltest, Windows : improve diagnost... Closed

 Description   

Make MTR kill server with SIGABRT when the server startup timeout is exceeded

Now when the server does not start in reasonable time, the only diagnostics that we get from buildbot is

Failed to start mysqld.3

and the server error log. When server hangs on startup, we would want to have a stack trace.

Check if it's possible to kill the server with SIGABRT when the result is 'Failed to start'. If the server is already down, it won't hurt, but if it hangs, it can give more information.

Make MTR kill server with SIGABRT when the server shutdown timeout is exceeded and print the stack trace

When server shutdown is exceeded, MTR silently SIGKILLs the server and proceeds. It hides important failures on server shutdown. Instead, it should report a hang and get a stack trace. On the other hand, it should be differentiated from intentional server killing, when the shutdown timeout is initially zero – in that case, it should be SIGKILL-ed without a stack trace.



 Comments   
Comment by Marko Mäkelä [ 2019-11-12 ]

Starting with MariaDB Server 10.3.1, shutdown_server will kill -ABRT upon exceeding the timeout (60 seconds by default).

However, the kill -ABRT will not always produce a core dump, and mysqltest does not always seem to be interested in extracting the stack traces from the core dump. Here is a recent example of an apparent shutdown hang, possibly caused by MDEV-14486. On kvm-asan, core dumps are disabled by default due to their sheer size:

10.4 0308de94ee806c21b6776ecab73396da75282596

CURRENT_TEST: innodb.ibuf_not_empty
2019-11-11 10:17:51 0 [Note] /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld: ready for connections.
Version: '10.4.11-MariaDB-debug-log'  socket: '/dev/shm/var/tmp/1/mysqld.1.sock'  port: 16000  Source distribution
2019-11-11 10:17:52 0 [Note] /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld (initiated by: root[root] @ localhost []): Normal shutdown
2019-11-11 10:17:52 0 [Note] Event Scheduler: Purging the queue. 0 events
2019-11-11 10:17:52 0 [Note] InnoDB: FTS optimize thread exiting.
2019-11-11 10:17:52 0 [Note] InnoDB: Starting shutdown...
2019-11-11 10:17:52 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/var/1/mysqld.1/data/ib_buffer_pool
2019-11-11 10:17:52 0 [Note] InnoDB: Instance 0, restricted to 123 pages due to innodb_buf_pool_dump_pct=25
2019-11-11 10:17:52 0 [Note] InnoDB: Buffer pool(s) dump completed at 191111 10:17:52
191111 10:18:52 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.11-MariaDB-debug-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=0
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63622 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x5fc00
/usr/lib/x86_64-linux-gnu/libasan.so.2(+0x4a077)[0x7f5e580f5077]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(my_print_stacktrace+0xce)[0x2328dd3]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(handle_fatal_signal+0x87a)[0x11377b4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f5e57220390]
/lib/x86_64-linux-gnu/libpthread.so.0(__nanosleep+0x2d)[0x7f5e5721fc1d]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x1b42488]
maria/ha_maria.cc:2773(ha_maria::external_lock(THD*, int))[0x1ae0635]
csv/ha_tina.cc:399(ha_tina::init_tina_writer())[0x1d342ed]
sql/spatial.cc:2823(Gis_multi_polygon::area(double*, char const**) const)[0x19559b8]
sql/sql_connect.cc:1351(do_handle_one_connection(CONNECT*))[0x11400c7]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x9f0371]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x9f0fa7]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(_Z15plugin_shutdownv+0x1a5)[0x9f5373]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x6fe7ca]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(_Z11mysqld_mainiPPc+0x173d)[0x70c25a]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(main+0x20)[0x6f5606]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f5e55d14830]
/home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(_start+0x29)[0x6f5519]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /dev/shm/var/1/mysqld.1/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    0                    bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             23715                23715                processes 
Max open files            1024                 1024                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       23715                23715                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: |/usr/share/apport/apport %p %s %c %P
 
2019-11-11 10:18:55 0 [Note] /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld (mysqld 10.4.11-MariaDB-debug-log) starting as process 21192 ...

Note: there were exactly 60 seconds between the two messages near the start of the excerpt, so we can be fairly sure that the SIGABRT was sent externally. Also worth noting (and worth nothing) is the completely bogus stack trace.

I think that before attempting to deliver the SIGABRT, it would be useful to attach gdb to the process and issue thread apply all backtrace. In that way, we will get some information even if the core dump is disabled. (There should be some error handling, of course. Attaching the debugger would not work if another debugger is already attached, or might not produce desired results if the server is running under Valgrind, for example.)

Furthermore, it looks like that server crashes on re-bootstrap are being mostly ignored (only resulting in some messages in the stdout.log, not leading to test failures).

Finally, I suspect that we are not always trying to extract stack traces when the server crashes on restart, either between tests or inside a test.

Last, a wishlist item: If very many tests fail or the server crashes very probably during startup (which could be the case while developing some change), then it is likely that ./mtr --max-test-fail=0 will be aborted abruptly.

Generated at Thu Feb 08 07:49:26 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.