[MDEV-8004] key_buffer related crashes in MyISAM table check, stacktrace in error log truncated Created: 2015-04-16  Updated: 2015-12-01  Resolved: 2015-12-01

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - Aria
Affects Version/s: 10.0.16
Fix Version/s: 10.0.21, 10.1.7

Type: Bug Priority: Critical
Reporter: Hartmut Holzgraefe Assignee: Michael Widenius
Resolution: Fixed Votes: 2
Labels: None
Environment:

CentOS 6.5 - 64bit


Issue Links:
Relates
relates to MDEV-8325 Server crash in pagecache_fwrite afte... Confirmed

 Description   

There have been several cases now where MariaDB 10.0.16 crashed with a stack trace that only shows signal/crash handling stack frames, starting with a pthread_cond_signal() call, e.g.:

  Thread pointer: 0x0x7f482702d008
  Attempting backtrace. You can use the following information to find out
  where mysqld died. If you see no messages after this, something went
  terribly wrong...
  stack_bottom = 0x7f90cb141840 thread_stack 0x48000
  /usr/sbin/mysqld(my_print_stacktrace+0x2b)[0xb73d3b]
  /usr/sbin/mysqld(handle_fatal_signal+0x398)[0x726518]
  /lib64/libpthread.so.0(+0xf710)[0x7f90cae0a710]
  /lib64/libpthread.so.0(pthread_cond_signal+0x44)[0x7f90cae06e34]
 
  Trying to get some variables.
  Some pointers may be invalid and cause the dump to abort.
  Query (0x7f48306c5015): is an invalid pointer
  Connection ID (thread ID): 5
  Status: NOT_KILLED
 

So there is no information whatsoever that would hint at what actually led up to this crash (and nothing suspicious in the error log leading up to it either). The only suspicious pattern I could identify so far is that the Connection ID is usually a very low one (so far seen: 5 twice, 7 once, and one very high number in the 60 000 range ...), so things may be related to replication threads.

Any idea how to get a more useful stack trace, preferably without having to enable core dumps (as this happens on systems with huge memory / process sizes)?



 Comments   
Comment by Sergei Golubchik [ 2015-06-01 ]

To answer the last question: you can get stack traces with gdb. It doesn't require a core dump. Run something like

gdb mysqld `pgrep mysqld` -ex c -ex 'thread apply all bt' -ex c -ex quit

It will attach to the running mysqld, let it continue running until it crashes, collect stack traces for all threads, and let it exit.

That's tested and works (not on mysqld itself, but on a small multithreaded a.out with a SIGSEGV handler).

I've no idea what gdb does to mysqld performance.

Comment by Simon J Mudd [ 2015-06-11 ]

Sergei,

I missed this completely. You're missing the point: the point is not to get a stack trace of a running server, but to get a stack trace of the thread when mysqld crashes.

Current 10.0.16 (I have not tried later versions) does not provide a complete stack trace of the activity of the thread which generates an exception. So the context / cause of the crash is not visible to the user, nor can it be passed back to the developers so they can track down why it happened.

This works in MySQL and MariaDB 5.5, but from what I can see it seems to have stopped working in 10.0 (or at least in 10.0.1X this seems to be the case). See the output above.

If something goes wrong with mysqld and it crashes, this information is invaluable. Given how complex mysqld is, having it crash with no clue as to the cause is a problem for the user of the software, as it makes it much harder to get the problem fixed.

So please, if you can, ensure that when mysqld crashes the full stack trace of the "broken" thread is shown.

I hope this clarifies the situation.

Comment by Sergei Golubchik [ 2015-06-11 ]

simon.mudd, but that's precisely what I suggested. That command is to get a stack trace when mysqld crashes.

Comment by Hartmut Holzgraefe [ 2015-06-18 ]

Instead of

  gdb mysqld `pgrep mysqld` ...

it has to be

  gdb mysqld `pgrep -x mysqld` ...

to get an exact pgrep match; otherwise it's likely to pick up the mysqld_safe wrapper script's process instead of the actual mysqld, as the wrapper script is likely to have a lower PID than the server process.

Now I can indeed see a gdb stack trace when forcing a "kill -11" on mysqld.

I also did some simple performance tests with sysbench on my home machine and couldn't see any significant difference between runs with or without gdb attached.

So this gdb trick looks like an interesting approach indeed.

Comment by Hartmut Holzgraefe [ 2015-06-18 ]

Current 10.0.16 (not tried later versions) does not provide a complete stack trace of the activity of the thread which generates an exception.

In general it does, I get a stack trace in the error log just fine when forcing one with "killall -11 mysqld", with both self compiled binaries and MariaDB RPMs.

I also tried to forcefully overwrite the stack with zeros and then force a crash via a NULL pointer assignment in a patched mysqld, but I either got the full stack trace, or traces that contained at least one mangled all-zero symbol entry after the "pthread_cond_signal" line ...

So it looks as if there's a bug in 10.0.16 (or 10.0.x in general) that mangles the stack in a very interesting way, making it look valid to the stack trace resolver, as if pthread_cond_signal() were indeed the bottom of the call stack ...?

Comment by Hartmut Holzgraefe [ 2015-06-18 ]

PS: maybe it should be "thread apply all bt full" right away to get more detailed gdb output? Or maybe a combination of

... -ex 'thread apply all bt' -ex 'thread apply all bt full' ...

for both the simple and the full format (untested, but should work?)
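Folding the exact-match pgrep fix and both backtrace formats into Sergei's original invocation would look roughly like this (untested, as noted above; the binary path /usr/sbin/mysqld is the one from the crash report and may differ per installation):

```shell
# Attach to the running server (pgrep -x avoids matching mysqld_safe),
# let it run until it crashes, dump both trace formats, then let it exit.
gdb /usr/sbin/mysqld "$(pgrep -x mysqld)" \
    -ex 'continue' \
    -ex 'thread apply all bt' \
    -ex 'thread apply all bt full' \
    -ex 'continue' \
    -ex 'quit'
```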

Comment by Simon J Mudd [ 2015-06-19 ]

I think you're missing something rather important.

The request is to ensure that the stack trace MariaDB prints when it crashes works after the server has been started up normally.
Currently it does not. Production servers can't realistically all be started up the way Sergei suggested above. They are started via the normal startup scripts, and at least up to MariaDB 5.5, and for example in MySQL 5.6 and 5.7, a stack trace of the broken thread is provided if mysqld crashes. This is not happening now with 10.0.

Comment by Sergei Golubchik [ 2015-06-19 ]

Okay, I see. The complaint here is not that MariaDB crashes, but that there is no stack trace.

I'll reopen it.

Comment by Sergei Golubchik [ 2015-06-19 ]

Maybe the problem here is not the stack tracing code itself, but that the bug corrupting the stack in this specific way is only possible in 10.0 (meaning 5.5 and MySQL 5.6/5.7 would also have printed no stack trace for such a stack corruption).

If this is the case, the real stack trace (with gdb, as above) could help us to locate the bug.

Comment by Sergei Golubchik [ 2015-06-19 ]

MDEV-8325 — another issue with a crash without a stack trace.

Comment by Hartmut Holzgraefe [ 2015-07-24 ]

Two crashes with the same stack trace have now been reported; both happened after MariaDB 10.0.19 was restarted following a hard kill.

Core dumps were available, and unmangled stack traces could be extracted.

These two crashes happened during MyISAM table repair (triggered by the auto-repair-on-open feature) while accessing the MyISAM key buffer cache. key_buffer_size in these cases was 50G on systems with a total of 384GB RAM. Similar previous MariaDB 5.5 instances did not crash in this way at all, so this may be due to key buffer related changes in 10.0.x?

The last function call before pthread_cond_signal() is simple_key_cache_read(). I haven't found any direct call to pthread_cond_signal() or the mysql_cond_signal() convenience macro in there yet. There is probably at least one extra macro or inline function layer in between that doesn't show in the stack trace?

#0  0x00007ffc3de858ac in pthread_kill () from /lib64/libpthread.so.0
#1  0x000000000072b342 in handle_fatal_signal ()
#2  <signal handler called>
#3  0x00007ffc3de84e34 in pthread_cond_signal@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#4  0x0000000000b6a8b3 in simple_key_cache_read ()
#5  0x0000000000b2f975 in _mi_fetch_keypage ()
#6  0x0000000000b0b94b in chk_index_down ()
#7  0x0000000000b0bea3 in chk_index ()
#8  0x0000000000b0b98b in chk_index_down ()
#9  0x0000000000b0bea3 in chk_index ()
#10 0x0000000000b0b98b in chk_index_down ()
#11 0x0000000000b0bea3 in chk_index ()
#12 0x0000000000b0b98b in chk_index_down ()
#13 0x0000000000b0bea3 in chk_index ()
#14 0x0000000000b0dcc2 in chk_key ()
#15 0x0000000000b0872d in ha_myisam::check(THD*, st_ha_check_opt*) ()
#16 0x0000000000b05844 in ha_myisam::check_and_repair(THD*) ()
#17 0x0000000000592011 in Open_table_context::recover_from_failed_open() ()
#18 0x0000000000596e3e in open_tables(THD*, TABLE_LIST**, unsigned int*, unsigned int, Prelocking_strategy*) ()
#19 0x000000000063c699 in mysqld_show_create(THD*, TABLE_LIST*) ()
#20 0x00000000005d80dc in mysql_execute_command(THD*) ()
#21 0x00000000005da5e7 in mysql_parse ()
#22 0x00000000005dc9cc in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#23 0x0000000000699b53 in do_handle_one_connection(THD*) ()
#24 0x0000000000699c22 in handle_one_connection ()
#25 0x0000000000a715fd in pfs_spawn_thread ()
#26 0x00007ffc3de809d1 in start_thread () from /lib64/libpthread.so.0
#27 0x00007ffc3c59a8fd in clone () from /lib64/libc.so.6

Comment by Hartmut Holzgraefe [ 2015-07-24 ]

It's also weird that gdb could produce a stack trace just fine while the internal backtrace printing code couldn't.

Comment by Michael Widenius [ 2015-12-01 ]

The bug was an overflow when allocating an Aria key cache of more than 45G.

Generated at Thu Feb 08 07:23:55 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.