[MDEV-10167] MariaDB crashes with Xen PVH - mysqld got signal 11 Created: 2016-06-02  Updated: 2016-09-21  Resolved: 2016-09-21

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.0.23, 10.0.25, 10.0
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Tomas Mozes Assignee: Jan Lindström (Inactive)
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Gentoo Linux, latest stable

Hardware:
HP Proliant DL360p Gen8, 2x CPU Intel Xeon CPU E5-2640 0 @ 2.50GHz, 256GB RAM, HW RAID 1 with 300-600GB disks 10.000/15.000 rpm

Software:
Linux Kernel 4.1.24
Glibc-2.22-r4
Gcc 4.9.3
Jemalloc-3.6.0
Openssl-1.0.2h
Libaio-0.3.110
Zlib-1.2.8-r1
Xen 4.6.1-r3


Attachments: Text File 01-crash-mariadb-10.0.25-1.log     Text File 02-crash-mariadb-10.0.25-2.log     Text File 03-crash-mariadb-10.0.23-1.log     Text File 04-crash-mariadb-10.0.23-2.log     File my.cnf     File mysqld.err    
Issue Links:
Relates
relates to MDEV-10020 InnoDB NOT IN Query Crash When One It... Closed

 Description   

Hello,
after upgrading our Gentoo Linux systems to latest stable versions, one of our MariaDB instance started to crash. The crashes seems to be random at times like:

160527 9:34:05 [ERROR] mysqld got signal 11 ;
160527 13:52:37 [ERROR] mysqld got signal 11 ;
160530 20:46:51 [ERROR] mysqld got signal 11 ;
160602 5:32:48 [ERROR] mysqld got signal 11 ;

We have a master-slave replication. We first upgraded the slave to 10.0.25, had it running for 2 weeks without problems. Then we did a master failover, it worked for a week and then it started to crash. First it crashed on MariaDB 10.0.25, then again on 10.0.25. We downgraded to MariaDB 10.0.23 (that version worked before the system update), but it crashed again. Then we switched back to the old master instance (failover) with MariaDB 10.0.23, but it crashed again.

The MariaDB instances are running virtualized as Xen DomU (in pvh mode) on Linux Kernel 4.1.24 with 230GB RAM / 24 vcpus. MariaDB was compiled with the following Gentoo USE flags: "extraengine jemalloc openssl pam server".

The database is mostly used for selects (1000-3000 qps).



 Comments   
Comment by Tomas Mozes [ 2016-06-02 ]

Gentoo bug report: https://bugs.gentoo.org/show_bug.cgi?id=584828

Comment by Elena Stepanova [ 2016-06-03 ]

It could have been the notorious 'long semaphore wait' crash, except that it used to be SIGABRT, not SIGSEGV.
jplindst, do you know why it could have changed? Or, can you get anything from the attached diagnostics output?

hydrapolic, could you please provide a bigger portion of the error log, the whole session since the server startup and till the restart?

Comment by Tomas Mozes [ 2016-06-03 ]

I've attached the server log. This is from the slave where we upgraded to version 10.0.25 on 5.5.2016, then we did a failover later on and the first crash appeared 27.5.2016.

Comment by Brian Evans [ 2016-06-08 ]

This may be related or a possible duplicate of MDEV-10020

Comment by Elena Stepanova [ 2016-06-09 ]

grknight, how do you figure? Do you see any resemblance in the crash stack trace, output, or anywhere else?

Comment by Brian Evans [ 2016-06-09 ]

@elenst, after review of the documents provided here, I must have misunderstood the OPs issue when i talked to him privately on IRC.
There he showed me a backtrace that looked like it was MDEV-10020.
This was from the same day as this report.

The dpaste is about to expire, so I'll repost it here

Thread pointer: 0x0x7fbbc9f89008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fbbfa5ddea0 thread_stack 0x48000
/usr/sbin/mysqld(my_print_stacktrace+0x3c)[0xb30aac]
/usr/sbin/mysqld(handle_fatal_signal+0x3a0)[0x6ea9a0]
/lib64/libpthread.so.0(+0x10db0)[0x7fbbf97e4db0]
/usr/sbin/mysqld(_ZN9in_string13value_to_itemEjP4Item+0x61)[0x71fe91]
/usr/sbin/mysqld[0x7bb276]
/usr/sbin/mysqld[0x7bb6e9]
/usr/sbin/mysqld(_ZN10SQL_SELECT17test_quick_selectEP3THD6BitmapILj64EEyybb+0x605)[0x7be3e5]
/usr/sbin/mysqld[0x5c6f14]
/usr/sbin/mysqld[0x5e8715]
/usr/sbin/mysqld(_ZN4JOIN14optimize_innerEv+0x774)[0x5ef3b4]
/usr/sbin/mysqld(_ZN4JOIN8optimizeEv+0x15)[0x5f1755]
/usr/sbin/mysqld(_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x75e)[0x5f1f7e]
/usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x162)[0x5f2292]
/usr/sbin/mysqld[0x59ec3a]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x557d)[0x5aad6d]
/usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x1d0)[0x5acb00]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x177e)[0x5ae2ce]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x203)[0x6606e3]
/usr/sbin/mysqld(handle_one_connection+0x38)[0x660738]
/lib64/libpthread.so.0(+0x7484)[0x7fbbf97db484]
/lib64/libc.so.6(clone+0x6d)[0x7fbbf83d328d]

Comment by Elena Stepanova [ 2016-06-09 ]

grknight, thanks for this, it makes much more sense now; and you might still be right about it being a duplicate of MDEV-10020.

If we look at the attached crash logs and error log as a whole, the difference with the typical "long semaphore wait" problem is even more obvious.
Normally when it happens, InnoDB starts complaining about long waits (by printing the monitor output) while the server is still alive, keeps doing so for a while, and eventually gives up and goes down with assertion failure / SIGABRT when the wait exceeds the timeout.
Here, everything is backwards. First, out of blue the server reports SIGSEGV; only after that it produces the monitor output; and if we compare "semaphore wait" complaints in the output with the timestamps in the log, we'll see

160527  9:34:05 [ERROR] mysqld got signal 11 ;
...
Attempting backtrace. 
...
=====================================
2016-05-27 09:38:22 7f5437bfc700 INNODB MONITOR OUTPUT
=====================================
...
--Thread 140007088617216 has waited at lock0lock.cc line 4603 for 256.00 seconds the semaphore:
...

256 seconds is almost exactly the time between the crash and the monitor output.

So, my best guess is that the server gets the SIGSEGV on some reason (which might well be MDEV-10020), attempts to produce the backtrace, but somehow it gets stuck in the middle, and then InnoDB monitor starts acting up.
I don't know if it's even possible for the monitor to start counting and producing output after a crash, I don't remember ever seeing it before.
So, I'll assign it to jplindst to confirm (or reject) the theory, and if it's plausible, I suppose we can assume for now that the initial problem was indeed caused by MDEV-10020, and the rest is consequences of the post-crash instability.

Comment by Tomas Mozes [ 2016-06-10 ]

We had a chat with Brian on irc while trying to bring some more light on this bug report. I was trying to compile MariaDB on Gentoo in ways it would produce the most information to proceed and I was referring to MDEV-10020 while trying to crash MariaDB, because that is the only way I can for sure crash MariaDB 10.0.25. The goal was to find the optimal settings to use in production to produce some more logs / stack traces.

The dpaste output Brian posted was from the testing machine with MariaDB 10.0.25 while trying to crash it with MDEV-10020 with various compilation / debug settings.

That said, I don't think these two bugs have something in common - if you crash with MDEV-10020, you always get a stack trace, no matter how you compile MariaDB. In this case, there is no stack trace (from 5 crashes).

It seems we have identified the root cause of these crashes. Last week, we disabled Xen PVH mode (switching back to PV mode) and there was no crash afterwards (yet). I'll wait another week and if it won't crash again, I'll open a bug report on Xen.

Comment by Tomas Mozes [ 2016-06-21 ]

http://lists.xenproject.org/archives/html/xen-users/2016-06/msg00089.html

Comment by Jan Lindström (Inactive) [ 2016-06-21 ]

Some of the crashes are long semaphore waits, but not all. Too bad that you did not provide full unedited error logs for all crash cases. Not familiar with Xen PVH or PV mode.

Comment by Tomas Mozes [ 2016-06-21 ]

@jplindst, please see the attached mysqld.err.

Comment by Jan Lindström (Inactive) [ 2016-06-21 ]

I did, it does not contain the long semaphore wait error messages seen on other error logs and from startup to crash there is several hours and no error messages.

Comment by Tomas Mozes [ 2016-06-21 ]

@jplindst, I first attached the crash logs, then the surrounding logs. The information there is complete, just not in one place (sorry for that), so mysqld.err contains the full logs except the crash logs which are in separate logs.

Comment by Tomas Mozes [ 2016-09-21 ]

This bug can be closed.

This was caused by Xen PVH mode. After switching back to PV, the crashes are gone.

Thanks for all the help.

Generated at Thu Feb 08 07:40:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.