
MDEV-10167: MariaDB crashes with Xen PVH - mysqld got signal 11

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 10.0.23, 10.0.25, 10.0 (EOL)
    • Fix Version/s: N/A
    • Component/s: None

    Description

      Hello,
      After upgrading our Gentoo Linux systems to the latest stable versions, one of our MariaDB instances started to crash. The crashes seem to happen at random times, for example:

      160527 9:34:05 [ERROR] mysqld got signal 11 ;
      160527 13:52:37 [ERROR] mysqld got signal 11 ;
      160530 20:46:51 [ERROR] mysqld got signal 11 ;
      160602 5:32:48 [ERROR] mysqld got signal 11 ;

      We have a master-slave replication setup. We first upgraded the slave to 10.0.25 and had it running for 2 weeks without problems. Then we did a master failover; it worked for a week and then started to crash. First it crashed on MariaDB 10.0.25, then again on 10.0.25. We downgraded to MariaDB 10.0.23 (the version that worked before the system update), but it crashed again. Then we switched back to the old master instance (failover) with MariaDB 10.0.23, but it crashed again.

      The MariaDB instances run virtualized as Xen DomU guests (in PVH mode) on Linux kernel 4.1.24 with 230 GB RAM / 24 vCPUs. MariaDB was compiled with the following Gentoo USE flags: "extraengine jemalloc openssl pam server".
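
      (For reference, a USE flag selection like the one above usually corresponds to a package.use entry along these lines; the file name is only an illustration:)

      # /etc/portage/package.use/mariadb  (illustrative file name)
      dev-db/mariadb extraengine jemalloc openssl pam server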

      The database is mostly used for selects (1000-3000 qps).

      Attachments

        1. 01-crash-mariadb-10.0.25-1.log
          41 kB
        2. 02-crash-mariadb-10.0.25-2.log
          1 kB
        3. 03-crash-mariadb-10.0.23-1.log
          20 kB
        4. 04-crash-mariadb-10.0.23-2.log
          23 kB
        5. my.cnf
          2 kB
        6. mysqld.err
          23 kB


          Activity

            hydrapolic Tomáš Mózes added a comment - Gentoo bug report: https://bugs.gentoo.org/show_bug.cgi?id=584828

            elenst Elena Stepanova added a comment -

            It could have been the notorious 'long semaphore wait' crash, except that it used to be SIGABRT, not SIGSEGV.
            jplindst, do you know why it could have changed? Or can you get anything from the attached diagnostics output?

            hydrapolic, could you please provide a bigger portion of the error log, the whole session from server startup until the restart?
            hydrapolic Tomáš Mózes added a comment (edited) -

            I've attached the server log. This is from the slave that we upgraded to 10.0.25 on 5.5.2016; we did a failover later on, and the first crash appeared on 27.5.2016.

            grknight Brian Evans added a comment -

            This may be related to, or a possible duplicate of, MDEV-10020.


            elenst Elena Stepanova added a comment -

            grknight, how do you figure? Do you see any resemblance in the crash stack trace, output, or anywhere else?
            grknight Brian Evans added a comment -

            @elenst, after reviewing the documents provided here, I must have misunderstood the OP's issue when I talked to him privately on IRC.
            There he showed me a backtrace that looked like it was MDEV-10020.
            This was from the same day as this report.

            The dpaste is about to expire, so I'll repost it here:

            Thread pointer: 0x0x7fbbc9f89008
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x7fbbfa5ddea0 thread_stack 0x48000
            /usr/sbin/mysqld(my_print_stacktrace+0x3c)[0xb30aac]
            /usr/sbin/mysqld(handle_fatal_signal+0x3a0)[0x6ea9a0]
            /lib64/libpthread.so.0(+0x10db0)[0x7fbbf97e4db0]
            /usr/sbin/mysqld(_ZN9in_string13value_to_itemEjP4Item+0x61)[0x71fe91]
            /usr/sbin/mysqld[0x7bb276]
            /usr/sbin/mysqld[0x7bb6e9]
            /usr/sbin/mysqld(_ZN10SQL_SELECT17test_quick_selectEP3THD6BitmapILj64EEyybb+0x605)[0x7be3e5]
            /usr/sbin/mysqld[0x5c6f14]
            /usr/sbin/mysqld[0x5e8715]
            /usr/sbin/mysqld(_ZN4JOIN14optimize_innerEv+0x774)[0x5ef3b4]
            /usr/sbin/mysqld(_ZN4JOIN8optimizeEv+0x15)[0x5f1755]
            /usr/sbin/mysqld(_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x75e)[0x5f1f7e]
            /usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x162)[0x5f2292]
            /usr/sbin/mysqld[0x59ec3a]
            /usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x557d)[0x5aad6d]
            /usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x1d0)[0x5acb00]
            /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x177e)[0x5ae2ce]
            /usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x203)[0x6606e3]
            /usr/sbin/mysqld(handle_one_connection+0x38)[0x660738]
            /lib64/libpthread.so.0(+0x7484)[0x7fbbf97db484]
            /lib64/libc.so.6(clone+0x6d)[0x7fbbf83d328d]
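
            (Side note: the mangled C++ frame names in a trace like this can be decoded with c++filt from binutils, e.g. for the in_string frame shown above:)

            $ echo '_ZN9in_string13value_to_itemEjP4Item' | c++filt
            in_string::value_to_item(unsigned int, Item*)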
            


            elenst Elena Stepanova added a comment -

            grknight, thanks for this, it makes much more sense now; and you might still be right about it being a duplicate of MDEV-10020.

            If we look at the attached crash logs and error log as a whole, the difference with the typical "long semaphore wait" problem is even more obvious.
            Normally when it happens, InnoDB starts complaining about long waits (by printing the monitor output) while the server is still alive, keeps doing so for a while, and eventually gives up and goes down with assertion failure / SIGABRT when the wait exceeds the timeout.
            Here, everything is backwards. First, out of the blue, the server reports SIGSEGV; only after that does it produce the monitor output; and if we compare the "semaphore wait" complaints in the output with the timestamps in the log, we see:

            160527  9:34:05 [ERROR] mysqld got signal 11 ;
            ...
            Attempting backtrace. 
            ...
            =====================================
            2016-05-27 09:38:22 7f5437bfc700 INNODB MONITOR OUTPUT
            =====================================
            ...
            --Thread 140007088617216 has waited at lock0lock.cc line 4603 for 256.00 seconds the semaphore:
            ...
            

            256 seconds is almost exactly the time between the crash (09:34:05) and the monitor output (09:38:22, i.e. 257 seconds later).

            So, my best guess is that the server gets the SIGSEGV for some reason (which might well be MDEV-10020), attempts to produce the backtrace, but somehow gets stuck in the middle, and then the InnoDB monitor starts acting up.
            I don't know if it's even possible for the monitor to start counting and producing output after a crash; I don't remember ever seeing it before.
            So I'll assign it to jplindst to confirm (or reject) the theory, and if it's plausible, I suppose we can assume for now that the initial problem was indeed caused by MDEV-10020, and the rest is a consequence of the post-crash instability.
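
            (For anyone comparing against a live server: the same monitor output, including the SEMAPHORES section with the wait timings discussed above, can be pulled on demand with a standard client command, e.g.:)

            $ mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A20 SEMAPHORES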


            hydrapolic Tomáš Mózes added a comment -

            We had a chat with Brian on IRC while trying to shed some more light on this bug report. I was trying to compile MariaDB on Gentoo in ways that would produce the most information to work with, and I was referring to MDEV-10020 while trying to crash MariaDB, because that is the only way I can reliably crash MariaDB 10.0.25. The goal was to find the optimal settings to use in production to produce some more logs / stack traces.

            The dpaste output Brian posted was from a testing machine with MariaDB 10.0.25 while trying to crash it via MDEV-10020 with various compilation / debug settings.

            That said, I don't think these two bugs have anything in common: if you crash via MDEV-10020, you always get a stack trace, no matter how you compile MariaDB. In this case, there is no stack trace (from 5 crashes).
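
            (For context, per-package debug builds on Gentoo are typically set up roughly as sketched below; this is an illustration only, not necessarily the exact settings used here:)

            # /etc/portage/env/debugsyms  (illustrative file name)
            CFLAGS="${CFLAGS} -ggdb"
            CXXFLAGS="${CXXFLAGS} -ggdb"
            FEATURES="splitdebug"

            # /etc/portage/package.env  (maps the env file to the package)
            dev-db/mariadb debugsyms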

            It seems we have identified the root cause of these crashes. Last week we disabled Xen PVH mode (switching back to PV mode), and there has been no crash since (yet). I'll wait another week, and if it doesn't crash again, I'll open a bug report against Xen.
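
            (For reference, the PVH/PV distinction is set in the xl domU configuration. A rough sketch, assuming the PVHv1 syntax of the Xen 4.x releases current at the time; later PVHv2 releases use type = "pvh" instead:)

            # domU config sketch (option names vary by Xen version; PVHv1 shown)
            pvh = 1     # run this PV guest in PVH mode
            # reverting to plain PV means dropping the pvh option (or setting pvh = 0)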

            hydrapolic Tomáš Mózes added a comment - http://lists.xenproject.org/archives/html/xen-users/2016-06/msg00089.html

            jplindst Jan Lindström (Inactive) added a comment -

            Some of the crashes are long semaphore waits, but not all. Too bad that you did not provide full unedited error logs for all crash cases. I'm not familiar with Xen PVH or PV mode.


            hydrapolic Tomáš Mózes added a comment -

            @jplindst, please see the attached mysqld.err.


            jplindst Jan Lindström (Inactive) added a comment -

            I did; it does not contain the long semaphore wait error messages seen in the other error logs, and there are several hours between startup and the crash with no error messages in between.


            hydrapolic Tomáš Mózes added a comment -

            @jplindst, I first attached the crash logs, then the surrounding logs. The information there is complete, just not in one place (sorry for that): mysqld.err contains the full log except for the crash logs, which are attached separately.


            hydrapolic Tomáš Mózes added a comment -

            This bug can be closed.

            This was caused by Xen PVH mode. After switching back to PV, the crashes are gone.

            Thanks for all the help.


            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: hydrapolic Tomáš Mózes
              Votes: 0
              Watchers: 4
