
MDEV-10167: MariaDB crashes with Xen PVH - mysqld got signal 11

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Not a Bug
    • Affects Version/s: 10.0.23, 10.0.25, 10.0 (EOL)
    • Fix Version/s: N/A
    • Component/s: None

    Description

      Hello,
      After upgrading our Gentoo Linux systems to the latest stable versions, one of our MariaDB instances started to crash. The crashes seem to happen at random times, for example:

      160527 9:34:05 [ERROR] mysqld got signal 11 ;
      160527 13:52:37 [ERROR] mysqld got signal 11 ;
      160530 20:46:51 [ERROR] mysqld got signal 11 ;
      160602 5:32:48 [ERROR] mysqld got signal 11 ;

      We have a master-slave replication setup. We first upgraded the slave to 10.0.25 and had it running for 2 weeks without problems. Then we did a master failover; it worked for a week and then started to crash. First it crashed on MariaDB 10.0.25, then again on 10.0.25. We downgraded to MariaDB 10.0.23 (the version that worked before the system update), but it crashed again. Then we switched back to the old master instance (failover) with MariaDB 10.0.23, but it crashed again.

      The MariaDB instances run virtualized as Xen DomU guests (in PVH mode) on Linux kernel 4.1.24 with 230 GB RAM / 24 vCPUs. MariaDB was compiled with the following Gentoo USE flags: "extraengine jemalloc openssl pam server".
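
      (For reference, a USE flag selection like the one above usually corresponds to a package.use entry along these lines; the file name is only an illustration:)

      # /etc/portage/package.use/mariadb  (illustrative file name)
      dev-db/mariadb extraengine jemalloc openssl pam server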

      The database is mostly used for selects (1000-3000 qps).

      Attachments

        1. 01-crash-mariadb-10.0.25-1.log
          41 kB
        2. 02-crash-mariadb-10.0.25-2.log
          1 kB
        3. 03-crash-mariadb-10.0.23-1.log
          20 kB
        4. 04-crash-mariadb-10.0.23-2.log
          23 kB
        5. my.cnf
          2 kB
        6. mysqld.err
          23 kB


          Activity

            hydrapolic Tomáš Mózes added a comment - Gentoo bug report: https://bugs.gentoo.org/show_bug.cgi?id=584828

            elenst Elena Stepanova added a comment -

            It could have been the notorious 'long semaphore wait' crash, except that it used to be SIGABRT, not SIGSEGV.
            jplindst, do you know why it could have changed? Or can you get anything from the attached diagnostics output?

            hydrapolic, could you please provide a bigger portion of the error log, the whole session from server startup until the restart?
            hydrapolic Tomáš Mózes added a comment (edited) -

            I've attached the server log. This is from the slave that we upgraded to 10.0.25 on 5.5.2016; we did a failover later on, and the first crash appeared on 27.5.2016.

            grknight Brian Evans added a comment -

            This may be related to, or a possible duplicate of, MDEV-10020.


            elenst Elena Stepanova added a comment -

            grknight, how do you figure? Do you see any resemblance in the crash stack trace, output, or anywhere else?
            grknight Brian Evans added a comment -

            @elenst, after reviewing the documents provided here, I must have misunderstood the OP's issue when I talked to him privately on IRC.
            There he showed me a backtrace that looked like it was MDEV-10020.
            This was from the same day as this report.

            The dpaste is about to expire, so I'll repost it here:

            Thread pointer: 0x0x7fbbc9f89008
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x7fbbfa5ddea0 thread_stack 0x48000
            /usr/sbin/mysqld(my_print_stacktrace+0x3c)[0xb30aac]
            /usr/sbin/mysqld(handle_fatal_signal+0x3a0)[0x6ea9a0]
            /lib64/libpthread.so.0(+0x10db0)[0x7fbbf97e4db0]
            /usr/sbin/mysqld(_ZN9in_string13value_to_itemEjP4Item+0x61)[0x71fe91]
            /usr/sbin/mysqld[0x7bb276]
            /usr/sbin/mysqld[0x7bb6e9]
            /usr/sbin/mysqld(_ZN10SQL_SELECT17test_quick_selectEP3THD6BitmapILj64EEyybb+0x605)[0x7be3e5]
            /usr/sbin/mysqld[0x5c6f14]
            /usr/sbin/mysqld[0x5e8715]
            /usr/sbin/mysqld(_ZN4JOIN14optimize_innerEv+0x774)[0x5ef3b4]
            /usr/sbin/mysqld(_ZN4JOIN8optimizeEv+0x15)[0x5f1755]
            /usr/sbin/mysqld(_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x75e)[0x5f1f7e]
            /usr/sbin/mysqld(_Z13handle_selectP3THDP3LEXP13select_resultm+0x162)[0x5f2292]
            /usr/sbin/mysqld[0x59ec3a]
            /usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x557d)[0x5aad6d]
            /usr/sbin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_state+0x1d0)[0x5acb00]
            /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x177e)[0x5ae2ce]
            /usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x203)[0x6606e3]
            /usr/sbin/mysqld(handle_one_connection+0x38)[0x660738]
            /lib64/libpthread.so.0(+0x7484)[0x7fbbf97db484]
            /lib64/libc.so.6(clone+0x6d)[0x7fbbf83d328d]
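
            (Side note: the mangled C++ frame names in a trace like this can be decoded with c++filt from binutils, e.g. for the in_string frame shown above:)

            $ echo '_ZN9in_string13value_to_itemEjP4Item' | c++filt
            in_string::value_to_item(unsigned int, Item*)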
            


            elenst Elena Stepanova added a comment -

            grknight, thanks for this, it makes much more sense now; and you might still be right about it being a duplicate of MDEV-10020.

            If we look at the attached crash logs and error log as a whole, the difference with the typical "long semaphore wait" problem is even more obvious.
            Normally when it happens, InnoDB starts complaining about long waits (by printing the monitor output) while the server is still alive, keeps doing so for a while, and eventually gives up and goes down with assertion failure / SIGABRT when the wait exceeds the timeout.
            Here, everything is backwards. First, out of the blue, the server reports SIGSEGV; only after that does it produce the monitor output; and if we compare the "semaphore wait" complaints in the output with the timestamps in the log, we see:

            160527  9:34:05 [ERROR] mysqld got signal 11 ;
            ...
            Attempting backtrace. 
            ...
            =====================================
            2016-05-27 09:38:22 7f5437bfc700 INNODB MONITOR OUTPUT
            =====================================
            ...
            --Thread 140007088617216 has waited at lock0lock.cc line 4603 for 256.00 seconds the semaphore:
            ...
            

            256 seconds is almost exactly the time between the crash (09:34:05) and the monitor output (09:38:22, i.e. 257 seconds later).

            So, my best guess is that the server gets the SIGSEGV for some reason (which might well be MDEV-10020), attempts to produce the backtrace, but somehow gets stuck in the middle, and then the InnoDB monitor starts acting up.
            I don't know if it's even possible for the monitor to start counting and producing output after a crash; I don't remember ever seeing it before.
            So I'll assign it to jplindst to confirm (or reject) the theory, and if it's plausible, I suppose we can assume for now that the initial problem was indeed caused by MDEV-10020, and the rest is a consequence of the post-crash instability.
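
            (For anyone comparing against a live server: the same monitor output, including the SEMAPHORES section with the wait timings discussed above, can be pulled on demand with a standard client command, e.g.:)

            $ mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A20 SEMAPHORES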


            hydrapolic Tomáš Mózes added a comment -

            We had a chat with Brian on IRC while trying to shed some more light on this bug report. I was trying to compile MariaDB on Gentoo in ways that would produce the most information to work with, and I was referring to MDEV-10020 while trying to crash MariaDB, because that is the only way I can reliably crash MariaDB 10.0.25. The goal was to find the optimal settings to use in production to produce some more logs / stack traces.

            The dpaste output Brian posted was from a testing machine with MariaDB 10.0.25 while trying to crash it via MDEV-10020 with various compilation / debug settings.

            That said, I don't think these two bugs have anything in common: if you crash via MDEV-10020, you always get a stack trace, no matter how you compile MariaDB. In this case, there is no stack trace (from 5 crashes).
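
            (For context, per-package debug builds on Gentoo are typically set up roughly as sketched below; this is an illustration only, not necessarily the exact settings used here:)

            # /etc/portage/env/debugsyms  (illustrative file name)
            CFLAGS="${CFLAGS} -ggdb"
            CXXFLAGS="${CXXFLAGS} -ggdb"
            FEATURES="splitdebug"

            # /etc/portage/package.env  (maps the env file to the package)
            dev-db/mariadb debugsyms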

            It seems we have identified the root cause of these crashes. Last week we disabled Xen PVH mode (switching back to PV mode), and there has been no crash since (yet). I'll wait another week, and if it doesn't crash again, I'll open a bug report against Xen.
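
            (For reference, the PVH/PV distinction is set in the xl domU configuration. A rough sketch, assuming the PVHv1 syntax of the Xen 4.x releases current at the time; later PVHv2 releases use type = "pvh" instead:)

            # domU config sketch (option names vary by Xen version; PVHv1 shown)
            pvh = 1     # run this PV guest in PVH mode
            # reverting to plain PV means dropping the pvh option (or setting pvh = 0)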

            hydrapolic Tomáš Mózes added a comment - http://lists.xenproject.org/archives/html/xen-users/2016-06/msg00089.html

            jplindst Jan Lindström (Inactive) added a comment -

            Some of the crashes are long semaphore waits, but not all. Too bad that you did not provide full unedited error logs for all crash cases. I'm not familiar with Xen PVH or PV mode.


            hydrapolic Tomáš Mózes added a comment -

            @jplindst, please see the attached mysqld.err.


            jplindst Jan Lindström (Inactive) added a comment -

            I did; it does not contain the long semaphore wait error messages seen in the other error logs, and there are several hours between startup and the crash with no error messages in between.


            hydrapolic Tomáš Mózes added a comment -

            @jplindst, I first attached the crash logs, then the surrounding logs. The information there is complete, just not in one place (sorry for that): mysqld.err contains the full log except for the crash logs, which are attached separately.


            hydrapolic Tomáš Mózes added a comment -

            This bug can be closed.

            This was caused by Xen PVH mode. After switching back to PV, the crashes are gone.

            Thanks for all the help.


            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: hydrapolic Tomáš Mózes
              Votes: 0
              Watchers: 4
