MariaDB Server / MDEV-11370

Improvements in MTR crash or deadlock reporting

Details

    Description

      Make MTR kill the server with SIGABRT when the server startup timeout is exceeded

      Currently, when the server does not start in a reasonable time, the only diagnostics we get from buildbot are

      Failed to start mysqld.3
      

      and the server error log. When the server hangs on startup, we want to have a stack trace.

      Check if it's possible to kill the server with SIGABRT when the result is 'Failed to start'. If the server is already down, it won't hurt, but if it hangs, it can give more information.
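      The startup-timeout idea above can be sketched as follows. This is a hypothetical Python illustration, not MTR's actual Perl implementation; the `ready` callback stands in for whatever readiness check MTR performs:

```python
import os
import signal
import subprocess
import time

def wait_or_abort(proc, ready, timeout):
    """Poll `ready()` until the server is up; on timeout, send SIGABRT to a
    still-running process so its fatal-signal handler can print a stack trace.

    Hypothetical sketch; `proc` is a subprocess.Popen for the server and
    `ready` is a zero-argument callable returning True once it accepts
    connections."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return False          # server already exited: nothing to signal
        if ready():
            return True           # started fine within the timeout
        time.sleep(0.1)
    # Still running but never became ready: abort it for diagnostics.
    # If the process died meanwhile, the signal simply does no harm.
    os.kill(proc.pid, signal.SIGABRT)
    return False
```

      As the description notes, signalling an already-dead server is harmless, while a hung one yields a stack trace instead of a bare "Failed to start".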

      Make MTR kill the server with SIGABRT when the server shutdown timeout is exceeded and print a stack trace

      When the server shutdown timeout is exceeded, MTR silently SIGKILLs the server and proceeds. This hides important failures on server shutdown. Instead, MTR should report a hang and get a stack trace. On the other hand, this must be differentiated from intentional server killing, when the shutdown timeout is initially zero: in that case, the server should be SIGKILLed without a stack trace.
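      The shutdown policy described above, including the special case of a zero timeout, can be sketched like this (again a hypothetical Python illustration, not MTR's Perl code):

```python
import os
import signal
import subprocess

def stop_server(proc, shutdown_timeout):
    """Hypothetical sketch of the shutdown policy described above:
    a zero timeout means an intentional kill (plain SIGKILL, no stack
    trace), while a server that outlives a nonzero timeout is treated
    as hung and sent SIGABRT so a stack trace can be collected."""
    if shutdown_timeout == 0:
        proc.kill()               # intentional kill: SIGKILL, no diagnostics
        proc.wait()
        return
    try:
        proc.wait(timeout=shutdown_timeout)   # normal shutdown path
    except subprocess.TimeoutExpired:
        # Shutdown hang: abort the server so its fatal-signal handler
        # reports a stack trace, then reap it.
        os.kill(proc.pid, signal.SIGABRT)
        proc.wait()
```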


          Activity

            Starting with MariaDB Server 10.3.1, shutdown_server will kill -ABRT upon exceeding the timeout (60 seconds by default).

            However, the kill -ABRT will not always produce a core dump, and mysqltest does not always seem to be interested in extracting the stack traces from the core dump. Here is a recent example of an apparent shutdown hang, possibly caused by MDEV-14486. On kvm-asan, core dumps are disabled by default due to their sheer size:

            10.4 0308de94ee806c21b6776ecab73396da75282596

            CURRENT_TEST: innodb.ibuf_not_empty
            2019-11-11 10:17:51 0 [Note] /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld: ready for connections.
            Version: '10.4.11-MariaDB-debug-log'  socket: '/dev/shm/var/tmp/1/mysqld.1.sock'  port: 16000  Source distribution
            2019-11-11 10:17:52 0 [Note] /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld (initiated by: root[root] @ localhost []): Normal shutdown
            2019-11-11 10:17:52 0 [Note] Event Scheduler: Purging the queue. 0 events
            2019-11-11 10:17:52 0 [Note] InnoDB: FTS optimize thread exiting.
            2019-11-11 10:17:52 0 [Note] InnoDB: Starting shutdown...
            2019-11-11 10:17:52 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/var/1/mysqld.1/data/ib_buffer_pool
            2019-11-11 10:17:52 0 [Note] InnoDB: Instance 0, restricted to 123 pages due to innodb_buf_pool_dump_pct=25
            2019-11-11 10:17:52 0 [Note] InnoDB: Buffer pool(s) dump completed at 191111 10:17:52
            191111 10:18:52 [ERROR] mysqld got signal 6 ;
            This could be because you hit a bug. It is also possible that this binary
            or one of the libraries it was linked against is corrupt, improperly built,
            or misconfigured. This error can also be caused by malfunctioning hardware.
             
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed, 
            something is definitely wrong and this may fail.
             
            Server version: 10.4.11-MariaDB-debug-log
            key_buffer_size=1048576
            read_buffer_size=131072
            max_used_connections=1
            max_threads=153
            thread_count=0
            It is possible that mysqld could use up to 
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63622 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            Thread pointer: 0x0
            Attempting backtrace. You can use the following information to find out
            where mysqld died. If you see no messages after this, something went
            terribly wrong...
            stack_bottom = 0x0 thread_stack 0x5fc00
            /usr/lib/x86_64-linux-gnu/libasan.so.2(+0x4a077)[0x7f5e580f5077]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(my_print_stacktrace+0xce)[0x2328dd3]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(handle_fatal_signal+0x87a)[0x11377b4]
            /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f5e57220390]
            /lib/x86_64-linux-gnu/libpthread.so.0(__nanosleep+0x2d)[0x7f5e5721fc1d]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x1b42488]
            maria/ha_maria.cc:2773(ha_maria::external_lock(THD*, int))[0x1ae0635]
            csv/ha_tina.cc:399(ha_tina::init_tina_writer())[0x1d342ed]
            sql/spatial.cc:2823(Gis_multi_polygon::area(double*, char const**) const)[0x19559b8]
            sql/sql_connect.cc:1351(do_handle_one_connection(CONNECT*))[0x11400c7]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x9f0371]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x9f0fa7]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(_Z15plugin_shutdownv+0x1a5)[0x9f5373]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld[0x6fe7ca]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(_Z11mysqld_mainiPPc+0x173d)[0x70c25a]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(main+0x20)[0x6f5606]
            /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0)[0x7f5e55d14830]
            /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld(_start+0x29)[0x6f5519]
            The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
            information that should help you find out what is causing the crash.
            Writing a core file...
            Working directory at /dev/shm/var/1/mysqld.1/data
            Resource Limits:
            Limit                     Soft Limit           Hard Limit           Units     
            Max cpu time              unlimited            unlimited            seconds   
            Max file size             unlimited            unlimited            bytes     
            Max data size             unlimited            unlimited            bytes     
            Max stack size            8388608              unlimited            bytes     
            Max core file size        0                    0                    bytes     
            Max resident set          unlimited            unlimited            bytes     
            Max processes             23715                23715                processes 
            Max open files            1024                 1024                 files     
            Max locked memory         65536                65536                bytes     
            Max address space         unlimited            unlimited            bytes     
            Max file locks            unlimited            unlimited            locks     
            Max pending signals       23715                23715                signals   
            Max msgqueue size         819200               819200               bytes     
            Max nice priority         0                    0                    
            Max realtime priority     0                    0                    
            Max realtime timeout      unlimited            unlimited            us        
            Core pattern: |/usr/share/apport/apport %p %s %c %P
             
            2019-11-11 10:18:55 0 [Note] /home/buildbot/buildbot/build/mariadb-10.4.11/sql/mysqld (mysqld 10.4.11-MariaDB-debug-log) starting as process 21192 ...
            

            Note: there were exactly 60 seconds between the two messages near the start of the excerpt, so we can be fairly sure that the SIGABRT was sent externally. Also worth noting is the completely bogus stack trace.

            I think that before attempting to deliver the SIGABRT, it would be useful to attach gdb to the process and issue thread apply all backtrace. In that way, we will get some information even if the core dump is disabled. (There should be some error handling, of course. Attaching the debugger would not work if another debugger is already attached, or might not produce desired results if the server is running under Valgrind, for example.)
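            The gdb suggestion can be sketched as follows. The batch-mode invocation (`gdb -p PID -batch -ex 'thread apply all backtrace'`) uses standard gdb options; the function names and the error handling around them are hypothetical:

```python
import shutil
import subprocess

def gdb_backtrace_cmd(pid):
    # Batch-mode gdb invocation: attach to `pid`, print backtraces for
    # all threads, then exit (which detaches from the process).
    return ["gdb", "-p", str(pid), "-batch",
            "-ex", "thread apply all backtrace"]

def collect_backtrace(pid, timeout=60):
    """Best effort: return gdb's output, or None if gdb is missing or
    attaching fails (e.g. another debugger is already attached, or the
    server runs under Valgrind, as the comment notes)."""
    if shutil.which("gdb") is None:
        return None
    try:
        result = subprocess.run(gdb_backtrace_cmd(pid),
                                capture_output=True, text=True,
                                timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout if result.returncode == 0 else None
```

            Attaching before delivering the SIGABRT yields per-thread stack traces even when core dumps are disabled, as on kvm-asan.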

            Furthermore, it looks like server crashes on re-bootstrap are mostly being ignored: they only result in some messages in the stdout.log and do not lead to test failures.

            Finally, I suspect that we are not always trying to extract stack traces when the server crashes on restart, either between tests or inside a test.

            Last, a wishlist item: if very many tests fail, or the server is very likely to crash during startup (which can be the case while developing some change), then a ./mtr --max-test-fail=0 run will likely be aborted abruptly.

            marko Marko Mäkelä added a comment

            People

              cvicentiu Vicențiu Ciorbaru
              elenst Elena Stepanova
              Votes: 0
              Watchers: 3
