MariaDB Server / MDEV-21010

MariaDB hangs (during a backtrace), stops responding to new connections

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Cannot Reproduce
    • 10.3.17
    • N/A
    • Backup
    • Debian buster 10.3.17-MariaDB-0+deb10u1 amd64

    Description

      A few times per week, mysqld hangs during my daily mysqldump run (invoked by the Debian automysqlbackup script). The TCP listener and the Unix socket stay up, but the server no longer accepts new queries. The only remedy is to kill -9 the mysqld process.

      I was able to capture a gdb stack trace in this state; see the attachments.

      The Debian bug report is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=943962

      The error.log says:

      corrupted size vs. prev_size
      191108 7:22:01 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.

      To report this bug, see https://mariadb.com/kb/en/reporting-bugs

      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.

      Server version: 10.3.17-MariaDB-0+deb10u1-log
      key_buffer_size=134217728
      read_buffer_size=131072
      max_used_connections=6
      max_threads=65546
      thread_count=7
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142762793 K bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.

      Thread pointer: 0x7f33e4178cc8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f341981ace8 thread_stack 0x49000

      My query.log has this right before the crash:

      13019 Init DB drupal6
      13019 Query SHOW CREATE DATABASE IF NOT EXISTS `drupal6`
      13019 Query SHOW TABLES LIKE '%'
      13019 Query LOCK TABLES `access` READ /*!32311 LOCAL */,`accesslog` READ /*!32311 LOCAL */,`actions` READ /*!32311 LOCAL */,`actions_aid` READ /*!32311 LOCAL */,
      `advanced_help_index` READ /*!32311 LOCAL */,`authmap` READ /*!32311 LOCAL */,`batch` READ /*!32311 LOCAL */,`blocks` READ /*!32311 LOCAL */,
      `blocks_roles` READ /*!32311 LOCAL */,`boxes` READ /*!32311 LOCAL */,`cache` READ /*!32311 LOCAL */,`cache_block` READ /*!32311 LOCAL */,
      `cache_content` READ /*!32311 LOCAL */,`cache_filter` READ /*!32311 LOCAL */,`cache_form` READ /*!32311 LOCAL */,`cache_menu` READ /*!32311 LOCAL */,
      `cache_page` READ /*!32311 LOCAL */,`cache_update` READ /*!32311 LOCAL */,`cache_views` READ /*!32311 LOCAL */,`cache_views_data` READ /*!32311 LOCAL */,
      `comments` READ /*!32311 LOCAL */,`content_group` READ /*!32311 LOCAL */,`content_group_fields` READ /*!32311 LOCAL */,`content_node_field` READ /*!32311 LOCAL */,
      `content_node_field_instance` READ /*!32311 LOCAL */,`content_type_event` READ /*!32311 LOCAL */,`date_format_locale` READ /*!32311 LOCAL */,`date_format_types` READ /*!32311 LOCAL */,
      `date_formats` READ /*!32311 LOCAL */,`files` READ /*!32311 LOCAL */,`filter_formats` READ /*!32311 LOCAL */,`filters` READ /*!32311 LOCAL */,
      `flood` READ /*!32311 LOCAL */,`history` READ /*!32311 LOCAL */,`image` READ /*!32311 LOCAL */,`image_attach` READ /*!32311 LOCAL */,
      `img_assist_map` READ /*!32311 LOCAL */,`languages` READ /*!32311 LOCAL */,`locales_source` READ /*!32311 LOCAL */,`locales_target` READ /*!32311 LOCAL */,
      `menu_custom` READ /*!32311 LOCAL */,`menu_links` READ /*!32311 LOCAL */,`menu_router` READ /*!32311 LOCAL */,`messaging_message_parts` READ /*!32311 LOCAL */,
      `messaging_store` READ /*!32311 LOCAL */,`modr8_log` READ /*!32311 LOCAL */,`node` READ /*!32311 LOCAL */,`node_access` READ /*!32311 LOCAL */,
      `node_access_role` READ /*!32311 LOCAL */,`node_access_user` READ /*!32311 LOCAL */,`node_comment_statistics` READ /*!32311 LOCAL */,`node_counter` READ /*!32311 LOCAL */,
      `node_revisions` READ /*!32311 LOCAL */,`node_type` READ /*!32311 LOCAL */,`notifications` READ /*!32311 LOCAL */,`notifications_event` READ /*!32311 LOCAL */,
      `notifications_fields` READ /*!32311 LOCAL */,`notifications_queue` READ /*!32311 LOCAL */,`notifications_sent` READ /*!32311 LOCAL */,`permission` READ /*!32311 LOCAL */,
      `poll` READ /*!32311 LOCAL */,`poll_choices` READ /*!32311 LOCAL */,`poll_votes` READ /*!32311 LOCAL */,`revision_moderation` READ /*!32311 LOCAL */,
      `role` READ /*!32311 LOCAL */,`search_dataset` READ /*!32311 LOCAL */,`search_index` READ /*!32311 LOCAL */,`search_node_links` READ /*!32311 LOCAL */,
      `search_total` READ /*!32311 LOCAL */,`semaphore` READ /*!32311 LOCAL */,`sessions` READ /*!32311 LOCAL */,`system` READ /*!32311 LOCAL */,
      `term_data` READ /*!32311 LOCAL */,`term_hierarchy` READ /*!32311 LOCAL */,`term_node` READ /*!32311 LOCAL */,`term_relation` READ /*!32311 LOCAL */,
      `term_synonym` READ /*!32311 LOCAL */,`trigger_assignments` READ /*!32311 LOCAL */,`upload` READ /*!32311 LOCAL */,`url_alias` READ /*!32311 LOCAL */,
      `user_import` READ /*!32311 LOCAL */,`user_import_errors` READ /*!32311 LOCAL */,`users` READ /*!32311 LOCAL */,`users_roles` READ /*!32311 LOCAL */,
      `variable` READ /*!32311 LOCAL */,`views_display` READ /*!32311 LOCAL */,`views_object_cache` READ /*!32311 LOCAL */,`views_view` READ /*!32311 LOCAL */,
      `vocabulary` READ /*!32311 LOCAL */,`vocabulary_node_types` READ /*!32311 LOCAL */,`watchdog` READ /*!32311 LOCAL */,`workflow_access` READ /*!32311 LOCAL */,
      `workflow_node` READ /*!32311 LOCAL */,`workflow_node_history` READ /*!32311 LOCAL */,`workflow_scheduled_transition` READ /*!32311 LOCAL */,`workflow_states` READ /*!32311 LOCAL */,
      `workflow_transitions` READ /*!32311 LOCAL */,`workflow_type_map` READ /*!32311 LOCAL */,`workflows` READ /*!32311 LOCAL */,`wysiwyg` READ /*!32311 LOCAL */,
      `xmlsitemap` READ /*!32311 LOCAL */,`xmlsitemap_node` READ /*!32311 LOCAL */,`xmlsitemap_taxonomy` READ /*!32311 LOCAL */
      191108 7:22:01 13019 Query SELECT engine FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = DATABASE() AND table_name = 'access'
      [snip]
      13019 Init DB drupal6
      13019 Query select @@collation_database
      13019 Query SELECT TRIGGER_NAME FROM INFORMATION_SCHEMA.TRIGGERS WHERE EVENT_OBJECT_SCHEMA = DATABASE() AND EVENT_OBJECT_TABLE = 'url_alias'
      13019 Query SET SESSION character_set_results = 'utf8mb4'
      13019 Query SELECT engine FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = DATABASE() AND table_name = 'user_import'
      13019 Query SET SQL_QUOTE_SHOW_CREATE=1
      13019 Query SET SESSION character_set_results = 'binary'
      13019 Query show create table `user_import`
      13019 Query SET SESSION character_set_results = 'utf8mb4'
      13019 Query show fields from `user_import`
      13019 Query SELECT /*!40001 SQL_NO_CACHE */ `import_id`, `name`, `filename`, `oldfilename`, `filepath`, `started`, `pointer`, `processed`, `valid`, `first_line_skip`, `contact`, `username_space`, `send_email`, `field_match`, `roles`, `options`, `setting` FROM `user_import`

          Activity

            RichieB Richard added a comment -

            Until the signal handler is changed to avoid this deadlock, please ship MariaDB with a default config of "stack-trace=off". Having your database suddenly hang and stop processing requests is not acceptable in any situation.

            marko Marko Mäkelä added a comment - - edited

            Below is a comment of mine dated 2024-02-06 from another ticket:

            To add insult to the injury of reading man 7 signal-safety on Linux (which mentions many things, including the following):

            POSIX.1-2001 TC1 clarified that if an application calls fork(2) from a signal handler and any of the fork handlers registered by pthread_atfork(3) calls a function that is not async-signal-safe, the behavior is undefined. A future revision of the standard is likely to remove fork(2) from the list of async-signal-safe functions.

            I read man 2 open on Linux yesterday:

            O_DIRECT I/Os should never be run concurrently with the fork(2) system call, if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag; this includes memory allocated on the heap and statically allocated buffers). Any such I/Os, whether submitted via an asynchronous I/O interface or from another thread in the process, should be completed before fork(2) is called. Failure to do so can result in data corruption and undefined behavior in parent and child processes.

            Starting with MDEV-24854 we enable O_DIRECT I/O by default.

            Our built-in stack trace resolver is invoking the fork(2) system call, and I suspect that it may invoke some system calls that are not listed as safe in man 7 signal-safety. There have been several cases where the server has crashed while executing the stack trace resolver, and some cases where it has hung. I am not aware of cases where we got corruption as a result of a crash, but then again, we have plenty of ‘mystery corruption’ bugs where the original cause of the corruption is unknown.

            marko Marko Mäkelä added a comment -

            MDEV-32363 complicated this further and added some output that is not going to be useful for anyone who is not using Galera:

            10.6 80fff4c6b1a741c25b46bde6ab9d80042ac47b8f

            240920 10:55:22 [ERROR] mysqld got signal 6 ;
            Sorry, we probably made a mistake, and this is a bug.
             
            Your assistance in bug reporting will enable us to fix this for the next release.
            To report this bug, see https://mariadb.com/kb/en/reporting-bugs
             
            We will try our best to scrape up some info that will hopefully help
            diagnose the problem, but since we have already crashed, 
            something is definitely wrong and this may fail.
             
            Server version: 10.6.20-MariaDB-debug source revision: 80fff4c6b1a741c25b46bde6ab9d80042ac47b8f
            key_buffer_size=134217728
            read_buffer_size=131072
            max_used_connections=0
            max_threads=153
            thread_count=0
            It is possible that mysqld could use up to 
            key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 468095 K  bytes of memory
            Hope that's ok; if not, decrease some variables in the equation.
             
            WSREP: Suppressing further logging
            WSREP: Shutting down network communications
            terminate called after throwing an instance of 'wsrep::runtime_error'
              what():  provider not loaded
            

            monty Michael Widenius added a comment -

            A short comment on "Having your database suddenly hang and stop processing requests is not acceptable in any situation".

            Signal handlers in MariaDB are only used when the server crashes and is no longer able to process requests.
            Using signal handlers does not affect normal operations.

            It is unfortunate that if the crash happens inside a C library function such as realloc(), MariaDB will not be able to get a stack trace.
            Disabling stack traces in MariaDB by default is not a good idea, as getting stack traces works in most cases and is very useful for finding out what went wrong.

            The current problem could have been diagnosed if the user had disabled stack traces and, the next time MariaDB crashed, produced a stack trace from the core file.
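The workaround described here can be sketched as a server configuration fragment. skip-stack-trace and core-file are existing mysqld options; the section placement is the usual one, but which configuration file it belongs in on a given Debian installation is an assumption, and whether a core file is actually written also depends on operating system limits (for example the core size limit of the service) that are not shown.

```
[mysqld]
# Do not run the built-in stack trace code in the crash handler
skip-stack-trace
# Ask the server to write a core file when it crashes
core-file
```

After the next crash, the core file can be opened in gdb together with the matching mysqld binary, where "thread apply all bt" prints a backtrace for every thread.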

            monty Michael Widenius added a comment -

            I am closing this bug because we never got a stack trace for the original problem and are therefore unable to find out why there was a crash in realloc().
            It sounds like a memory overrun issue and is very likely to be fixed in a newer MariaDB version.


            People

              cvicentiu Vicențiu Ciorbaru
              RichieB Richard
              Votes: 2
              Watchers: 16
