MariaDB Server / MDEV-30271

Mysqld crash - Semaphore wait has lasted > 600 seconds

Details

    Description

      Hi,
      We got a crash of the mysqld service after many events of semaphore waiting.

      InnoDB: ###### Diagnostic info printed to the standard error stream
      2022-12-19  2:34:16 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      221219  2:34:16 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.4.26-MariaDB-log
      key_buffer_size=134217728
      read_buffer_size=8388608
      max_used_connections=1839
      max_threads=2002
      thread_count=1845
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 82182626 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      

      dmesg had the following records

      [Mon Dec 19 02:38:09 2022] mysqld[11780]: segfault at 0 ip 000055835b30ca2e sp 00007e9454072a50 error 6 in mysqld[55835a50f000+146b000]
      [Mon Dec 19 02:38:09 2022] Code: c7 45 00 00 00 00 00 8b 7d cc 4c 89 e2 4c 89 f6 e8 57 59 78 ff 49 89 c7 49 39 c4 0f 84 bb 02 00 00 e8 e6 18 00 00 41 8b 4d 00 <89> 08 85 c9 74 1c 49 83 ff ff 0f 84 e2 01 00 00 4d 85 ff 75 27 83
      

      Some variables

      MariaDB [(none)]> show global variables like '%buffer_size' ;
      +----------------------------+-----------+
      | Variable_name              | Value     |
      +----------------------------+-----------+
      | aria_pagecache_buffer_size | 134217728 |
      | aria_sort_buffer_size      | 268434432 |
      | bulk_insert_buffer_size    | 8388608   |
      | innodb_log_buffer_size     | 33554432  |
      | innodb_sort_buffer_size    | 1048576   |
      | join_buffer_size           | 33554432  |
      | key_buffer_size            | 134217728 |
      | mrr_buffer_size            | 262144    |
      | myisam_sort_buffer_size    | 134216704 |
      | preload_buffer_size        | 32768     |
      | read_buffer_size           | 8388608   |
      | read_rnd_buffer_size       | 8388608   |
      | sort_buffer_size           | 33554432  |
      +----------------------------+-----------+
      13 rows in set (0.001 sec)
       
      MariaDB [(none)]> select @@max_connections ;
      +-------------------+
      | @@max_connections |
      +-------------------+
      |              2000 |
      +-------------------+
      1 row in set (0.000 sec)
      

      I found that the same issue was reported in MDEV-25955, where it is described as fixed in 10.4.18,
      but I am hitting this with version 10.4.26.

      We rapidly came close to the connection limit, but it is strange that this caused a lot of lock waits and, as a result, a segfault.

      I've uploaded an archive with the mysqld log and system variables, MDEV-30271.zip,
      to ftp://ftp.mariadb.org/private/

      Attachments

        1. connections.jpg
          250 kB
          Anton
        2. memory.jpg
          152 kB
          Anton

        Activity

          danblack Daniel Black added a comment -

          Hi R, for these kinds of problems it is critical to get the full backtrace of all threads.

          If this hang did generate a core dump, can you extract the backtraces of all threads with these instructions?
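          (As a sketch of what that extraction typically looks like, assuming the RHEL binary path /usr/sbin/mysqld and a core file at /var/lib/mysql/core.11780; adjust both to your layout:

              gdb --batch -ex "thread apply all bt full" /usr/sbin/mysqld /var/lib/mysql/core.11780 > all_threads.txt

          This writes the full backtrace of every thread to all_threads.txt, which shows which threads hold and which are waiting on the stuck semaphore.)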

          R Anton added a comment -

          Hi danblack
          Unfortunately a core dump wasn't created because the variable core_file is turned off (by default).

          I can take a backtrace of the currently running service, but I think it would be useless.
          Can I do something else to investigate the problem?
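          (For reference, a minimal sketch of enabling that setting, assuming a my.cnf-style configuration file for the server:

              [mysqld]
              core-file

          This turns core_file ON at the next restart; whether a core is actually written still depends on the operating-system limits discussed in the next comment.)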

          danblack Daniel Black added a comment -

          > I can take a backtrace of the currently running service, but I think it would be useless.

          You're right; it would need to be captured within the 600 seconds between when the hang occurred and the MariaDB assertion that terminates the server.

          While gcore or gdb scripting could grab a sample every 400 seconds or so, it would be a bit wasteful.
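          (A rough sketch of such sampling, assuming gdb's gcore utility is installed and the server process is named mysqld:

              while true; do
                  gcore -o /tmp/mysqld-sample "$(pidof mysqld)"
                  sleep 400
              done

          Each gcore pass briefly stops the server while it writes a full core, which is why doing this continuously is wasteful.)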

          I don't think core_file is the limiting factor. Is there a systemd LimitCORE in effect (https://access.redhat.com/solutions/649193)? Is the kernel.core_pattern sysctl set? Does /proc/$(pidof mysqld)/limits limit the core?
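          (Those checks can be run roughly like this, assuming a systemd service named mariadb and a server process named mysqld:

              systemctl show mariadb --property=LimitCORE
              sysctl kernel.core_pattern
              grep 'core file' /proc/$(pidof mysqld)/limits

          LimitCORE should not be 0, and the "Max core file size" line in the limits file shows what the running process is actually allowed to write.)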

          > Can I do something else to investigate the problem?

          Install the debuginfo packages for when it does occur.

          If you get stuck with a backtrace, you can upload the core to the ftp. Please note whether you're using a RHEL package or a MariaDB package. By default the buffer pool isn't included in the dump, so there will be little or no user data in it.

          Mainly, preparing for when a core dump occurs is the most useful thing. There just isn't sufficient information exposed elsewhere.

          R Anton added a comment -

          Hi danblack

          The problem hasn't appeared again for a long time.
          I suppose the ticket can be closed.


          People

            Assignee: Unassigned
            Reporter: Anton
            Votes: 0
            Watchers: 2
