MariaDB Server
MDEV-22235

MariaDB node crashes regularly on "Semaphore wait has lasted > 600 seconds"

Details

    Description

      Hello,

      Since our upgrade from MariaDB 5.5.52 to 10.3.22 on 25/2/2020, we have had 3 crashes on 2 of our MariaDB nodes, each time caused by a MUTEX failure. This is about 1 crash per week, but it is inconsistent; there were only 3 days between the second and third crash. Every time this happens, the whole MariaDB cluster gets blocked by the failing node and all our applications are blocked for around 12 minutes.

      Here is an extract of the log (the complete error log is attached).

      2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
      --Thread 139770264647424 has waited at lock0lock.cc line 4567 for 77.00 seconds the semaphore:
      Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
       
      2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
      --Thread 139762764658432 has waited at lock0lock.cc line 3903 for 429.00 seconds the semaphore:
      Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
       
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending reads 0, writes 0
      InnoDB: ###### Diagnostic info printed to the standard error stream
      2020-04-11 23:29:31 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      200411 23:29:31 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.3.22-MariaDB-log
      key_buffer_size=67108864
      read_buffer_size=1048576
      max_used_connections=601
      max_threads=602
      thread_count=397
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1928416 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f227c1d347b]
      /usr/sbin/mysqld(handle_fatal_signal+0x4b7)[0x7f227bcba087]
      /lib64/libpthread.so.0(+0x3e7380f7e0)[0x7f227b0997e0]
      /lib64/libc.so.6(gsignal+0x35)[0x7f22796f34f5]
      /lib64/libc.so.6(abort+0x175)[0x7f22796f4cd5]
      /usr/sbin/mysqld(+0xadd3ba)[0x7f227bfa73ba]
      /usr/sbin/mysqld(+0xa8e1c5)[0x7f227bf581c5]
      /lib64/libpthread.so.0(+0x3e73807aa1)[0x7f227b091aa1]
      /lib64/libc.so.6(clone+0x6d)[0x7f22797a9c4d]
      The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /var/lib/mysql
      Fatal signal 11 while backtracing
      2020-04-11 23:30:41 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
      2020-04-11 23:30:41 0 [Note] WSREP: CRC-32C: using hardware acceleration.
      2020-04-11 23:30:41 0 [Note] WSREP: Found saved state: 648134e5-23a5-11e8-bbbc-7b838bd00037:-1, safe_to_bootstrap: 0
      2020-04-11 23:30:41 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.42.11; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 8192M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 0.8; gcs.fc_limit = 80; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://192.168.42.11; gmcast.segment = 0; gmc
      2020-04-11 23:30:41 0 [Note] WSREP: GCache history reset: 648134e5-23a5-11e8-bbbc-7b838bd00037:0 -> 648134e5-23a5-11e8-bbbc-7b838bd00037:2336233095
      2020-04-11 23:30:41 0 [Note] WSREP: Assign initial position for certification: 2336233095, protocol version: -1
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_sst_grab()
      
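      For context, the same semaphore diagnostics can be requested on demand while a node is stalling, and the "> 600 seconds" abort is driven by a server variable. A minimal sketch (assuming the variable is present in this build; raising it only delays the intentional crash, it does not address the underlying hang):

      -- Dump the InnoDB monitor; its SEMAPHORES section shows which mutex/latch
      -- each waiting thread is queued on, the same data printed in the log above:
      SHOW ENGINE INNODB STATUS\G

      -- The threshold for the deliberate abort (default 600 seconds), if available in this build:
      SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold';
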

      We don't know if it is related, but in each of those 3 cases the error log also contained this error.

      OS WAIT ARRAY INFO: signal count 442592098
      RW-shared spins 0, rounds 140548689, OS waits 41314150
      RW-excl spins 0, rounds 868356577, OS waits 12754003
      RW-sx spins 5319566, rounds 96848713, OS waits 987940
      Spin rounds per wait: 140548689.00 RW-shared, 868356577.00 RW-excl, 18.21 RW-sx
      ------------------------
      LATEST FOREIGN KEY ERROR
      ------------------------
      2020-04-11 16:13:14 0x7f206e42e700 Transaction:
      TRANSACTION 51276290676, ACTIVE 0 sec updating or deleting
      mysql tables in use 2, locked 2
      4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
      MySQL thread id 270341, OS thread handle 139777265559296, query id 19109428451 192.168.42.11 redacteduser Updating
      delete from sessions where token='redactedToken'
      Foreign key constraint fails for table `RedactedDB`.`vpn_connections`:
      ,
        CONSTRAINT `fk_vpn_connections_1` FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`)
      Trying to delete or update in parent table, in index PRIMARY tuple:
      DATA TUPLE: 8 fields;
       0: len 32; hex 3861646565336535383530346131653763376438396463643938313036336634; asc 8adee3e58504a1e7c7d89dcd981063f4;;
       1: len 6; hex 000bf04e1a74; asc    N t;;
       2: len 7; hex 3100017bfe27ab; asc 1  { ' ;;
       3: len 4; hex 80031fff; asc     ;;
       4: len 4; hex 80068370; asc    p;;
       5: len 8; hex 8000125f4654bb2d; asc    _FT -;;
       6: len 8; hex 8000125f4654e23d; asc    _FT =;;
       7: len 4; hex 80000000; asc     ;;
       
      But in child table `RedactedDB`.`vpn_connections`, in index fk_vpn_connections_1, there is a record:
      PHYSICAL RECORD: n_fields 3; compact format; info bits 0
       0: len 30; hex 386164656533653538353034613165376337643839646364393831303633; asc 8adee3e58504a1e7c7d89dcd981063; (total 32 bytes);
       1: len 4; hex 800d4aa4; asc   J ;;
       2: len 4; hex 80068370; asc    p;;
      

      Also note that for each crash the 'redactedToken' is different; however, for a given crash, the same 'redactedToken' appears around 30 times.
      There is a nearly 5-hour gap between this last foreign key error and the mutex failure, so the two might be unrelated.
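
      For reference, a check along these lines (a sketch using the column names shown in the constraint above; the token literal is only a placeholder) would list the child rows that make such a DELETE fail:

      -- Which vpn_connections rows still reference the session being deleted?
      SELECT vc.*
      FROM `RedactedDB`.`vpn_connections` vc
      WHERE vc.session_id = 'redactedToken';

      -- If child rows are expected to disappear with the session, the constraint
      -- would need ON DELETE CASCADE (a schema choice, unrelated to the hang itself):
      -- ALTER TABLE `RedactedDB`.`vpn_connections`
      --   DROP FOREIGN KEY `fk_vpn_connections_1`,
      --   ADD CONSTRAINT `fk_vpn_connections_1`
      --     FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`)
      --     ON DELETE CASCADE;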
      This is seriously impacting our business; any ideas?

      Attached:

      • the error log
      • the list of MariaDB global variable values

      Attachments

        1. error_mariadb.gz
          355 kB
          David Delbecq
        2. gdb.txt
          566 kB
          Mark Zealey
        3. issue.txt
          52 kB
          Mark Zealey
        4. mariadb_variables
          20 kB
          David Delbecq

        Activity

          marko Marko Mäkelä added a comment -

          dade, because you are apparently using Galera replication, I am thinking that this report might be a duplicate of MDEV-21910.

          Hangs in the server are best debugged if you could attach a debugger to the hung process, or start the debugger on a core dump, and then extract the stack traces of all threads. In GDB, I would recommend the following statements:

          set print frame-arguments all
          set height 0
          set log on
          thread apply all backtrace
          

          mark.zealey Mark Zealey added a comment -

           We have also seen this exact same issue twice in the past couple of weeks, first on 10.4.13 and then yesterday on 10.4.14 after roughly 2 weeks of running (we upgraded after the first incident because we thought MDEV-21910 might be the cause). In our case we have very heavy reads & writes on a single server, and light reads/writes on the other 2. We are now shifting to focus all writes on a single server to see if this works around the issue. Unlike MDEV-21910 we have not been killing threads, which leads me to think this may well be a different bug, somehow triggered by load or a race condition. We have been slowly ramping up load on our platform for the past 4 months or so and never saw this bug before a couple of weeks ago, even though we had been running 10.4.13 the whole time.

           I've attached the gdb.txt you requested above, taken from the 10.4.14 core dump (CentOS 8 + debuginfo); the core was limited to 2 GB, so some data is missing. However, for some reason most of the symbols are not being resolved, so I'm not sure how useful it will be. Let me know anything else I can try to make it more readable.

          mark.zealey Mark Zealey added a comment -

           issue.txt is abridged syslog output that hopefully contains all the useful info. As connections stalled, many more connections kept coming in, up to the 3000-connection limit that we set.
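
           (For reference, a quick way to watch that pile-up from a separate session, assuming the node still accepts connections; a sketch, not taken from the report:)

           -- How close is the node to its connection limit while the stall is ongoing?
           SHOW GLOBAL VARIABLES LIKE 'max_connections';
           SHOW GLOBAL STATUS LIKE 'Threads_connected';
           -- And what the piled-up threads are doing:
           SHOW FULL PROCESSLIST;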


          People

            jplindst Jan Lindström (Inactive)
            dade David Delbecq
            Votes: 0
            Watchers: 5

