Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-22235

Mysql node crash regulary on Semaphore wait has lasted > 600 seconds

Details

    Description

      Hello,

      since our upgrade from mariadb 5.5.52, to 10.3.22 on 25/2/2020, we had 3 crashes on 2 of our mariadb nodes, each time caused by a MUTEX failure. This is about 1 crash per week, but it's inconsistent, we had only 3 days betwee second and third crash. Every time this happen, the whole mariadb cluster get blocked by this failing node and all our applications are blocked for around 12 minutes.

      Here is an extract of log (complete log of error attached).

      2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
      --Thread 139770264647424 has waited at lock0lock.cc line 4567 for 77.00 seconds the semaphore:
      Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
       
      2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
      --Thread 139762764658432 has waited at lock0lock.cc line 3903 for 429.00 seconds the semaphore:
      Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
       
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending reads 0, writes 0
      InnoDB: ###### Diagnostic info printed to the standard error stream
      2020-04-11 23:29:31 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      200411 23:29:31 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.3.22-MariaDB-log
      key_buffer_size=67108864
      read_buffer_size=1048576
      max_used_connections=601
      max_threads=602
      thread_count=397
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1928416 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f227c1d347b]
      /usr/sbin/mysqld(handle_fatal_signal+0x4b7)[0x7f227bcba087]
      /lib64/libpthread.so.0(+0x3e7380f7e0)[0x7f227b0997e0]
      /lib64/libc.so.6(gsignal+0x35)[0x7f22796f34f5]
      /lib64/libc.so.6(abort+0x175)[0x7f22796f4cd5]
      /usr/sbin/mysqld(+0xadd3ba)[0x7f227bfa73ba]
      /usr/sbin/mysqld(+0xa8e1c5)[0x7f227bf581c5]
      /lib64/libpthread.so.0(+0x3e73807aa1)[0x7f227b091aa1]
      /lib64/libc.so.6(clone+0x6d)[0x7f22797a9c4d]
      The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /var/lib/mysql
      Fatal signal 11 while backtracing
      2020-04-11 23:30:41 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
      2020-04-11 23:30:41 0 [Note] WSREP: CRC-32C: using hardware acceleration.
      2020-04-11 23:30:41 0 [Note] WSREP: Found saved state: 648134e5-23a5-11e8-bbbc-7b838bd00037:-1, safe_to_bootstrap: 0
      2020-04-11 23:30:41 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.42.11; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 8192M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 0.8; gcs.fc_limit = 80; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://192.168.42.11; gmcast.segment = 0; gmc
      2020-04-11 23:30:41 0 [Note] WSREP: GCache history reset: 648134e5-23a5-11e8-bbbc-7b838bd00037:0 -> 648134e5-23a5-11e8-bbbc-7b838bd00037:2336233095
      2020-04-11 23:30:41 0 [Note] WSREP: Assign initial position for certification: 2336233095, protocol version: -1
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_sst_grab()
      

      We don't know if it's related, but in each of those 3 cases, the error log contained also this error.

      OS WAIT ARRAY INFO: signal count 442592098
      RW-shared spins 0, rounds 140548689, OS waits 41314150
      RW-excl spins 0, rounds 868356577, OS waits 12754003
      RW-sx spins 5319566, rounds 96848713, OS waits 987940
      Spin rounds per wait: 140548689.00 RW-shared, 868356577.00 RW-excl, 18.21 RW-sx
      ------------------------
      LATEST FOREIGN KEY ERROR
      ------------------------
      2020-04-11 16:13:14 0x7f206e42e700 Transaction:
      TRANSACTION 51276290676, ACTIVE 0 sec updating or deleting
      mysql tables in use 2, locked 2
      4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
      MySQL thread id 270341, OS thread handle 139777265559296, query id 19109428451 192.168.42.11 redacteduser Updating
      delete from sessions where token='redactedToken'
      Foreign key constraint fails for table `RedactedDB`.`vpn_connections`:
      ,
        CONSTRAINT `fk_vpn_connections_1` FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`)
      Trying to delete or update in parent table, in index PRIMARY tuple:
      DATA TUPLE: 8 fields;
       0: len 32; hex 3861646565336535383530346131653763376438396463643938313036336634; asc 8adee3e58504a1e7c7d89dcd981063f4;;
       1: len 6; hex 000bf04e1a74; asc    N t;;
       2: len 7; hex 3100017bfe27ab; asc 1  { ' ;;
       3: len 4; hex 80031fff; asc     ;;
       4: len 4; hex 80068370; asc    p;;
       5: len 8; hex 8000125f4654bb2d; asc    _FT -;;
       6: len 8; hex 8000125f4654e23d; asc    _FT =;;
       7: len 4; hex 80000000; asc     ;;
       
      But in child table `RedactedDB`.`vpn_connections`, in index fk_vpn_connections_1, there is a record:
      PHYSICAL RECORD: n_fields 3; compact format; info bits 0
       0: len 30; hex 386164656533653538353034613165376337643839646364393831303633; asc 8adee3e58504a1e7c7d89dcd981063; (total 32 bytes);
       1: len 4; hex 800d4aa4; asc   J ;;
       2: len 4; hex 80068370; asc    p;;
      

      Also note that for each each crash, the 'redactedToken' is different. However, for a given crash, the 'redactedToken' appear around 30 times.
      However there is a nearly 5h gap between this last foreign key error and the mutext failure, so this might be unrelated.
      This is seriously impacting our business, any idea?

      Attached:

      • the error log
      • the list of mariadb variable global values

      Attachments

        1. error_mariadb.gz
          355 kB
          David Delbecq
        2. mariadb_variables
          20 kB
          David Delbecq
        3. gdb.txt
          566 kB
          Mark Zealey
        4. issue.txt
          52 kB
          Mark Zealey

        Activity

          People

            jplindst Jan Lindström (Inactive)
            dade David Delbecq
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.