[MDEV-22235] MySQL node crashes regularly on "Semaphore wait has lasted > 600 seconds" Created: 2020-04-14  Updated: 2020-10-14  Resolved: 2020-08-24

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.3.22
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: David Delbecq Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 0
Labels: need_feedback
Environment:

Linux, CentOS 6 and CentOS 7


Attachments: File error_mariadb.gz     Text File gdb.txt     Text File issue.txt     HTML File mariadb_variables    

 Description   

Hello,

Since our upgrade from MariaDB 5.5.52 to 10.3.22 on 25/2/2020, we have had 3 crashes on 2 of our MariaDB nodes, each time caused by a mutex failure. That averages about 1 crash per week, but it is inconsistent: there were only 3 days between the second and third crash. Every time this happens, the whole MariaDB cluster is blocked by the failing node and all our applications stall for around 12 minutes.

Here is an extract of the log (the complete error log is attached).

2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
--Thread 139770264647424 has waited at lock0lock.cc line 4567 for 77.00 seconds the semaphore:
Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
 
2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
--Thread 139762764658432 has waited at lock0lock.cc line 3903 for 429.00 seconds the semaphore:
Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
 
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending reads 0, writes 0
InnoDB: ###### Diagnostic info printed to the standard error stream
2020-04-11 23:29:31 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
200411 23:29:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.3.22-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=601
max_threads=602
thread_count=397
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1928416 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f227c1d347b]
/usr/sbin/mysqld(handle_fatal_signal+0x4b7)[0x7f227bcba087]
/lib64/libpthread.so.0(+0x3e7380f7e0)[0x7f227b0997e0]
/lib64/libc.so.6(gsignal+0x35)[0x7f22796f34f5]
/lib64/libc.so.6(abort+0x175)[0x7f22796f4cd5]
/usr/sbin/mysqld(+0xadd3ba)[0x7f227bfa73ba]
/usr/sbin/mysqld(+0xa8e1c5)[0x7f227bf581c5]
/lib64/libpthread.so.0(+0x3e73807aa1)[0x7f227b091aa1]
/lib64/libc.so.6(clone+0x6d)[0x7f22797a9c4d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Fatal signal 11 while backtracing
2020-04-11 23:30:41 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
2020-04-11 23:30:41 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-04-11 23:30:41 0 [Note] WSREP: Found saved state: 648134e5-23a5-11e8-bbbc-7b838bd00037:-1, safe_to_bootstrap: 0
2020-04-11 23:30:41 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.42.11; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 8192M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 0.8; gcs.fc_limit = 80; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://192.168.42.11; gmcast.segment = 0; gmc
2020-04-11 23:30:41 0 [Note] WSREP: GCache history reset: 648134e5-23a5-11e8-bbbc-7b838bd00037:0 -> 648134e5-23a5-11e8-bbbc-7b838bd00037:2336233095
2020-04-11 23:30:41 0 [Note] WSREP: Assign initial position for certification: 2336233095, protocol version: -1
2020-04-11 23:30:41 0 [Note] WSREP: wsrep_sst_grab()
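The "> 600 seconds" in the [FATAL] line above comes from InnoDB's long-semaphore-wait watchdog, which in MariaDB is controlled by the innodb_fatal_semaphore_wait_threshold variable (default 600). A minimal sketch of how to inspect it follows; note that raising the threshold only postpones the intentional crash, it does not fix the underlying lock_sys contention.

```shell
# Show the watchdog threshold (default 600 seconds).
# Requires a running server and appropriate credentials.
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold'"

# To postpone the intentional crash while diagnosing, the threshold can be
# raised at server startup, e.g. in my.cnf (value in seconds, illustrative):
#   [mysqld]
#   innodb_fatal_semaphore_wait_threshold = 900
```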

We don't know whether it is related, but in each of the 3 cases the error log also contained this error.

OS WAIT ARRAY INFO: signal count 442592098
RW-shared spins 0, rounds 140548689, OS waits 41314150
RW-excl spins 0, rounds 868356577, OS waits 12754003
RW-sx spins 5319566, rounds 96848713, OS waits 987940
Spin rounds per wait: 140548689.00 RW-shared, 868356577.00 RW-excl, 18.21 RW-sx
------------------------
LATEST FOREIGN KEY ERROR
------------------------
2020-04-11 16:13:14 0x7f206e42e700 Transaction:
TRANSACTION 51276290676, ACTIVE 0 sec updating or deleting
mysql tables in use 2, locked 2
4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 270341, OS thread handle 139777265559296, query id 19109428451 192.168.42.11 redacteduser Updating
delete from sessions where token='redactedToken'
Foreign key constraint fails for table `RedactedDB`.`vpn_connections`:
,
  CONSTRAINT `fk_vpn_connections_1` FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`)
Trying to delete or update in parent table, in index PRIMARY tuple:
DATA TUPLE: 8 fields;
 0: len 32; hex 3861646565336535383530346131653763376438396463643938313036336634; asc 8adee3e58504a1e7c7d89dcd981063f4;;
 1: len 6; hex 000bf04e1a74; asc    N t;;
 2: len 7; hex 3100017bfe27ab; asc 1  { ' ;;
 3: len 4; hex 80031fff; asc     ;;
 4: len 4; hex 80068370; asc    p;;
 5: len 8; hex 8000125f4654bb2d; asc    _FT -;;
 6: len 8; hex 8000125f4654e23d; asc    _FT =;;
 7: len 4; hex 80000000; asc     ;;
 
But in child table `RedactedDB`.`vpn_connections`, in index fk_vpn_connections_1, there is a record:
PHYSICAL RECORD: n_fields 3; compact format; info bits 0
 0: len 30; hex 386164656533653538353034613165376337643839646364393831303633; asc 8adee3e58504a1e7c7d89dcd981063; (total 32 bytes);
 1: len 4; hex 800d4aa4; asc   J ;;
 2: len 4; hex 80068370; asc    p;;

Also note that the 'redactedToken' is different for each crash; within a given crash, however, the same 'redactedToken' appears around 30 times.
There is a nearly 5-hour gap between this last foreign key error and the mutex failure, so the two might be unrelated.
This is seriously impacting our business. Any ideas?

Attached:

  • the error log
  • the list of MariaDB global variable values


 Comments   
Comment by Marko Mäkelä [ 2020-07-24 ]

dade, because you are apparently using Galera replication, I think this report might be a duplicate of MDEV-21910.

Hangs in the server are best debugged by attaching a debugger to the hung process, or starting the debugger on a core dump, and then extracting the stack traces of all threads. In GDB, I would recommend the following statements:

set print frame-arguments all
set height 0
set log on
thread apply all backtrace
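The GDB statements above can also be run non-interactively against a core dump, which is convenient once the server has already crashed. A sketch, where the binary and core paths are placeholders to be adjusted for your system:

```shell
# Run GDB in batch mode against a core dump and capture all thread
# backtraces into gdb.txt in the current directory.
gdb /usr/sbin/mysqld /var/lib/mysql/core.12345 \
    -ex "set print frame-arguments all" \
    -ex "set height 0" \
    -ex "set logging file gdb.txt" \
    -ex "set logging on" \
    -ex "thread apply all backtrace" \
    -batch
```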

Comment by Mark Zealey [ 2020-10-14 ]

We have also seen this exact same issue twice in the past couple of weeks, first on 10.4.13 and then yesterday on 10.4.14 after roughly 2 weeks of running (we upgraded after the first crash because we thought MDEV-21910 might be the cause). In our case we have very heavy reads and writes on a single server, and light reads/writes on the other 2. We are now trying to focus all writes on a single server to see if this works around the issue. Unlike MDEV-21910, we have not been killing threads, which leads me to think this may well be a different bug, somehow triggered by load or a race condition. We've been slowly ramping up load on our platform for the past 4 months or so and never saw this bug until a couple of weeks ago, even though we had been running 10.4.13 the whole time.

I've attached the gdb.txt you requested above, taken from the 10.4.14 core dump (CentOS 8 + debuginfo; the core was limited to 2 GB, so some data is missing). However, for some reason most of the symbols are not resolved, so I'm not sure how useful it will be. Let me know anything else I can try to make it more readable.

Comment by Mark Zealey [ 2020-10-14 ]

issue.txt is abridged syslog output, hopefully containing all the useful info. As connections stalled, more and more connections came in, up to the 3000-connection limit we had set.

Generated at Thu Feb 08 09:13:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.