Details
Type: Bug
Status: Closed
Priority: Critical
Resolution: Incomplete
Affects Version/s: 10.3.22
Environment: Linux, CentOS 6 and CentOS 7
Description
Hello,
Since our upgrade from MariaDB 5.5.52 to 10.3.22 on 25/2/2020, we have had 3 crashes on 2 of our MariaDB nodes, each time caused by a mutex failure. That is roughly 1 crash per week, but the interval is inconsistent: there were only 3 days between the second and third crash. Every time this happens, the whole MariaDB cluster gets blocked by the failing node and all our applications are blocked for around 12 minutes.
Here is an extract of the log (the complete error log is attached).
2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
--Thread 139770264647424 has waited at lock0lock.cc line 4567 for 77.00 seconds the semaphore:
Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
--Thread 139762764658432 has waited at lock0lock.cc line 3903 for 429.00 seconds the semaphore:
Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending reads 0, writes 0
InnoDB: ###### Diagnostic info printed to the standard error stream
2020-04-11 23:29:31 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
200411 23:29:31 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.3.22-MariaDB-log
key_buffer_size=67108864
read_buffer_size=1048576
max_used_connections=601
max_threads=602
thread_count=397
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1928416 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
/usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f227c1d347b]
/usr/sbin/mysqld(handle_fatal_signal+0x4b7)[0x7f227bcba087]
/lib64/libpthread.so.0(+0x3e7380f7e0)[0x7f227b0997e0]
/lib64/libc.so.6(gsignal+0x35)[0x7f22796f34f5]
/lib64/libc.so.6(abort+0x175)[0x7f22796f4cd5]
/usr/sbin/mysqld(+0xadd3ba)[0x7f227bfa73ba]
/usr/sbin/mysqld(+0xa8e1c5)[0x7f227bf581c5]
/lib64/libpthread.so.0(+0x3e73807aa1)[0x7f227b091aa1]
/lib64/libc.so.6(clone+0x6d)[0x7f22797a9c4d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /var/lib/mysql
Fatal signal 11 while backtracing
2020-04-11 23:30:41 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
2020-04-11 23:30:41 0 [Note] WSREP: CRC-32C: using hardware acceleration.
2020-04-11 23:30:41 0 [Note] WSREP: Found saved state: 648134e5-23a5-11e8-bbbc-7b838bd00037:-1, safe_to_bootstrap: 0
2020-04-11 23:30:41 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.42.11; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 8192M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 0.8; gcs.fc_limit = 80; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://192.168.42.11; gmcast.segment = 0; gmc
2020-04-11 23:30:41 0 [Note] WSREP: GCache history reset: 648134e5-23a5-11e8-bbbc-7b838bd00037:0 -> 648134e5-23a5-11e8-bbbc-7b838bd00037:2336233095
2020-04-11 23:30:41 0 [Note] WSREP: Assign initial position for certification: 2336233095, protocol version: -1
2020-04-11 23:30:41 0 [Note] WSREP: wsrep_sst_grab()
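For context, both waiters above are stuck on the same LOCK_SYS mutex, and the server killed itself once a wait crossed the 600-second watchdog. Below is a minimal sketch of the settings and counters we can check around that watchdog while a node is stuck; it assumes the standard MariaDB 10.3 variable names, so please correct us if any of them do not apply to this build:

SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold';  -- the 600 s watchdog that fired (assumed name)
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_wait_timeout';
SHOW ENGINE INNODB STATUS\G                      -- SEMAPHORES and TRANSACTIONS sections on the hung node
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';   -- Galera flow control counters, to see why the other nodes stall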
We don't know if it's related, but in each of those 3 cases, the error log also contained this error:
OS WAIT ARRAY INFO: signal count 442592098
RW-shared spins 0, rounds 140548689, OS waits 41314150
RW-excl spins 0, rounds 868356577, OS waits 12754003
RW-sx spins 5319566, rounds 96848713, OS waits 987940
Spin rounds per wait: 140548689.00 RW-shared, 868356577.00 RW-excl, 18.21 RW-sx
------------------------
LATEST FOREIGN KEY ERROR
------------------------
2020-04-11 16:13:14 0x7f206e42e700 Transaction:
TRANSACTION 51276290676, ACTIVE 0 sec updating or deleting
mysql tables in use 2, locked 2
4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
MySQL thread id 270341, OS thread handle 139777265559296, query id 19109428451 192.168.42.11 redacteduser Updating
delete from sessions where token='redactedToken'
Foreign key constraint fails for table `RedactedDB`.`vpn_connections`:
,
CONSTRAINT `fk_vpn_connections_1` FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`)
Trying to delete or update in parent table, in index PRIMARY tuple:
DATA TUPLE: 8 fields;
0: len 32; hex 3861646565336535383530346131653763376438396463643938313036336634; asc 8adee3e58504a1e7c7d89dcd981063f4;;
1: len 6; hex 000bf04e1a74; asc N t;;
2: len 7; hex 3100017bfe27ab; asc 1 { ' ;;
3: len 4; hex 80031fff; asc ;;
4: len 4; hex 80068370; asc p;;
5: len 8; hex 8000125f4654bb2d; asc _FT -;;
6: len 8; hex 8000125f4654e23d; asc _FT =;;
7: len 4; hex 80000000; asc ;;
But in child table `RedactedDB`.`vpn_connections`, in index fk_vpn_connections_1, there is a record:
PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 30; hex 386164656533653538353034613165376337643839646364393831303633; asc 8adee3e58504a1e7c7d89dcd981063; (total 32 bytes);
1: len 4; hex 800d4aa4; asc J ;;
2: len 4; hex 80068370; asc p;;
Also note that the 'redactedToken' is different for each crash; within a given crash, however, the same 'redactedToken' appears around 30 times.
There is also a nearly 5-hour gap between this last foreign key error and the mutex failure, so the two might be unrelated.
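To make the logged constraint concrete, here is a minimal sketch of the relationship and of a check for the child rows that block such a delete. Table and column names are taken from the log above; everything else (column types, the exact constraint DDL) is an assumption for illustration only:

-- Constraint as reported by InnoDB (child -> parent), sketched from the log:
-- ALTER TABLE `RedactedDB`.`vpn_connections`
--   ADD CONSTRAINT `fk_vpn_connections_1`
--   FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`);

-- Hypothetical check for child rows still referencing the session being deleted:
SELECT COUNT(*)
FROM `RedactedDB`.`vpn_connections`
WHERE `session_id` = 'redactedToken';

Since the constraint in the log has no ON DELETE action, the DELETE on `sessions` is rejected while such child rows exist, which is presumably why this error was recorded.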
This is seriously impacting our business. Any ideas?
Attached:
- the error log
- the list of MariaDB global variable values