MariaDB Server
MDEV-22235

MariaDB node crashes regularly on "Semaphore wait has lasted > 600 seconds"

Details

    Description

      Hello,

      Since our upgrade from MariaDB 5.5.52 to 10.3.22 on 25/2/2020, we have had 3 crashes on 2 of our MariaDB nodes, each time caused by a MUTEX failure. This is about 1 crash per week, but it is inconsistent; there were only 3 days between the second and third crash. Every time this happens, the whole MariaDB cluster gets blocked by the failing node and all our applications are blocked for around 12 minutes.

      Here is an extract of the log (the complete error log is attached).

      2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
      --Thread 139770264647424 has waited at lock0lock.cc line 4567 for 77.00 seconds the semaphore:
      Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
       
      2020-04-11 23:29:01 0 [Note] InnoDB: A semaphore wait:
      --Thread 139762764658432 has waited at lock0lock.cc line 3903 for 429.00 seconds the semaphore:
      Mutex at 0x7f227cc1edc0, Mutex LOCK_SYS created lock0lock.cc:460, lock var 2
       
      InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
      InnoDB: Pending reads 0, writes 0
      InnoDB: ###### Diagnostic info printed to the standard error stream
      2020-04-11 23:29:31 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
      200411 23:29:31 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.3.22-MariaDB-log
      key_buffer_size=67108864
      read_buffer_size=1048576
      max_used_connections=601
      max_threads=602
      thread_count=397
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1928416 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2b)[0x7f227c1d347b]
      /usr/sbin/mysqld(handle_fatal_signal+0x4b7)[0x7f227bcba087]
      /lib64/libpthread.so.0(+0x3e7380f7e0)[0x7f227b0997e0]
      /lib64/libc.so.6(gsignal+0x35)[0x7f22796f34f5]
      /lib64/libc.so.6(abort+0x175)[0x7f22796f4cd5]
      /usr/sbin/mysqld(+0xadd3ba)[0x7f227bfa73ba]
      /usr/sbin/mysqld(+0xa8e1c5)[0x7f227bf581c5]
      /lib64/libpthread.so.0(+0x3e73807aa1)[0x7f227b091aa1]
      /lib64/libc.so.6(clone+0x6d)[0x7f22797a9c4d]
      The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /var/lib/mysql
      Fatal signal 11 while backtracing
      2020-04-11 23:30:41 0 [Note] WSREP: Read nil XID from storage engines, skipping position init
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_load(): Galera 25.3.28(r3875) by Codership Oy <info@codership.com> loaded successfully.
      2020-04-11 23:30:41 0 [Note] WSREP: CRC-32C: using hardware acceleration.
      2020-04-11 23:30:41 0 [Note] WSREP: Found saved state: 648134e5-23a5-11e8-bbbc-7b838bd00037:-1, safe_to_bootstrap: 0
      2020-04-11 23:30:41 0 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = 192.168.42.11; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 8192M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 0.8; gcs.fc_limit = 80; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.listen_addr = tcp://192.168.42.11; gmcast.segment = 0; gmc
      2020-04-11 23:30:41 0 [Note] WSREP: GCache history reset: 648134e5-23a5-11e8-bbbc-7b838bd00037:0 -> 648134e5-23a5-11e8-bbbc-7b838bd00037:2336233095
      2020-04-11 23:30:41 0 [Note] WSREP: Assign initial position for certification: 2336233095, protocol version: -1
      2020-04-11 23:30:41 0 [Note] WSREP: wsrep_sst_grab()
      
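      For context, the same semaphore diagnostics can be requested on demand while a node is stalling, and the "> 600 seconds" abort is driven by a server variable. A minimal sketch (assuming the variable is present in this build; raising it only delays the intentional crash, it does not address the underlying hang):

      -- Dump the InnoDB monitor; its SEMAPHORES section shows which mutex/latch
      -- each waiting thread is queued on, the same data printed in the log above:
      SHOW ENGINE INNODB STATUS\G

      -- The threshold for the deliberate abort (default 600 seconds), if available in this build:
      SHOW GLOBAL VARIABLES LIKE 'innodb_fatal_semaphore_wait_threshold';
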

      We don't know if it is related, but in each of those 3 cases the error log also contained this error.

      OS WAIT ARRAY INFO: signal count 442592098
      RW-shared spins 0, rounds 140548689, OS waits 41314150
      RW-excl spins 0, rounds 868356577, OS waits 12754003
      RW-sx spins 5319566, rounds 96848713, OS waits 987940
      Spin rounds per wait: 140548689.00 RW-shared, 868356577.00 RW-excl, 18.21 RW-sx
      ------------------------
      LATEST FOREIGN KEY ERROR
      ------------------------
      2020-04-11 16:13:14 0x7f206e42e700 Transaction:
      TRANSACTION 51276290676, ACTIVE 0 sec updating or deleting
      mysql tables in use 2, locked 2
      4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
      MySQL thread id 270341, OS thread handle 139777265559296, query id 19109428451 192.168.42.11 redacteduser Updating
      delete from sessions where token='redactedToken'
      Foreign key constraint fails for table `RedactedDB`.`vpn_connections`:
      ,
        CONSTRAINT `fk_vpn_connections_1` FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`)
      Trying to delete or update in parent table, in index PRIMARY tuple:
      DATA TUPLE: 8 fields;
       0: len 32; hex 3861646565336535383530346131653763376438396463643938313036336634; asc 8adee3e58504a1e7c7d89dcd981063f4;;
       1: len 6; hex 000bf04e1a74; asc    N t;;
       2: len 7; hex 3100017bfe27ab; asc 1  { ' ;;
       3: len 4; hex 80031fff; asc     ;;
       4: len 4; hex 80068370; asc    p;;
       5: len 8; hex 8000125f4654bb2d; asc    _FT -;;
       6: len 8; hex 8000125f4654e23d; asc    _FT =;;
       7: len 4; hex 80000000; asc     ;;
       
      But in child table `RedactedDB`.`vpn_connections`, in index fk_vpn_connections_1, there is a record:
      PHYSICAL RECORD: n_fields 3; compact format; info bits 0
       0: len 30; hex 386164656533653538353034613165376337643839646364393831303633; asc 8adee3e58504a1e7c7d89dcd981063; (total 32 bytes);
       1: len 4; hex 800d4aa4; asc   J ;;
       2: len 4; hex 80068370; asc    p;;
      

      Also note that for each crash the 'redactedToken' is different; however, for a given crash, the same 'redactedToken' appears around 30 times.
      There is a nearly 5-hour gap between this last foreign key error and the mutex failure, so the two might be unrelated.
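
      For reference, a check along these lines (a sketch using the column names shown in the constraint above; the token literal is only a placeholder) would list the child rows that make such a DELETE fail:

      -- Which vpn_connections rows still reference the session being deleted?
      SELECT vc.*
      FROM `RedactedDB`.`vpn_connections` vc
      WHERE vc.session_id = 'redactedToken';

      -- If child rows are expected to disappear with the session, the constraint
      -- would need ON DELETE CASCADE (a schema choice, unrelated to the hang itself):
      -- ALTER TABLE `RedactedDB`.`vpn_connections`
      --   DROP FOREIGN KEY `fk_vpn_connections_1`,
      --   ADD CONSTRAINT `fk_vpn_connections_1`
      --     FOREIGN KEY (`session_id`) REFERENCES `sessions` (`token`)
      --     ON DELETE CASCADE;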
      This is seriously impacting our business; any ideas?

      Attached:

      • the error log
      • the list of MariaDB global variable values

      Attachments

        1. error_mariadb.gz
          355 kB
          David Delbecq
        2. gdb.txt
          566 kB
          Mark Zealey
        3. issue.txt
          52 kB
          Mark Zealey
        4. mariadb_variables
          20 kB
          David Delbecq

        Activity

          marko Marko Mäkelä added a comment -

          dade, because you are apparently using Galera replication, I am thinking that this report might be a duplicate of MDEV-21910.

          Hangs in the server are best debugged if you could attach a debugger to the hung process, or start the debugger on a core dump, and then extract the stack traces of all threads. In GDB, I would recommend the following statements:

          set print frame-arguments all
          set height 0
          set log on
          thread apply all backtrace
          

          mark.zealey Mark Zealey added a comment -

           We have also seen this exact same issue twice in the past couple of weeks, first on 10.4.13 and then yesterday on 10.4.14 after roughly 2 weeks of running (we upgraded after the first incident because we thought MDEV-21910 might be the cause). In our case we have very heavy reads & writes on a single server, and light reads/writes on the other 2. We are now shifting to focus all writes on a single server to see if this works around the issue. Unlike MDEV-21910 we have not been killing threads, which leads me to think this may well be a different bug, somehow triggered by load or a race condition. We have been slowly ramping up load on our platform for the past 4 months or so and never saw this bug before a couple of weeks ago, even though we had been running 10.4.13 the whole time.

           I've attached the gdb.txt you requested above, taken from the 10.4.14 core dump (CentOS 8 + debuginfo); the core was limited to 2 GB, so some data is missing. However, for some reason most of the symbols are not being resolved, so I'm not sure how useful it will be. Let me know anything else I can try to make it more readable.

          mark.zealey Mark Zealey added a comment -

           issue.txt is abridged syslog output that hopefully contains all the useful info. As connections stalled, many more connections kept coming in, up to the 3000-connection limit that we set.
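
           (For reference, a quick way to watch that pile-up from a separate session, assuming the node still accepts connections; a sketch, not taken from the report:)

           -- How close is the node to its connection limit while the stall is ongoing?
           SHOW GLOBAL VARIABLES LIKE 'max_connections';
           SHOW GLOBAL STATUS LIKE 'Threads_connected';
           -- And what the piled-up threads are doing:
           SHOW FULL PROCESSLIST;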


          People

            jplindst Jan Lindström (Inactive)
            dade David Delbecq
            Votes: 0
            Watchers: 5

