[MDEV-25955] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung. - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Duplicate
Affects Version/s: 10.3.27
Fix Version/s: 10.2.37, 10.3.28, 10.4.18, 10.5.9
Component/s: Server, Storage Engine - InnoDB
Labels:
- crash
- innodb
Environment:
Linux x86_64

Description

I have a setup with a read/write primary and two replicas which are not being used beyond replication monitoring.

All 3 crashed at different times within 2 months of an upgrade from 10.2.23 with very similar symptoms: innodb triggering an abort() (signal 6) because of a 600s lock acquisition timeout, which further devolved into a segfault (signal 11) because of a null-pointer dereference during diagnostic log production for the abort.

The upgrade for all processes happened simultaneously (all 10.2 stopped, then all 10.3 started, and no process restart until the crashes).

I did not witness the crashes happening on the replicates (we only found out once the replication monitoring notified us of them being unreachable), but I just witnessed a crash on the primary, and I could notice that there were two distinct phases, a first long (until I first noticed the issue) lock sequence, which somehow involved an apparent deadlock. In the attached mariadb-error log, this is the period from 2021-06-16 4:56:21 until 2021-06-16 6:20:24. Then thre is a second phase, eventually showing a similar deadlock, and which did lead to the abort(). This is hte period from 2021-06-16 8:11:08 to the abort() at 2021-06-16 8:23:00.

The deadlock appears first in the automatically-logged "show engine innodb status" at 2021-06-16 05:48:30. Everytime, the deadlock involve the same thread id (140189330171648), which at some point is visible as "Main thread ID" - ~~but for whatever this means the "Main thread ID" changes throughout the incident~~ EDIT: my bad, it stays the same. Still, the deadlocking thread is always the same one.
At 2021-06-16 06:20:26 is the last report of the first phase which still shows the deadlock: it somehow disappears without crash on the report at 2021-06-16 06:20:46, which is also the last report of the first phase.

Then, during the second phase, the deadlock reappears at in the 2021-06-16 08:19:06 report, and is present until the last report at 2021-06-16 08:22:46.

In all cases, the main thread was: Process ID=33661, Main thread ID=140189330171648, state: enforcing dict cache limit

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mariadb_error.log-20210617.gz
1.75 MB
2021-06-18 03:17

Issue Links

duplicates

MDEV-24275 InnoDB persistent stats analyze forces full scan forcing lock crash

Closed

relates to

MDEV-24375 Semaphore wait has lasted > 600 seconds

Closed

Activity

People

Assignee:: Marko Mäkelä

Reporter:: Vincent Pelletier

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2021-06-18 03:19

Updated:: 2021-10-28 10:04

Resolved:: 2021-10-28 10:04