Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Duplicate
- Affects Version: 10.3.27
- Environment: Linux x86_64
Description
I have a setup with a read/write primary and two replicas which are not being used beyond replication monitoring.
All 3 crashed at different times within 2 months of an upgrade from 10.2.23, with very similar symptoms: InnoDB triggering an abort() (signal 6) because of a 600s lock acquisition timeout, which further devolved into a segfault (signal 11) caused by a null-pointer dereference while producing the diagnostic log for the abort.
The upgrade for all processes happened simultaneously (all 10.2 processes stopped, then all 10.3 processes started, with no process restarts until the crashes).
I did not witness the crashes on the replicas (we only found out once replication monitoring reported them unreachable), but I did witness a crash on the primary, and could see two distinct phases. The first phase was a long lock sequence (lasting until I first noticed the issue) that somehow involved an apparent deadlock; in the attached mariadb-error log, this is the period from 2021-06-16 4:56:21 until 2021-06-16 6:20:24. The second phase eventually showed a similar deadlock and did lead to the abort(); this is the period from 2021-06-16 8:11:08 to the abort() at 2021-06-16 8:23:00.
The deadlock first appears in the automatically-logged "show engine innodb status" output at 2021-06-16 05:48:30. Every time, the deadlock involves the same thread id (140189330171648), which at some point is visible as the "Main thread ID". For whatever this is worth, the "Main thread ID" changes throughout the incident (EDIT: my bad, it stays the same; still, the deadlocking thread is always the same one).
The report at 2021-06-16 06:20:26 is the last one of the first phase that still shows the deadlock: the deadlock somehow disappears, without a crash, in the report at 2021-06-16 06:20:46, which is also the last report of the first phase.
Then, during the second phase, the deadlock reappears in the 2021-06-16 08:19:06 report, and is present until the last report at 2021-06-16 08:22:46.
In all cases, the main thread was: Process ID=33661, Main thread ID=140189330171648, state: enforcing dict cache limit
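For anyone reproducing the analysis, the recurring main-thread line can be pulled out of the error log with a quick grep. This is just a sketch: the log excerpt below is a hypothetical stand-in for the attached mariadb-error log, and the path is made up — substitute the real attachment.

```shell
# Hypothetical excerpt in the format of the attached mariadb-error log
cat > /tmp/sample-error.log <<'EOF'
2021-06-16  5:48:30 0 [Note] InnoDB: automatic SHOW ENGINE INNODB STATUS output
Process ID=33661, Main thread ID=140189330171648, state: enforcing dict cache limit
2021-06-16  6:20:46 0 [Note] InnoDB: last report of the first phase
EOF

# Extract every status line mentioning the deadlocking main thread,
# with line numbers, to correlate against the report timestamps
grep -n 'Main thread ID=140189330171648' /tmp/sample-error.log
```

Counting occurrences of that line across the full log (one per status report) gives a quick timeline of how long the thread stayed stuck in "enforcing dict cache limit".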
Attachments
Issue Links
- duplicates: MDEV-24275 InnoDB persistent stats analyze forces full scan forcing lock crash (Closed)
- relates to: MDEV-24375 Semaphore wait has lasted > 600 seconds (Closed)