MariaDB Server / MDEV-25955

InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.


Details

    Description

      I have a setup with a read/write primary and two replicas which are not being used beyond replication monitoring.

      All 3 crashed at different times, within 2 months of an upgrade from 10.2.23, with very similar symptoms: InnoDB triggering an abort() (signal 6) because of a 600-second lock acquisition timeout, which then devolved into a segfault (signal 11) caused by a null-pointer dereference while producing the diagnostic log for the abort.

      The upgrade of all processes happened simultaneously (all 10.2 instances were stopped, then all 10.3 instances were started), and no process was restarted until the crashes.

      I did not witness the crashes happening on the replicas (we only found out once the replication monitoring notified us that they were unreachable), but I did witness a crash on the primary, and I noticed two distinct phases: a first, long lock sequence (lasting until I first noticed the issue), which somehow involved an apparent deadlock. In the attached mariadb-error log, this is the period from 2021-06-16 4:56:21 until 2021-06-16 6:20:24. Then there is a second phase, eventually showing a similar deadlock, which did lead to the abort(). This is the period from 2021-06-16 8:11:08 to the abort() at 2021-06-16 8:23:00.

      The deadlock first appears in the automatically-logged "show engine innodb status" at 2021-06-16 05:48:30. Every time, the deadlock involves the same thread id (140189330171648), which at some point is visible as "Main thread ID" (I initially thought the "Main thread ID" changed throughout the incident, but it actually stays the same). In any case, the deadlocking thread is always the same one.
      The report at 2021-06-16 06:20:26 is the last one of the first phase that still shows the deadlock: the deadlock somehow disappears, without a crash, in the report at 2021-06-16 06:20:46, which is also the last report of the first phase.

      Then, during the second phase, the deadlock reappears in the 2021-06-16 08:19:06 report and is present until the last report at 2021-06-16 08:22:46.

      In all cases, the main thread was: Process ID=33661, Main thread ID=140189330171648, state: enforcing dict cache limit
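
      For reference, the status reports quoted above are the InnoDB monitor output that the server writes to the error log once a semaphore wait becomes long; the same report can also be produced on demand. A minimal sketch, assuming a standard MariaDB 10.3 server where the innodb_status_output variables are available:

        -- Ask InnoDB to print its monitor output to the error log periodically
        -- (roughly every 15 seconds), including extended lock information.
        SET GLOBAL innodb_status_output = ON;
        SET GLOBAL innodb_status_output_locks = ON;

        -- Inspect the same report interactively; the SEMAPHORES section lists
        -- the long waits, and the "Main thread ID=..., state: ..." line quoted
        -- above appears near the end of the report.
        SHOW ENGINE INNODB STATUS\G

      The 600-second limit mentioned in the abort message appears to correspond to innodb_fatal_semaphore_wait_threshold, which defaults to 600 seconds.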

      Attachments

        Issue Links

          Activity

            People

              Assignee: Marko Mäkelä (marko)
              Reporter: Vincent Pelletier (vpelletier_nxd)
              Votes: 0
              Watchers: 2

