Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25111

Long semaphore wait (> 800 secs), server stops responding

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Incomplete
    • 10.3.28
    • N/A

    Description

      We had been running MariaDB 10.3.27 with Galera cluster in production without any issues.

      Less than 26 hours after updating to 10.3.28 one of three servers stopped responding.

      The following errors were logged for more than 800 seconds until I restarted MariaDB:

      2021-03-10 15:10:45 0 [Warning] InnoDB: A long semaphore wait:
      --Thread 140231647176448 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
      Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
      2021-03-10 15:10:45 0 [Warning] InnoDB: A long semaphore wait:
      --Thread 140286587229952 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
      Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
      [clip]
      2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
      --Thread 140232309577472 has waited at row0row.cc line 1133 for 239.00 seconds the semaphore:
      X-lock on RW-latch at 0x7f8f0086a270 created in file buf0buf.cc line 1563
      a writer (thread id 140232091498240) has reserved it in mode  exclusive
      number of readers 0, waiters flag 1, lock_word: 0
      Last time write locked in file row0row.cc line 1133
      [clip]
      2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
      --Thread 140231088736000 has waited at lock0lock.cc line 3882 for 238.00 seconds the semaphore:
      Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
      2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
      --Thread 140232326362880 has waited at srv0srv.cc line 2026 for 238.00 seconds the semaphore:
      X-lock (wait_ex) on RW-latch at 0x55627310f3e0 created in file dict0dict.cc line 920
      a writer (thread id 140232326362880) has reserved it in mode  wait exclusive
      number of readers 4, waiters flag 1, lock_word: fffffffc
      Last time write locked in file srv0srv.cc line 2026
      2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
      --Thread 140231088121600 has waited at lock0lock.cc line 3882 for 238.00 seconds the semaphore:
      Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
      2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
      --Thread 140231087814400 has waited at lock0lock.cc line 3882 for 238.00 seconds the semaphore:
      Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
      

      I'll attach the full log from when the problem started. The log also contains InnoDB Monitor output.

      This sounds similar to MDEV-24375, but seems to be a different issue since MDEV-24275 should have fixed that in 10.3.28 if I got things right.

      Attachments

        Issue Links

          Activity

            emaijala, yes, this must be different from MDEV-24275 and MDEV-24188, which I suspect MDEV-24375 to be a duplicate report of. I would tend to believe that this has something to do with Galera, while the above tickets affect plain InnoDB and Galera alike.

            The InnoDB diagnostic output is virtually useless for diagnosing hangs or deadlocks. (In fact, it will be removed in 10.6: MDEV-21452.) We should find InnoDB hangs in our internal stress testing, and I am not currently aware of any hangs in InnoDB code. (But, I cannot speak about Galera.)

            For anything nontrivial, we really need stack traces of all threads during the hang or performance degradation. You can use some tool like http://poormansprofiler.org/ as well. But, without some stack trace output, I am afraid that we cannot proceed anywhere.

            Two notable Galera-related changes in the most recent releases that might have something to do with this were MDEV-23328 and MDEV-23536.

            marko Marko Mäkelä added a comment - emaijala , yes, this must be different from MDEV-24275 and MDEV-24188 , which I suspect MDEV-24375 to be a duplicate report of. I would tend to believe that this has something to do with Galera, while the above tickets affect plain InnoDB and Galera alike. The InnoDB diagnostic output is virtually useless for diagnosing hangs or deadlocks. (In fact, it will be removed in 10.6: MDEV-21452 .) We should find InnoDB hangs in our internal stress testing, and I am not currently aware of any hangs in InnoDB code. (But, I cannot speak about Galera.) For anything nontrivial, we really need stack traces of all threads during the hang or performance degradation. You can use some tool like http://poormansprofiler.org/ as well. But, without some stack trace output, I am afraid that we cannot proceed anywhere. Two notable Galera-related changes in the most recent releases that might have something to do with this were MDEV-23328 and MDEV-23536 .
            emaijala Ere Maijala added a comment - - edited

            @Marko Mäkelä, thanks for the information. I've now prepared to get stack traces in case we don't decide to downgrade to 10.3.27.

            emaijala Ere Maijala added a comment - - edited @Marko Mäkelä, thanks for the information. I've now prepared to get stack traces in case we don't decide to downgrade to 10.3.27.
            JDT JDT added a comment -

            We are also experiencing this issue since upgrading to 10.3.28. We're also using Galera and CentOS 7.9.2009. I've attached a log below.
            mysql-galera.log

            I will also try to get some more debug information from our Galera instance. I'll try to dump the running queries to a file every minute and try to narrow down the query that triggers the problem if possible.

            JDT JDT added a comment - We are also experiencing this issue since upgrading to 10.3.28. We're also using Galera and CentOS 7.9.2009. I've attached a log below. mysql-galera.log I will also try to get some more debug information from our Galera instance. I'll try to dump the running queries to a file every minute and try to narrow down the query that triggers the problem if possible.

            MDEV-24831 might be related to this. Those warnings can only appear in 10.6 debug builds, thanks to MDEV-21452.

            marko Marko Mäkelä added a comment - MDEV-24831 might be related to this. Those warnings can only appear in 10.6 debug builds, thanks to MDEV-21452 .

            People

              jplindst Jan Lindström (Inactive)
              emaijala Ere Maijala
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.