MariaDB Server / MDEV-25955

InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

Details

    Description

      I have a setup with a read/write primary and two replicas which are not being used beyond replication monitoring.

      All three crashed at different times within 2 months of an upgrade from 10.2.23, with very similar symptoms: InnoDB triggering an abort() (signal 6) because of a 600-second lock acquisition timeout, which then devolved into a segfault (signal 11) caused by a null-pointer dereference while producing the diagnostic log for the abort.

      The upgrade for all processes happened simultaneously (all 10.2 stopped, then all 10.3 started, and no process restart until the crashes).

      I did not witness the crashes happening on the replicas (we only found out once the replication monitoring notified us that they were unreachable), but I did witness a crash on the primary, and I noticed two distinct phases: a first, long lock sequence (lasting until I first noticed the issue), which somehow involved an apparent deadlock. In the attached mariadb-error log, this is the period from 2021-06-16 4:56:21 until 2021-06-16 6:20:24. Then there is a second phase, eventually showing a similar deadlock, which did lead to the abort(). This is the period from 2021-06-16 8:11:08 to the abort() at 2021-06-16 8:23:00.

      The deadlock first appears in the automatically-logged "show engine innodb status" at 2021-06-16 05:48:30. Every time, the deadlock involves the same thread id (140189330171648), which at some point is visible as "Main thread ID" - although, for whatever that is worth, the "Main thread ID" changes throughout the incident. EDIT: my bad, it stays the same. Still, the deadlocking thread is always the same one.
      The report at 2021-06-16 06:20:26 is the last one of the first phase that still shows the deadlock: the deadlock somehow disappears, without a crash, in the report at 2021-06-16 06:20:46, which is also the last report of the first phase.

      Then, during the second phase, the deadlock reappears in the 2021-06-16 08:19:06 report, and is present until the last report at 2021-06-16 08:22:46.

      In all cases, the main thread was: Process ID=33661, Main thread ID=140189330171648, state: enforcing dict cache limit
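
      For anyone looking at a live occurrence, here is a minimal sketch of how the same state can be captured while the hang develops (generic statements, not taken from the incident itself):

          -- Dump the InnoDB monitor output: it contains the semaphore wait list
          -- and the main thread state ("enforcing dict cache limit" above).
          SHOW ENGINE INNODB STATUS\G

          -- List transactions currently waiting on a lock, longest waiters first.
          SELECT trx_id, trx_state, trx_started, trx_wait_started,
                 trx_mysql_thread_id, trx_query
            FROM information_schema.INNODB_TRX
           WHERE trx_state = 'LOCK WAIT'
           ORDER BY trx_wait_started;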

      Attachments

        Issue Links

          Activity

            I am currently in the process of reading all changes from 10.3.27 to 10.3.29 (the latest 10.3 as of this writing), which led me to find MDEV-24275, which may be related.

            I do not know how I could confirm whether a background analyze was going on at the time of the locking issues. In any case, while we are not at a level of daily crashes, we do have weird optimiser behaviour (like electing a 5M-row index scan on one table instead of the 10k-row index scan on another table that we would expect), which did seem to take a while to develop after the version change.
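
            The closest I can think of is checking when the persistent statistics were last rewritten; a sketch, with stock used as a stand-in table name:

                -- When were the persistent statistics last rewritten for the table?
                SELECT database_name, table_name, last_update, n_rows
                  FROM mysql.innodb_table_stats
                 WHERE table_name = 'stock';

                -- Per-index statistics, including the sample size that was used.
                SELECT index_name, stat_name, stat_value, sample_size, last_update
                  FROM mysql.innodb_index_stats
                 WHERE table_name = 'stock';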

            vpelletier_nxd Vincent Pelletier added a comment

            vpelletier_nxd, there was also MDEV-24188, which could cause real hangs (not only extreme slowness, as MDEV-24275 did, if I remember correctly). I think that before spending more time investigating this hang on the affected releases 10.3.26 or 10.3.27, you should try (and hopefully fail) to reproduce it with 10.3.28 or a later release.

            marko Marko Mäkelä added a comment

            In my case I believe SELECT statements were running just fine up to the crash, while UPDATE/INSERT/REPLACE (and presumably DELETE) statements on the affected table were stuck at the time of the locking issue.

            I guess the most reliable way to force-trigger MDEV-24275 would be to run ANALYZE TABLE on one of these (the stock and category tables visible in the error log are quite large in row count), hopefully triggering the lock timeout, then to do the same on 10.3.29 and not trigger it. Unfortunately for realism, this could trigger a similar-looking issue which is still distinct... And so far the issue has happened only once per process over 2 months, and somehow earlier on the processes that are only busy with replication (the master is a lot busier than that).
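
            Concretely, the test I have in mind would look roughly like this (the column names are made up; only the table name comes from the error log):

                -- Session 1: keep a modification workload running against the table,
                -- e.g. by looping this from a client script.
                REPLACE INTO stock (id, quantity) VALUES (12345, 1);

                -- Session 2: force a statistics rebuild; on the affected release this
                -- would hopefully hit the lock timeout, and on 10.3.28+ it would not.
                ANALYZE TABLE stock;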

            What I can at least confirm is that these tables were visibly affected by MDEV-24275's consequences on the content of mysql.innodb_table_stats. We did notice a significant decrease in performance in the weeks after the version change. I tried updating the statistics on the primary by copying the values from one of the replicas, which (for some reason) had not updated these statistics since the version change, but sadly it did not seem to fix a known newly-bad query plan (despite running FLUSH TABLE innodb_index_stats). So I'm not sure what is going on on this front.
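
            For reference, the copy was along these lines (the numbers are placeholders; the real values were read from a replica's mysql.innodb_table_stats, and as far as I understand the affected user table then has to be flushed for InnoDB to reload the row):

                -- On the primary: overwrite the stats row with values taken from a
                -- replica that still had sane statistics (placeholder values here).
                REPLACE INTO mysql.innodb_table_stats
                  (database_name, table_name, last_update, n_rows,
                   clustered_index_size, sum_of_other_index_sizes)
                VALUES ('mydb', 'stock', NOW(), 5000000, 40000, 4000);

                -- Reload the modified statistics for that table.
                FLUSH TABLE stock;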

            In any case, I ran set global innodb_stats_auto_recalc=0; as recommended in the Debian bug report linked to MDEV-24275, to at least prevent destroying the table statistics any further - if not to prevent the next crash altogether.
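
            For completeness, the workaround in full; set global only lasts until the next restart, so it is worth also putting innodb_stats_auto_recalc = 0 under [mysqld] in my.cnf:

                -- Disable automatic recalculation of persistent statistics.
                SET GLOBAL innodb_stats_auto_recalc = 0;

                -- Double-check that it took effect.
                SHOW GLOBAL VARIABLES LIKE 'innodb_stats_auto_recalc';

            If I read the documentation correctly, a manual ANALYZE TABLE still updates the statistics; only the automatic background recalculation is disabled.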

            vpelletier_nxd Vincent Pelletier added a comment - edited

            I have a test copy of the database, which was running on 10.3.27. There, I could reproduce the same symptoms as the original crash, up to the abort + segfault, by creating some modification workload and manually running ANALYZE TABLE <a large table>.

            Then, I upgraded this database to 10.3.29 (I believe I cannot use .28 as I need mroonga, and I believe there was some breakage on .28), and I could not reproduce the problem with these steps: ANALYZE TABLE <a large table> now finishes in under a second.

            Given the similarity of the symptoms I could observe before the upgrade, I believe this bug is a duplicate of MDEV-24275, that upgrading to any version where MDEV-24275 is fixed is the proper solution, and that disabling InnoDB's automatic ANALYZE TABLE using set global innodb_stats_auto_recalc=0; should be a reliable workaround.

            Sorry for not finding that report before opening this one.

            vpelletier_nxd Vincent Pelletier added a comment

            I believe this report can be closed as a duplicate of MDEV-24275.

            I've been running on 10.3.29 for over a month now without any further lock acquisition timeout crash, and without the mysql.innodb_table_stats effects.

            vpelletier_nxd Vincent Pelletier added a comment

            Thank you, I am closing this as a duplicate of MDEV-24275.

            marko Marko Mäkelä added a comment

            People

              marko Marko Mäkelä
              vpelletier_nxd Vincent Pelletier