Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17407

Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

Details

    Description

      i seem to get this with ANY slow query or even DDL instruction on a large table. It's compounded by the server then insisting on doing a crash recovery which seems to take hours.

      today a customer's server was running out of disk space so i (foolishly) tried to purge some now redundant audit stuff with a "delete where like". And now my customer looks like being without a server for an entire business day.
      After about 3 hours of crashrecovery i stopped the service (windows btw) and set innodb_force_recovery but that has crapped on the tables. so now having to fetch 65gb from off-site backup.

      Is there a setting to suppress the initial timeout and 'intentional crash'?
      Now i admit it was my fault for running such an expensive command today but that is really much too easy to crash a production server. And as i said i've had same issue doing DDL (alter table) on v large tables (50gb)

      Attachments

        Activity

          nigelgomm, you are being hit by the InnoDB watchdog. It is catching a long wait for an InnoDB internal RW-lock or mutex. These should normally not be held for more than a few milliseconds. The option that you asked for is called innodb_fatal_semaphore_wait_threshold. Its default value is 600 seconds. A hung server would typically not be killed after 600 seconds sharp, but more like after 900 seconds (15 minutes).

          Locks acquired by user transactions can be held for arbitrarily long times without InnoDB’s watchdog getting upset.

          I would like to see more data so that we can reproduce and diagnose the issue. Can you upload the server error log?

          marko Marko Mäkelä added a comment - nigelgomm , you are being hit by the InnoDB watchdog. It is catching a long wait for an InnoDB internal RW-lock or mutex. These should normally not be held for more than a few milliseconds. The option that you asked for is called innodb_fatal_semaphore_wait_threshold . Its default value is 600 seconds. A hung server would typically not be killed after 600 seconds sharp, but more like after 900 seconds (15 minutes). Locks acquired by user transactions can be held for arbitrarily long times without InnoDB’s watchdog getting upset. I would like to see more data so that we can reproduce and diagnose the issue. Can you upload the server error log?
          nigelgomm Nigel Gomm added a comment -

          attached .err file

          nigelgomm Nigel Gomm added a comment - attached .err file

          Also your ini file(s) please.

          elenst Elena Stepanova added a comment - Also your ini file(s) please.
          nigelgomm Nigel Gomm added a comment -

          is it hitting innodb_fatal_semaphore_wait_threshold that causes the crashrecovery on restart? i.e. if i set it to maximum and just abort a query that takes too long by closing the connection..... will mariaDB insist on the crashrecovery or will it restart immediately.

          It's the crash recovery on restart is the biggest problem.

          A query or DDL command - no matter how poorly constructed - should not take down a production server for 24 hours.

          or do i need to choose something other than innodb?

          nigelgomm Nigel Gomm added a comment - is it hitting innodb_fatal_semaphore_wait_threshold that causes the crashrecovery on restart? i.e. if i set it to maximum and just abort a query that takes too long by closing the connection..... will mariaDB insist on the crashrecovery or will it restart immediately. It's the crash recovery on restart is the biggest problem. A query or DDL command - no matter how poorly constructed - should not take down a production server for 24 hours. or do i need to choose something other than innodb?
          wlad Vladislav Vaintroub added a comment - - edited

          If you abort the query, MariaDB does not restart. If you can abort the query, that is.
          MariaDB crashes in "long semaphore wait", not just because it thinks that your query takes too long. It crashes, because it suspects internal bug, a deadlock
          It is abnormal termination, and possibly with larger timeout you would have to wait longer for crash to occur, but result would be the same.

          A example is e.g in MDEV-15707 , is a genuine deadlock, related to innodb change buffer, with symptoms quite similar to yours,. A strong hint to MDEV-15707 is the presence of ibuf0ibuf.cc in the deadlock detector output. This bug was fixed in 10.2.15, but it is still present in 10.2.14 that you have. Therefore my suggestion would be to upgrade to latest 10.2 and see if you can run on this bug again.

          And in the unlikely event that bug occurs still, in newer version, it would be nice , if you could add set core-file=1 in your my.ini [mysqld] section, so that if deadlock / crash occurs again, you could upload the mysqld.dmp from the data directory, and attach it to the bug report. This would help to analyze what happens better.

          wlad Vladislav Vaintroub added a comment - - edited If you abort the query, MariaDB does not restart. If you can abort the query, that is. MariaDB crashes in "long semaphore wait", not just because it thinks that your query takes too long. It crashes, because it suspects internal bug, a deadlock It is abnormal termination, and possibly with larger timeout you would have to wait longer for crash to occur, but result would be the same. A example is e.g in MDEV-15707 , is a genuine deadlock, related to innodb change buffer, with symptoms quite similar to yours,. A strong hint to MDEV-15707 is the presence of ibuf0ibuf.cc in the deadlock detector output. This bug was fixed in 10.2.15, but it is still present in 10.2.14 that you have. Therefore my suggestion would be to upgrade to latest 10.2 and see if you can run on this bug again. And in the unlikely event that bug occurs still, in newer version, it would be nice , if you could add set core-file=1 in your my.ini [mysqld] section, so that if deadlock / crash occurs again, you could upload the mysqld.dmp from the data directory, and attach it to the bug report. This would help to analyze what happens better.
          nigelgomm Nigel Gomm added a comment -

          ok... will do and thank you.

          nigelgomm Nigel Gomm added a comment - ok... will do and thank you.

          nigelgomm, has the upgrade helped?

          elenst Elena Stepanova added a comment - nigelgomm , has the upgrade helped?

          People

            Unassigned Unassigned
            nigelgomm Nigel Gomm
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.