[MDEV-17407] Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung. - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Incomplete
Affects Version/s: 10.2.14
Fix Version/s: N/A
Component/s: Storage Engine - InnoDB
Labels:
- need_feedback
Environment:
Windows Server 2012

Description

I seem to get this with ANY slow query, or even a DDL statement, on a large table. It's compounded by the server then insisting on doing a crash recovery, which seems to take hours.

Today a customer's server was running out of disk space, so I (foolishly) tried to purge some now-redundant audit stuff with a "delete where like". And now my customer looks like being without a server for an entire business day.
After about 3 hours of crash recovery I stopped the service (Windows, btw) and set innodb_force_recovery, but that has crapped on the tables, so I am now having to fetch 65 GB from off-site backup.

Is there a setting to suppress the initial timeout and 'intentional crash'?
Now I admit it was my fault for running such an expensive command today, but it is really much too easy to crash a production server. And as I said, I've had the same issue doing DDL (ALTER TABLE) on very large tables (50 GB).

Attachments

CE-APP.err
94 kB
2018-10-09 11:45

Activity

Marko Mäkelä added a comment - 2018-10-09 11:14

nigelgomm, you are being hit by the InnoDB watchdog. It is catching a long wait for an InnoDB internal RW-lock or mutex. These should normally not be held for more than a few milliseconds. The option that you asked for is called innodb_fatal_semaphore_wait_threshold. Its default value is 600 seconds. A hung server would typically not be killed after 600 seconds sharp, but more like after 900 seconds (15 minutes).
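
For reference, a minimal sketch of how that option might be set in the server configuration file. The value 1200 is purely illustrative, and placing the setting in the [mysqld] section of my.ini followed by a server restart is an assumption here (the variable is understood to be read at startup rather than changeable at runtime); check the documentation for your version.

```ini
[mysqld]
# Illustrative value only; the default is 600 seconds.
innodb_fatal_semaphore_wait_threshold = 1200
```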

Locks acquired by user transactions can be held for arbitrarily long times without InnoDB’s watchdog getting upset.

I would like to see more data so that we can reproduce and diagnose the issue. Can you upload the server error log?

Nigel Gomm added a comment - 2018-10-09 11:46

attached .err file

Elena Stepanova added a comment - 2018-10-09 15:21

Also your ini file(s) please.

Nigel Gomm added a comment - 2018-10-09 21:16

Is it hitting innodb_fatal_semaphore_wait_threshold that causes the crash recovery on restart? I.e. if I set it to the maximum and just abort a query that takes too long by closing the connection, will MariaDB insist on the crash recovery, or will it restart immediately?

It's the crash recovery on restart that is the biggest problem.

A query or DDL command, no matter how poorly constructed, should not take down a production server for 24 hours.

Or do I need to choose something other than InnoDB?

Vladislav Vaintroub added a comment - 2018-10-09 22:40 - edited

If you abort the query, MariaDB does not restart. If you can abort the query, that is.
MariaDB crashes on a "long semaphore wait" not just because it thinks that your query is taking too long. It crashes because it suspects an internal bug, a deadlock.
It is an abnormal termination, and with a larger timeout you would possibly have to wait longer for the crash to occur, but the result would be the same.

An example is MDEV-15707, a genuine deadlock related to the InnoDB change buffer, with symptoms quite similar to yours. A strong hint of MDEV-15707 is the presence of ibuf0ibuf.cc in the deadlock detector output. This bug was fixed in 10.2.15, but it is still present in the 10.2.14 that you have. Therefore my suggestion would be to upgrade to the latest 10.2 and see if you run into this bug again.
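
As a rough way to check an error log such as the attached CE-APP.err for that hint, the sketch below counts which source files appear in the long-wait lines. The line format matched by the regular expression ("--Thread N has waited at FILE line N for S seconds") is an assumption about what the watchdog output looks like, and `summarize_waits` is a hypothetical helper name, not part of MariaDB.

```python
import re
import sys
from collections import Counter

# Assumed shape of a long-semaphore-wait line in the InnoDB error log, e.g.
#   --Thread 1234 has waited at ibuf0ibuf.cc line 3560 for 241.00 seconds the semaphore:
WAIT_RE = re.compile(
    r"--Thread (\d+) has waited at (\S+) line (\d+) for ([\d.]+) seconds"
)


def summarize_waits(log_text):
    """Count long semaphore waits per source file named in the log text."""
    counts = Counter()
    for match in WAIT_RE.finditer(log_text):
        counts[match.group(2)] += 1  # group 2 is the source file name
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Read the log leniently: error logs may contain bytes that are not valid UTF-8.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        waits = summarize_waits(fh.read())
    for source_file, n in waits.most_common():
        print(f"{source_file}: {n} long wait(s)")
    if "ibuf0ibuf.cc" in waits:
        print("ibuf0ibuf.cc appears in the wait list (possible change buffer deadlock)")
```

Run it with the path to the error log as the only argument. A hit on ibuf0ibuf.cc is only a hint, as the comment above says, not a diagnosis.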

And in the unlikely event that the bug still occurs in the newer version, it would be nice if you could add core-file=1 to the [mysqld] section of your my.ini, so that if the deadlock/crash occurs again you can upload the mysqld.dmp from the data directory and attach it to the bug report. This would help us to better analyze what happens.
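
A sketch of what that my.ini change might look like, assuming the file already has (or can be given) a [mysqld] section; the server would presumably need a restart to pick it up.

```ini
[mysqld]
# Ask the server to write mysqld.dmp to the data directory if it crashes
core-file=1
```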

Nigel Gomm added a comment - 2018-10-10 09:59

ok... will do and thank you.

Elena Stepanova added a comment - 2018-11-07 14:39

nigelgomm, has the upgrade helped?

People

Assignee: Unassigned
Reporter: Nigel Gomm
Votes: 0
Watchers: 4

Dates

Created: 2018-10-09 11:05
Updated: 2018-12-07 00:59
Resolved: 2018-12-07 00:59
