[MDEV-17407] Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung. Created: 2018-10-09  Updated: 2018-12-07  Resolved: 2018-12-07

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2.14
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Nigel Gomm Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: need_feedback
Environment:

windows server 2012


Attachments: File CE-APP.err    

 Description   

i seem to get this with ANY slow query or even DDL instruction on a large table. It's compounded by the server then insisting on doing a crash recovery which seems to take hours.

today a customer's server was running out of disk space so i (foolishly) tried to purge some now redundant audit stuff with a "delete where like". And now my customer looks like being without a server for an entire business day.
After about 3 hours of crashrecovery i stopped the service (windows btw) and set innodb_force_recovery but that has crapped on the tables. so now having to fetch 65gb from off-site backup.

Is there a setting to suppress the initial timeout and 'intentional crash'?
Now i admit it was my fault for running such an expensive command today but that is really much too easy to crash a production server. And as i said i've had same issue doing DDL (alter table) on v large tables (50gb)



 Comments   
Comment by Marko Mäkelä [ 2018-10-09 ]

nigelgomm, you are being hit by the InnoDB watchdog. It is catching a long wait for an InnoDB internal RW-lock or mutex. These should normally not be held for more than a few milliseconds. The option that you asked for is called innodb_fatal_semaphore_wait_threshold. Its default value is 600 seconds. A hung server would typically not be killed after 600 seconds sharp, but more like after 900 seconds (15 minutes).

Locks acquired by user transactions can be held for arbitrarily long times without InnoDB’s watchdog getting upset.

I would like to see more data so that we can reproduce and diagnose the issue. Can you upload the server error log?

Comment by Nigel Gomm [ 2018-10-09 ]

attached .err file

Comment by Elena Stepanova [ 2018-10-09 ]

Also your ini file(s) please.

Comment by Nigel Gomm [ 2018-10-09 ]

is it hitting innodb_fatal_semaphore_wait_threshold that causes the crashrecovery on restart? i.e. if i set it to maximum and just abort a query that takes too long by closing the connection..... will mariaDB insist on the crashrecovery or will it restart immediately.

It's the crash recovery on restart is the biggest problem.

A query or DDL command - no matter how poorly constructed - should not take down a production server for 24 hours.

or do i need to choose something other than innodb?

Comment by Vladislav Vaintroub [ 2018-10-09 ]

If you abort the query, MariaDB does not restart. If you can abort the query, that is.
MariaDB crashes in "long semaphore wait", not just because it thinks that your query takes too long. It crashes, because it suspects internal bug, a deadlock
It is abnormal termination, and possibly with larger timeout you would have to wait longer for crash to occur, but result would be the same.

A example is e.g in MDEV-15707 , is a genuine deadlock, related to innodb change buffer, with symptoms quite similar to yours,. A strong hint to MDEV-15707 is the presence of ibuf0ibuf.cc in the deadlock detector output. This bug was fixed in 10.2.15, but it is still present in 10.2.14 that you have. Therefore my suggestion would be to upgrade to latest 10.2 and see if you can run on this bug again.

And in the unlikely event that bug occurs still, in newer version, it would be nice , if you could add set core-file=1 in your my.ini [mysqld] section, so that if deadlock / crash occurs again, you could upload the mysqld.dmp from the data directory, and attach it to the bug report. This would help to analyze what happens better.

Comment by Nigel Gomm [ 2018-10-10 ]

ok... will do and thank you.

Comment by Elena Stepanova [ 2018-11-07 ]

nigelgomm, has the upgrade helped?

Generated at Thu Feb 08 08:36:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.