[MDEV-5156] Long semaphore wait crash Created: 2013-10-18 Updated: 2013-11-28 Resolved: 2013-11-28 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | 5.5.33a |
| Fix Version/s: | 5.5.34 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | azurit | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | crash, semaphore | ||
| Environment: |
Debian Squeeze, 64 bit |
||
| Attachments: |
|
| Description |
|
Our MariaDB server is crashing on a long semaphore wait. It usually happens at night when database backups are running (but sometimes also in the middle of the day). I'm 100% sure it's related to one of our users, because when I moved him to a different server, the problem moved with him. There was enough memory, and we had the same problem with MySQL 5.5. Every time it happened, the mentioned user was running a simultaneous SELECT and TRUNCATE on the same table. |
| Comments |
| Comment by Elena Stepanova [ 2013-10-18 ] |
|
Hi, There is no truncate in the innodb status output in the attached log, so I assume you see it somewhere else. Do you have the general log enabled? Thanks. |
| Comment by azurit [ 2013-10-18 ] |
|
Here is the table name and structure: CREATE TABLE `index_products` ( The table is completely corrupted and unusable after every crash (attaching the FRM file; there is no IBD file for that table). Sorry, I was wrong about the simultaneous queries - they are both run at the same time in the same database by the same user, but they do not use the same table (i.e. the SELECT does not use the table 'index_products'). You can find the login of the mentioned user in the log file; search for '_dbs' (I don't want to write logins here). The database to which the table 'index_products' belongs is <login>_proeshop (but I can't see anything about it in the logs). Attaching a log from another crash where there is also a truncate (I also saw it in the process list while the problem was happening - I was able to connect via the root account). This time I killed MariaDB. |
| Comment by azurit [ 2013-10-21 ] |
|
Do you need anything more from me? Thank you. |
| Comment by Elena Stepanova [ 2013-10-22 ] |
|
Unfortunately, if there is no correlation with any particular flow as you supposed before (e.g. a select and truncate on the same table), there is little chance of reproducing it locally. A "long semaphore wait" crash is just a consequence of some kind of deadlock, which is almost always a concurrency problem. One thing we might try: if you are able to turn on the general log for a while, we can attempt to replay it later. But it only makes sense if you can also provide a full data dump (and people usually can't, since such problems tend to happen on real-life production data).
Would you be able to do that? |
| Comment by azurit [ 2013-10-22 ] |
|
I would like to do it, but I can't imagine us running with the general log enabled - that server handles 560 queries per second on average. Why does MySQL/MariaDB not log crash info that would be really useful for debugging? Isn't there anything else I can do? For example, output from the process list while the server is stuck. |
| Comment by Elena Stepanova [ 2013-10-22 ] |
|
>> Why does MySQL/MariaDB not log crash info that would be really useful for debugging? Sadly, this is one of the most obscure kinds of problems there are. Sometimes individual cases get fixed, but at any given time every distribution has "cold" unsolved mysteries of the "long semaphore wait" kind. >> Isn't there anything else I can do? For example, output from the process list while the server is stuck. Sure, please do that. Although, with 100+ connections running at once and 560 queries per second, we most likely won't be able to say which of the stuck queries caused the deadlock and which were just victims. Jan, |
| Comment by azurit [ 2013-10-22 ] |
|
>> Sure, please do that. Although, with 100+ connections running at once and 560 queries per second, we most likely won't be able to say which of the stuck queries caused the deadlock and which were just victims. That's true, BUT we can assume that the problematic queries have been running the longest, because all other queries were started after the server got stuck. If I'm able to get the process list from several crashes, we can compare them and find the query (or queries) common to all of them. Am I right? |
| Comment by Elena Stepanova [ 2013-10-22 ] |
|
>> That's true, BUT we can assume that the problematic queries have been running the longest, because all other queries were started after the server got stuck. If I'm able to get the process list from several crashes, we can compare them and find the query (or queries) common to all of them. Am I right? In theory, yes, which is why we've tried it a number of times; unfortunately, it almost never works out. Firstly, in any seriously concurrent workload there will be many queries started in the same second, some of them guilty and others not. Secondly, the queries in production are usually more or less the same; it's not like a randomized artificial test where the queries are totally different. And thirdly, the guilty ones are not necessarily the longest-running ones. Usually some long queries start working, then the guilty ones jump in and cause the problem, which has a domino effect and makes the earlier ones get stuck. That said, it's worth trying anyway; sometimes, although rarely, it does help. Also, it would be interesting to see how many connections were open at or around the moment the problem started. One of your logs shows that by the time the server crashed, the whole connection pool was used up (151 threads); but it's unclear whether that happened because earlier queries would not return and the applications opened new connections instead, or whether the problem was initially related to the connections being saturated. |
| Comment by azurit [ 2013-10-23 ] |
|
>> it would be interesting to see how many connections were open at or around the moment the problem started. One of your logs shows that by the time the server crashed, the whole connection pool was used up (151 threads); but it's unclear whether that happened because earlier queries would not return and the applications opened new connections instead, or whether the problem was initially related to the connections being saturated. In one case I was right at the computer when it happened and was able to log in with the root account. I saw lots of queries which were running but were definitely stuck. The server reached the max connections limit because no queries were able to complete. I also tried to kill some queries, but it wasn't possible (they kept running). I also tried to just stop the server with '/etc/init.d/mysql stop', but it wasn't working; I had to hard-kill it. I also took a screenshot of 'mytop', which I can upload if it's possible to mark it as not publicly available (you will see the query of that one user I was talking about almost at the top). |
| Comment by Elena Stepanova [ 2013-10-23 ] |
|
> which I can upload if it's possible to mark it as not publicly available |
| Comment by azurit [ 2013-10-23 ] |
|
Done, file name mysql-crash-mdev5156.jpg |
| Comment by Jan Lindström (Inactive) [ 2013-10-31 ] |
|
Hi, there are two mysterious lines in those logs: Oct 17 01:33:16 server02 mysqld: InnoDB: ###### Diagnostic info printed to the standard error stream Somehow the adaptive hash index and table statistics (performance schema) are colliding. Do you have the performance schema enabled? (Is it really needed?) Additionally, you could try to increase the long semaphore wait timeout, but that normally does not really resolve the issue if you do not have long-running transactions. Furthermore, you can try disabling the adaptive hash index. R: Jan |
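Jan's two suggestions (disabling the performance schema and the adaptive hash index) can be sketched as a my.cnf fragment; the variable names below are the standard MariaDB 5.5 system variables and should be verified against the documentation for the exact server version in use:

```ini
[mysqld]
# Disable the adaptive hash index (dynamic; can also be changed at
# runtime with: SET GLOBAL innodb_adaptive_hash_index = OFF)
innodb_adaptive_hash_index = 0

# Disable the performance schema if it is not needed (requires a restart)
performance_schema = OFF
```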
| Comment by azurit [ 2013-10-31 ] |
|
Jan, the attached file 'global_variables.txt' contains all global variables from that server - you are probably asking for this: Maybe that user is enabling it locally (if that's possible)? |
| Comment by Jan Lindström (Inactive) [ 2013-11-20 ] |
|
It seems that some of the table statistics are calculated even when performance_schema is off, especially when a table becomes empty (TRUNCATE TABLE or DELETE FROM <table>). |
| Comment by Jan Lindström (Inactive) [ 2013-11-28 ] |
|
Could you please try MariaDB 5.5.34 with the startup variable --innodb_use_stacktrace, and if long semaphore wait errors occur again, I would need the full error log. It should contain additional information that would greatly help identify the actual issue. |
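A minimal sketch of enabling the new diagnostic option via my.cnf instead of the command line (assuming MariaDB 5.5.34 or later, where this variable was introduced):

```ini
[mysqld]
# Produce stack traces of the waiting and holding threads when a
# long semaphore wait is detected (added in MariaDB 5.5.34)
innodb_use_stacktrace = ON
```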
| Comment by Jan Lindström (Inactive) [ 2013-11-28 ] |
|
Fixed some incorrect and illogical mutex operations and added a new configuration variable, innobase_use_stacktrace (default off), to produce a stack trace from the waiting and holding threads if a long semaphore wait is detected. These should help resolve possible future issues more easily. |
| Comment by azurit [ 2013-11-28 ] |
|
Thank you very much, I will try it and report back. |