[MDEV-5156] Long semaphore wait crash Created: 2013-10-18  Updated: 2013-11-28  Resolved: 2013-11-28

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.33a
Fix Version/s: 5.5.34

Type: Bug Priority: Blocker
Reporter: azurit Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: crash, semaphore
Environment:

Debian Squeeze, 64 bit


Attachments: Text File global_variables.txt     File index_products.frm     File my.cnf     File mysql.log.gz     File mysql2.log.gz    

 Description   

Our MariaDB server is crashing on a long semaphore wait. It usually happens at night when database backups are running (but sometimes also in the middle of the day). I'm 100% sure it's related to one of our users, because when I moved him to a different server, the problem moved with him. There was enough memory, and we had the same problem with MySQL 5.5. Every time it happened, the mentioned user was running a simultaneous SELECT and TRUNCATE on the same table.



 Comments   
Comment by Elena Stepanova [ 2013-10-18 ]

Hi,

There is no truncate in the innodb status output in the attached log, so I assume you see it somewhere else. Do you have the general log enabled?
Which table is it? Can you provide the structure of the table (SHOW CREATE TABLE) and either a data dump or, if you cannot do that for security reasons, at least estimates of the data contents (number of rows, SHOW INDEX, SHOW TABLE STATUS and such)?

Thanks.

Comment by azurit [ 2013-10-18 ]

Here is the table name and structure:

CREATE TABLE `index_products` (
`id` int(11) NOT NULL,
`string` text NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

The table is completely corrupted and unusable after every crash (attaching the FRM file; there is no IBD file for that table).

Sorry, I was wrong about the simultaneous queries - they are both running at the same time, in the same database, by the same user, but they are not using the same table (i.e. the SELECT is not using the table 'index_products').

You can find the login of the mentioned user in the log file; search for '_dbs' (I don't want to write logins here). The database to which the table 'index_products' belongs is <login>_proeshop (but I can't see anything about it in the logs).

Attaching a log from another crash where a truncate also appears (I also saw it in the process list while the problem was happening - I was able to connect via the root account). This time I killed MariaDB.

Comment by azurit [ 2013-10-21 ]

Do you need anything more from me? Thank you.

Comment by Elena Stepanova [ 2013-10-22 ]

Unfortunately, if there is no correlation with any particular flow as you supposed before (e.g. a select and truncate on the same table), there is little chance of reproducing it locally. A "long semaphore wait" crash is just a consequence of some kind of deadlock, which is almost always a concurrency problem. One thing we might try: if you are able to turn on the general log for a while, we can attempt to replay it later. But it only makes sense if you can also provide a full data dump (and people usually can't, since such problems tend to happen on real-life production data).
So, if you can do that, please

  • run mysqldump (including both schema structure and data);
  • turn on general log to file (set global general_log=1, and also check general_log_file variable, and log_output should be FILE);
  • when the problem happens again, your server will get restarted, so general log will turn off automatically;
  • archive the dump and the general log and upload them to our ftp, the private section.

Would you be able to do that?
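The logging steps above can be sketched as a short SQL snippet (the log file path is just an example; adjust it to your setup):

```sql
-- Send the general log to a file, not a table
SET GLOBAL log_output = 'FILE';
-- Optional: choose where the log is written (example path)
SET GLOBAL general_log_file = '/var/log/mysql/general.log';
-- Start logging every statement the server receives
SET GLOBAL general_log = 1;

-- Later, to stop logging manually (a server restart also turns it off):
SET GLOBAL general_log = 0;
```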

Comment by azurit [ 2013-10-22 ]

I would like to do it, but I can't imagine us running with the general log enabled - that server is doing 560 queries per second on average. Why is MySQL/MariaDB not logging crash info that would be really useful for debugging? Isn't there really anything else I can do? For example, output from the process list while the server is stuck.

Comment by Elena Stepanova [ 2013-10-22 ]

>> Why is MySQL/MariaDB not logging crash info that would be really useful for debugging?
Well, it tries. But the catch is, if anyone knew what exactly is useful in this case, there would have been no case at all: one would have fixed the bug instead of logging it. The server can only log what it has at the moment, not the historical info that led to the problem - if it did, it would have to log much more than just the general log, basically a full trace, which is of course totally impossible in real life. Moreover, when it comes to a testing environment, the more potentially useful logging is turned on, the less reproducible the problem is, since everything runs slower and the probability of the race condition drops.

Sadly, this is one of the most obscure kinds of problems there is. Sometimes cases get fixed, but at any given time every distribution has "cold" unsolved mysteries of the "long semaphore wait" kind:
http://bugs.mysql.com/bug.php?id=59727
https://bugs.launchpad.net/percona-server/+bug/1074255
etc.

>> Isn't there really anything else I can do? For example, output from the process list while the server is stuck.

Sure, please do that. Although, with 100+ connections running at once and 560 queries per second, we most likely won't be able to tell which of the stuck queries caused the deadlock and which were just victims.
Please also attach your my.cnf file or SHOW GLOBAL VARIABLES output.

Jan,
Could you please take a look at InnoDB status attached to the report (mysql.log.gz, mysql2.log.gz), maybe you'll see something suspicious there – i mean, apart from the whole hanging thing.

Comment by azurit [ 2013-10-22 ]

>> Sure, please do that. Although, with 100+ connections running at once and 560 queries per second, we most likely won't be able to tell which of the stuck queries caused the deadlock and which were just victims.

That's true, BUT we can assume that the problematic queries are the longest-running ones, because all other queries were started after the server got stuck. If I'm able to get the process list from several crashes, we can compare them and find the query (or queries) common to all of them. Am I right?
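As a rough sketch of that comparison (the capture files here are faked inline for illustration; in a real incident they would come from `mysql -uroot -e 'SHOW FULL PROCESSLIST\G' > crashN.txt` while the server is stuck):

```shell
# Two hypothetical process-list captures from separate hangs, reduced to
# their query lines. In reality you would grep the 'Info:' lines out of
# the real SHOW FULL PROCESSLIST output.
printf 'Info: TRUNCATE index_products\nInfo: SELECT 1\n' > crash1.txt
printf 'Info: TRUNCATE index_products\nInfo: SELECT 2\n' > crash2.txt

# comm(1) needs sorted input; -u drops duplicate queries within a capture
sort -u crash1.txt > q1.txt
sort -u crash2.txt > q2.txt

# Queries present in every capture are the prime suspects
comm -12 q1.txt q2.txt    # -> Info: TRUNCATE index_products
```

With more than two captures, chain the intersection: intersect the first two, then intersect that result with the third, and so on.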

Comment by Elena Stepanova [ 2013-10-22 ]

>> That's true, BUT we can assume that the problematic queries are the longest-running ones, because all other queries were started after the server got stuck. If I'm able to get the process list from several crashes, we can compare them and find the query (or queries) common to all of them. Am I right?

In theory, yes, which is why we've tried it a number of times; unfortunately, it almost never works out. Firstly, in any seriously concurrent workflow there will be many queries started in the same second, some of them guilty and others not. Secondly, the queries in production are usually more or less the same; it's not like a randomized artificial test where the queries are all different. And thirdly, the guilty ones are not necessarily the longest-running ones. There are usually some long queries that start working, then the guilty ones jump in and cause the problem, which in turn has a domino effect and makes the earlier ones get stuck.

That said, it's worth trying anyway, sometimes, although rarely, it does help.

Also, it would be interesting to see how many connections were open at or around the moment the problem starts. One of your logs shows that by the time the server crashed, the whole connection pool was used up (151 threads); but it's unclear whether that happened because previous queries would not return and applications opened new connections instead, or whether the problem was initially related to the connections being saturated.

Comment by azurit [ 2013-10-23 ]

>> it would be interesting to see how many connections were open at or around the moment the problem starts. One of your logs shows that by the time the server crashed, the whole connection pool was used up (151 threads); but it's unclear whether that happened because previous queries would not return and applications opened new connections instead, or whether the problem was initially related to the connections being saturated.

In one case I was right at the computer when it happened, and I was able to log in with the root account. I saw lots of queries which were running but were definitely stuck. The server had reached the max connections limit because no queries were able to complete. I also tried to kill some queries, but it wasn't possible (they kept running). I also tried to just stop the server with '/etc/init.d/mysql stop', but that wasn't working; I had to hard-kill it. I also took a screenshot of 'mytop', which I can upload if it's possible to mark it as not publicly available (you will see the query of that one user I was talking about almost at the top).

Comment by Elena Stepanova [ 2013-10-23 ]

> which I can upload if it's possible to mark it as not publicly available
You can't add a private attachment here in JIRA (at least I haven't found how it could be done yet), but you can upload it to our ftp at ftp.askmonty.org/private. Just mention mdev5156 somewhere in the file name.

Comment by azurit [ 2013-10-23 ]

Done, file name mysql-crash-mdev5156.jpg

Comment by Jan Lindström (Inactive) [ 2013-10-31 ]

Hi,

There are two mysterious entries in those logs:

Oct 17 01:33:16 server02 mysqld: InnoDB: ###### Diagnostic info printed to the standard error stream
Oct 17 01:33:17 server02 mysqld: InnoDB: Warning: a long semaphore wait:
Oct 17 01:33:17 server02 mysqld: --Thread 140172196787968 has waited at dict0dict.c line 299 for 272.00 seconds the semaphore:
Oct 17 01:33:17 server02 mysqld: X-lock on RW-latch at 0x7f7d15e7a5c8 '&dict_table_stats_latches[i]'
Oct 17 01:33:17 server02 mysqld: a writer (thread id 140172213294848) has reserved it in mode exclusive
Oct 17 01:33:17 server02 mysqld: number of readers 0, waiters flag 1, lock_word: 0
Oct 17 01:33:17 server02 mysqld: Last time read locked in file dict0dict.c line 296
Oct 17 01:33:17 server02 mysqld: Last time write locked in file /home/buildbot/buildbot/build/mariadb-5.5.33a/storage/xtradb/dict/dict0dict.c line 299
Oct 17 01:33:17 server02 mysqld: InnoDB: Warning: a long semaphore wait:
Oct 17 01:33:17 server02 mysqld: --Thread 140172213294848 has waited at btr0sea.c line 1203 for 272.00 seconds the semaphore:
Oct 17 01:33:17 server02 mysqld: X-lock (wait_ex) on RW-latch at 0x7f7d16d897f8 'btr_search_latch_part[i]'
Oct 17 01:33:17 server02 mysqld: a writer (thread id 140172213294848) has reserved it in mode wait exclusive
Oct 17 01:33:17 server02 mysqld: number of readers 1, waiters flag 1, lock_word: ffffffffffffffff
Oct 17 01:33:17 server02 mysqld: Last time read locked in file btr0sea.c line 1120
Oct 17 01:33:17 server02 mysqld: Last time write locked in file /home/buildbot/buildbot/build/mariadb-5.5.33a/storage/xtradb/btr/btr0sea.c line 675

Somehow the adaptive hash index and the table statistics (performance schema) latches are colliding. Do you have the performance schema enabled? (Is it really needed?) Additionally, you could try to increase the long semaphore wait timeout, but that normally does not resolve the issue unless you have long-running transactions. Furthermore, you can try to disable the adaptive hash index.
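As a sketch, disabling the adaptive hash index does not require a restart; `innodb_adaptive_hash_index` is a dynamic global variable in MySQL/MariaDB 5.5:

```sql
-- Turn off the adaptive hash index at runtime:
SET GLOBAL innodb_adaptive_hash_index = OFF;

-- To make the change survive restarts, also add to my.cnf under [mysqld]:
--   innodb_adaptive_hash_index = 0
```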

R: Jan

Comment by azurit [ 2013-10-31 ]

Jan,

I attached the file 'global_variables.txt' with all global variables from that server - you are probably looking for this one:
performance_schema OFF

Maybe that user is enabling it locally (if that's even possible)?

Comment by Jan Lindström (Inactive) [ 2013-11-20 ]

It seems that some of the table statistics are calculated even when performance_schema is off, especially when a table becomes empty (TRUNCATE TABLE or DELETE FROM <table>).

Comment by Jan Lindström (Inactive) [ 2013-11-28 ]

Could you please try MariaDB 5.5.34 with the startup variable --innodb_use_stacktrace? If long semaphore wait errors appear again, I would need the full error log. It should contain additional information that would greatly help identify the actual issue.
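For reference, a minimal my.cnf fragment enabling this (the option name follows the comment above; it only exists from 5.5.34 on):

```ini
# my.cnf fragment for MariaDB 5.5.34+:
# print stack traces of the waiting and holding threads
# when a long semaphore wait is detected
[mysqld]
innodb_use_stacktrace = 1
```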

Comment by Jan Lindström (Inactive) [ 2013-11-28 ]

Fixed some incorrect and illogical mutex operations, and added a new configuration variable, innobase_use_stacktrace (default off), to produce a stack trace from the waiting and holding threads when a long semaphore wait is detected. This should help resolve possible future issues more easily.

Comment by azurit [ 2013-11-28 ]

Thank you very much, I will try it and report back.

Generated at Thu Feb 08 07:02:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.