[MDEV-6161] server hangs in simple query with large strings (myisam) Created: 2014-04-24  Updated: 2015-06-06  Resolved: 2015-06-06

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 5.5.35, 5.5.36
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: vde Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

fedora-20


Attachments: Text File err.txt     Text File err1.txt     File hash.sql.gz     File mariadb.spec     File my.cnf.tgz     File trace.log     File udf_hash.so    

 Description   

some queries on sample table cause server hang:

Command | Time | State        | Info                                                                                                 
Query   | 1597 | Sending data | SELECT count(h.id)

with 0 disk activity

hanging query:

27 Query   SELECT count(h.id) FROM wp_bp_album_hash h
where  h.id!=8164 AND h.hashv !='' AND huffman_dist(h.hashv,'00BB0000 00000000 00BB9C80 00000000 00C09BF9 00000000 CCC00C04 000000CC DC4C0000 00000CCD FFDDCC0C 00CCCFFB DDFFDC40 00CFFFFD FFF3FFCC DDFDCCDD FFFFDD0C 0CCCCDDF CCCCC4CC 0C0DDDDD CCCC40C0 0000CCCC DCFFF400 0000C04C DCDDC000 00000CDD DD040000 0000CCCD C0000000 00000000 00000000 00000000')< 400

while next query works fine (in 2.941s):

27 Query     SELECT count(h.id) FROM wp_bp_album_hash h
where  h.id!=8164 AND h.hashk !='' AND huffman_dist(h.hashk,'00990000 00000000 00F99910 00000000 00009999 00000000 00000005 00000000 D4540000 0000004C FD444000 00044DFF CCDDC400 004D9DDC DFFFFF40 D99C444C FDDDCD00 000044CF 44444000 0009DC44 44400000 00000046 C4999400 00000044 D1DD4000 00000DDD 59000000 0000055D 00000000 00000000 00000000 00000000')< 390

huffman_dist - small UDF which calculates "BIT_COUNT( a ^ b )' word-by-word for LONG hashes (I can attach compiled one)

table in attachment



 Comments   
Comment by Elena Stepanova [ 2014-04-25 ]

Hi,

Yes, please attach the function. Can you (also) attach the source code?

Comment by vde [ 2014-04-25 ]

binary for UDF huffman_dist

Comment by vde [ 2014-04-25 ]

I'm too ashamed to this code will be here forever so I'll send it by email, OK?

Comment by Elena Stepanova [ 2014-04-25 ]

Sure, please do.
Alternatively, you can upload it to our FTP, the private section (ftp.askmonty.org/private). This way nobody except for MariaDB team will have access to it.

Comment by Elena Stepanova [ 2014-04-25 ]

Please also attach your cnf file(s) and/or output of SHOW VARIABLES.
So far I tried the query with your binary version of the function on a 32-bit machine and with one built from the source on a 64-bit machine, on MariaDB 5.5.35, 5.5.36 and 5.5.37, all works fine:

SELECT count(h.id) FROM wp_bp_album_hash h where  h.id!=8164 AND h.hashv !='' AND huffman_dist(h.hashv,'00BB0000 00000000 00BB9C80 00000000 00C09BF9 00000000 CCC00C04 000000CC DC4C0000 00000CCD FFDDCC0C 00CCCFFB DDFFDC40 00CFFFFD FFF3FFCC DDFDCCDD FFFFDD0C 0CCCCDDF CCCCC4CC 0C0DDDDD CCCC40C0 0000CCCC DCFFF400 0000C04C DCDDC000 00000CDD DD040000 0000CCCD C0000000 00000000 00000000 00000000')< 400;
+-------------+
| count(h.id) |
+-------------+
|         108 |
+-------------+
1 row in set (0.03 sec)

Comment by vde [ 2014-04-25 ]

I reproduced this problem on debug server so I can try to help localize it on my side.

PS. MariaDB built from sources by rpmbuild with attached SPEC

Comment by Elena Stepanova [ 2014-04-25 ]

Can you get a stack trace from the running process while it is hanging?
Something like gdb --batch --eval-command="thread apply all bt" <binary> <pid> would do (or anything of your choice as long as it produces the stack trace).

Comment by Elena Stepanova [ 2014-06-01 ]

Please comment to re-open the report when you have any new information.

Comment by vde [ 2014-06-04 ]

Sorry for timeout. Unfortunately I can not debug on the production server, and on debug server the bug occures periodically and is not caught.
Attachment - "mysqladmin proc" just before hanging

Comment by vde [ 2014-06-04 ]

Oops, I can't add attachment

Comment by vde [ 2014-06-09 ]

Seems that I can't add any attachments until reopening

Comment by Elena Stepanova [ 2014-06-09 ]

Re-opened, now you should be able to add the attachments.

Comment by vde [ 2014-06-09 ]

err.txt - "mysqladmin proc" just before hang

trace.log - stack trace when some threads hangs
err1 - "mysqladmin proc" just after trace.log

Comment by Elena Stepanova [ 2014-06-10 ]

Hi,

Thanks.
Did you take the stack trace snapshot once, or did you try it several times?
In the latter case, were there any changes in the stack traces between the snapshots?

Also, how is the CPU usage when it hangs?

Comment by vde [ 2014-06-10 ]

Once, there is no so much time before hangs.

CPU load about 0 - hanging threads don't eats CPU

Comment by Elena Stepanova [ 2014-06-10 ]

What about stack traces during hanging (not before)? Do they look any different?

I'm asking about stack traces and CPU because server that is really hanging (forever waiting for something) and server going through an endless loop often look the same – unavailable, irresponsive. The difference is that for a server in a loop consequent stack traces differ and CPU load is [relatively] high. Since your CPU is about 0, I assume it's not the case.

Comment by vde [ 2014-06-10 ]

>stack traces during hanging (not before)
I can't do much debug on production server because we lose all subscribers... "killall -9 mysqld" spells automatically when "mysqladmin ping" fails.

Yes its seems that threads are waiting for events forever and eventually use up the entire thread pool so server comes irresponsive.

Comment by Elena Stepanova [ 2015-06-06 ]

I tried again to reproduce it, but there is too much mystery in here.

The query quoted in the description hangs in 'Sending data', which can be anything, no way to know without the stack trace and more information (complete process list etc.). It can even be a problem in the UDF function, e.g. something causing an endless loop (only in this case CPU usage should be noticeable).

But the attached processlist and stack trace are from a different occasion, a query which starts with SELECT SQL_SMALL_RESULT h.id, h.hashv, huffman_dist(h.hashv,. It stops not on 'Sending data', but on 'Copying to tmp table'). Unfortunately, there is no complete query anywhere, so we can't even know why it wants to create a temporary table in the first place. Also, the stack trace does not look like something hanging, again it looks more like an endless loop – without a few consequent stack traces, there is no way to tell for sure.

So, if the problem still exists, we need:

  • at least a couple (better 3-4) stack traces taken with a short time interval;
  • complete processlist;
  • complete query which is being run;
  • server error log.
Generated at Thu Feb 08 07:09:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.