[MDEV-6105] Emoji unicode character string search query makes mariadb performance down Created: 2014-04-15 Updated: 2015-08-13 Resolved: 2014-06-05 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | None |
| Affects Version/s: | 5.5.37 |
| Fix Version/s: | 5.5.38, 10.0.12 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Seunguck Lee | Assignee: | Sergei Petrunia |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
CentOS 5.6 x86_64 |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
4 bytes emoji unicode character string search query on UTF8 table Test case
ref emoji unicode character from below site 4bytes emoji unicode character can't convert to utf8, function of sql_select.cc) So, client can't get anything from the query, but query takes a lot of time. Stack trace of copy of use input (4 bytes emoji string)
==> Suggest |
| Comments |
| Comment by Alexander Barkov [ 2014-04-22 ] | |||||||||||||||||||||||||||
|
– The full SQL script demonstrating the problem: SET NAMES utf8; | |||||||||||||||||||||||||||
| Comment by Alexander Barkov [ 2014-04-22 ] | |||||||||||||||||||||||||||
|
– A join with another table with an UTF8MB4 VARCHAR column is also slow: DROP TABLE IF EXISTS t2; | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
I think, the problem starts to happen earlier than the location pointed by Matt74. get_best_combination() is called after join optimization. The first sign of the problem is in range optimization (i.e. before join optimization has started): Debugging the query and being at: the execution calls: err= value->save_in_field_no_warnings(field, 1); and gets err=2. However, it ignores the return value, and produces the range of empty strings - ['', '']. Then it proceeds to get #rows estimates for it, construct possible range access, and eventually convert it into ref(const) access with create_ref_for_key(). | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
Questions:
| |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
More similar smiley characters for the unicode input-challenged. | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
Codes of these characters http://apps.timwhitlock.info/emoji/tables/unicode | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
Re the example with join: note that index access on t1.fd is not used:
in debugger, one can see that the WHERE clause is converted to:
| |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
The above means that ref access is not affected. The problem is only in range access. For range access, we need to find out what range can be inferred from "t.key CMP $value" when $value cannot be represented in the charset of t.key. | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
For the record: direct mentions of non-utf8 characters in MySQL client (like fd='��') cause another bug, MDEV-6218 (which is repeatable on mysql also). That bug, however is different from this one. | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
So, this bug only repeats when we have
If I attempt to use the right character set, I get an error:
| |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
Notes from discussion with with bar: Let's explore how the comparison is done in a non-index case:
That is, this utf-8 character that isn't really an utf-8 character is not equal to neither empty string, nor the '?' sign. | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
Debugging, one can see that the comparison is done in my_strnncollsp_utf8(), which has this code:
that is, characters that don't fit into UTF-8 cause the comparison to be done in a byte-by-byte way. | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
... Now, if we want index lookups to work in the same way as comparison works with non-indexed columns:
| |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
For non-equality comparisons, index scans are not equivalent to evaluating the WHERE (which is a separate bug):
| |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-07 ] | |||||||||||||||||||||||||||
|
Possible solution:
the second may cause queries to run slower than they ran before, but they will not produce incorrect results (like shown in the previous comment). | |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-05-08 ] | |||||||||||||||||||||||||||
|
Oracle's MySQL has some related fixes:
| |||||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2014-06-05 ] | |||||||||||||||||||||||||||
|
Pushed a "simple" fix into 5.5, without backport of 5.6's refactoring. (porting 5.6's refactoring would not fully this this bug, so it would be overdoing it. refactoring should be pushed into 10.0) |