[MDEV-6105] Emoji unicode character string search query makes mariadb performance down Created: 2014-04-15  Updated: 2015-08-13  Resolved: 2014-06-05

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 5.5.37
Fix Version/s: 5.5.38, 10.0.12

Type: Bug Priority: Critical
Reporter: Seunguck Lee Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 5.6 x86_64


Attachments: HTML File smiley1-utf16.html    
Issue Links:
Relates
relates to MDEV-8613 Full table scan for WHERE indexed_var... Closed

 Description   

4 bytes emoji unicode character string search query on UTF8 table
could make mariadb performance down, if there's a lot of empty string('') index key.

Test case

CREATE TABLE test(
  id int AUTO_INCREMENT,
  fd varchar(20),
  primary key(id),
  index ix_fd(fd)
)engine=innodb default charset=UTF8;
 
INSERT INTO test(id, fd) VALUES (null, ''), (null, 'matt'), (null, 'pitt'), (null, 'lee'), (null, 'kim');
# Run 20 times below query
   INSERT INTO test SELECT * FROM test;
 
SELECT * FROM test WHERE fd=*some_4bytes_unicode_char*;

ref emoji unicode character from below site
http://apps.timwhitlock.info/emoji/tables/unicode
------------------------------------------------------------

4bytes emoji unicode character can't convert to utf8,
But mariadb doesn't check string conversion function's return code. (Check stack trace below)
And Xtradb will search empty string key from ix_fd index.
If ix_fd index has a lot of empty string key, then innodb will read every empty string key's record and pass it to mariadb engine.
But mariadb will check it with original 4 bytes emoji character. (evaluate_join_record()

{... select_cond_result= test(select_cond->val_int()); ...}

function of sql_select.cc)

So, client can't get anything from the query, but query takes a lot of time.
iF there's many same type of query, MariaDB server would be busy to read useless record from the table.

Stack trace of copy of use input (4 bytes emoji string)

Thread [18] 21629 [core: 1] (Suspended : Step)	
	Field_varstring::store() at field.cc:6,586 0x6d989c ==> RETURN 2	
	copy_inner() at sql_select.h:1,662 0x5cc27c	
	copy() at sql_select.h:1,552 0x5cc27c	
	create_ref_for_key() at sql_select.cc:8,102 0x5cc27 ==> Ignore return value on "sql_select.cc:8102  tmp.copy();"
	get_best_combination() at sql_select.cc:7,775 0x5d820f	
	make_join_statistics() at sql_select.cc:3,799 0x5ed818	
	JOIN::optimize() at sql_select.cc:1,218 0x5f0522	
	mysql_select() at sql_select.cc:3,065 0x5f943b	
	handle_select() at sql_select.cc:319 0x5f943b	
	execute_sqlcom_select() at sql_parse.cc:4,689 0x59dcad	
	<...more frames...>

==> Suggest
on tmp.copy() code line of sql_select.cc::create_ref_for_key(), MariaDB should check return value of copy().
and if there's some conversion error, MariaDB should throw ERROR not WARNING. or throw ERROR when STRICT_ALL_TABLES sql_mode is on.



 Comments   
Comment by Alexander Barkov [ 2014-04-22 ]

– The full SQL script demonstrating the problem:

SET NAMES utf8;
DROP TABLE IF EXISTS t1;
CREATE TABLE t1(
id int AUTO_INCREMENT,
fd varchar(20),
primary key(id),
index ix_fd(fd)
)engine=innodb default charset=UTF8;
INSERT INTO t1(id, fd) VALUES (null, ''),(null, 'matt'),(null, 'pitt'),(null, 'lee'),(null, 'kim');
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
INSERT INTO t1 (fd) SELECT fd FROM t1;
SELECT * FROM t1 WHERE fd='��';

Comment by Alexander Barkov [ 2014-04-22 ]

– A join with another table with an UTF8MB4 VARCHAR column is also slow:

DROP TABLE IF EXISTS t2;
CREATE TABLE t2 (fd VARCHAR(10)) CHARSET=UTF8MB4;
INSERT INTO t2 VALUES (_utf8mb4'��');
SELECT * FROM t1,t2 WHERE t1.fd=t2.fd;

Comment by Sergei Petrunia [ 2014-05-07 ]

I think, the problem starts to happen earlier than the location pointed by Matt74. get_best_combination() is called after join optimization. The first sign of the problem is in range optimization (i.e. before join optimization has started):

Debugging the query
SELECT * FROM t1 WHERE fd='��';

and being at:
#0 get_mm_leaf (...)
#1 0x000000000096d38e in get_mm_parts (...)
#2 0x000000000096c03f in get_func_mm_tree (...)
#3 0x000000000096c251 in get_full_func_mm_tree (...)
#4 0x000000000096cfe9 in get_mm_tree (...)
#5 0x0000000000961f9d in SQL_SELECT::test_quick_select (...)

the execution calls:

err= value->save_in_field_no_warnings(field, 1);

and gets err=2. However, it ignores the return value, and produces the range of empty strings - ['', '']. Then it proceeds to get #rows estimates for it, construct possible range access, and eventually convert it into ref(const) access with create_ref_for_key().

Comment by Sergei Petrunia [ 2014-05-07 ]

Questions:

  • Does the problem affect only range optimizer, or ref access is affected, too?
  • Can one fix the range optimizer by having get_mm_parts() return SEL_TREE(SEL_TREE::IMPOSSIBLE) when
    when value->save_in_field_no_warnings() returned 2? (This is probably much more complex than that. What if one runs
    t.key BETWEEN 'okay-string' AND '��' . Here, one endpoint can be represented in UTF-8, while the other cannot. If both endpoints cannot be represented, does it mean we can return SEL_TREE::IMPOSSIBLE from get_mm_parts?)
Comment by Sergei Petrunia [ 2014-05-07 ]

More similar smiley characters for the unicode input-challenged.

Comment by Sergei Petrunia [ 2014-05-07 ]

Codes of these characters http://apps.timwhitlock.info/emoji/tables/unicode

Comment by Sergei Petrunia [ 2014-05-07 ]

Re the example with join: note that index access on t1.fd is not used:

explain SELECT * FROM t1,t2 WHERE t1.fd=t2.fd;
+------+-------------+-------+-------+---------------+-------+---------+------+-------+--------------------------------------------------------------+
| id   | select_type | table | type  | possible_keys | key   | key_len | ref  | rows  | Extra                                                        |
+------+-------------+-------+-------+---------------+-------+---------+------+-------+--------------------------------------------------------------+
|    1 | SIMPLE      | t2    | ALL   | NULL          | NULL  | NULL    | NULL |     1 |                                                              |
|    1 | SIMPLE      | t1    | index | NULL          | ix_fd | 63      | NULL | 42652 | Using where; Using index; Using join buffer (flat, BNL join) |
+------+-------------+-------+-------+---------------+-------+---------+------+-------+--------------------------------------------------------------+

in debugger, one can see that the WHERE clause is converted to:

$52 = 0x198cc00 "(convert(`j2`.`t1`.`fd` using utf8mb4) = `j2`.`t2`.`fd`)"

Comment by Sergei Petrunia [ 2014-05-07 ]

The above means that ref access is not affected. The problem is only in range access.

For range access, we need to find out what range can be inferred from "t.key CMP $value" when $value cannot be represented in the charset of t.key.

Comment by Sergei Petrunia [ 2014-05-07 ]

For the record: direct mentions of non-utf8 characters in MySQL client (like fd='��') cause another bug, MDEV-6218 (which is repeatable on mysql also). That bug, however is different from this one.

Comment by Sergei Petrunia [ 2014-05-07 ]

So, this bug only repeats when we have

  • SET NAMES utf8;
  • then use a literal that is not a valid UTF-8 character.

If I attempt to use the right character set, I get an error:

MariaDB [j5]> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [j5]> explain SELECT * FROM t1 WHERE fd='��';
ERROR 1267 (HY000): Illegal mix of collations (utf8_general_ci,IMPLICIT) and (utf8mb4_general_ci,COERCIBLE) for operation '='

Comment by Sergei Petrunia [ 2014-05-07 ]

Notes from discussion with with bar: Let's explore how the comparison is done in a non-index case:

MariaDB [j5]> set names utf8;
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [j5]> create table t3 (val varchar(20)) CHARSET=utf8;
Query OK, 0 rows affected (0.86 sec)
 
MariaDB [j5]> insert into t3 values (''),('?'),('a');
Query OK, 3 rows affected (0.12 sec)
Records: 3  Duplicates: 0  Warnings: 0
 
MariaDB [j5]>  select val, val='��' from t3;
+------+------------+
| val  | val='��'     |
+------+------------+
|      |          0 |
| ?    |          0 |
| a    |          0 |
+------+------------+
3 rows in set (0.00 sec)
 
MariaDB [j5]> select collation('��');
+-------------------+
| collation('��')     |
+-------------------+
| utf8_general_ci   |
+-------------------+
1 row in set (0.00 sec)

That is, this utf-8 character that isn't really an utf-8 character is not equal to neither empty string, nor the '?' sign.

Comment by Sergei Petrunia [ 2014-05-07 ]

Debugging, one can see that the comparison is done in my_strnncollsp_utf8(), which has this code:

    s_res=my_utf8_uni(cs,&s_wc, s, se);
    t_res=my_utf8_uni(cs,&t_wc, t, te);
 
    if ( s_res <= 0 || t_res <= 0 )
    {
      /* Incorrect string, compare byte by byte value */
      return bincmp(s, se, t, te);
    }

that is, characters that don't fit into UTF-8 cause the comparison to be done in a byte-by-byte way.

Comment by Sergei Petrunia [ 2014-05-07 ]

... Now, if we want index lookups to work in the same way as comparison works with non-indexed columns:

  • "key_col= $unrepresentable_char" is always false, because $unrepresentable_char will not be byte-by-byte equal to value of key_col (attempt to INSERT an $unrepresentable_char into the table will force its conversion into a question mark)
  • as for "key_col < $unrepresentable_char", and other comparisons: byte-by-byte comparison doesn't map into any range in UTF-8 collation.
Comment by Sergei Petrunia [ 2014-05-07 ]

For non-equality comparisons, index scans are not equivalent to evaluating the WHERE (which is a separate bug):

MariaDB [j5]> select count(*) from t1 where fd <'��';
+----------+
| count(*) |
+----------+
|    16384 |
+----------+
1 row in set, 1 warning (0.08 sec)
 
MariaDB [j5]> select count(*) from t1 ignore index (ix_fd) where fd <'��';
+----------+
| count(*) |
+----------+
|    81920 |
+----------+
1 row in set (0.37 sec)

Comment by Sergei Petrunia [ 2014-05-07 ]

Possible solution:

  • "key_col=$unrepresentable_char" should produce SEL_TREE(SEL_TREE::IMPOSSIBLE)
  • all other "key_col CMP $unrepresentable_char" should produce NULL (i.e "cant use this for range access")

the second may cause queries to run slower than they ran before, but they will not produce incorrect results (like shown in the previous comment).

Comment by Sergei Petrunia [ 2014-05-08 ]

Oracle's MySQL has some related fixes:

  • Bug#11752201 ( jorgen.loland@oracle.com-20120525130050-mejp6m431gyx2z0d) - it fixes a different problem, but the effect is that they don't get the wrong query results like shown two comments earlier
  • A refactoring of Item::save_in_field and co.: jorgen.loland@oracle.com-20120507082918-pex86bcedbzemtsb .
Comment by Sergei Petrunia [ 2014-06-05 ]

Pushed a "simple" fix into 5.5, without backport of 5.6's refactoring. (porting 5.6's refactoring would not fully this this bug, so it would be overdoing it. refactoring should be pushed into 10.0)

Generated at Thu Feb 08 07:09:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.