[MDEV-6170] Incorrect ordering with utf8_bin and utf8mb4_bin collations Created: 2014-04-24  Updated: 2014-04-28  Resolved: 2014-04-25

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.10
Fix Version/s: 10.0.11

Type: Bug Priority: Major
Reporter: Jeremy Cole Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: character-set

Attachments: File sort_order_funny.result     File sort_order_funny.test    

 Description   

In MySQL 5.6 the utf8_bin and utf8mb4_bin collations were broken such that they ordered by only the high byte of the utf8 character (meaning ASCII sorts as though all characters were 0, resulting in random sort). This was fixed in the following commit (and maybe others):

https://github.com/darnaut/mysql-server/commit/5fe5a5b

This issue affects MariaDB 10 as well. See the attached simple test for this sorting bug.



 Comments   
Comment by Jeremy Cole [ 2014-04-24 ]

Note that with only that patch applied, I get a failure in main.ctype_utf32:

assertion failed at sql/filesort.cc:927 in void make_sortkey(Sort_param*, uchar*, uchar*): sort_field->length >= length

Obviously some other change is required but I am not sure what.

Comment by Jeremy Cole [ 2014-04-25 ]

Also note that the assert being hit seems correct; commenting it out causes utf32 to sort incorrectly and the test fails for that reason. On MySQL 5.6, the assert is never reached because `from != to` in the enclosing if(), which I think means the string is being reallocated elsewhere...

Comment by Jeremy Cole [ 2014-04-25 ]

Note that the upstream bug was this one:

http://bugs.mysql.com/bug.php?id=69005

Comment by Alexander Barkov [ 2014-04-25 ]

An SQL script demonstrating the problem:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8;
INSERT INTO t1 (a) VALUES ("a");
INSERT INTO t1 (a) VALUES ("b");
INSERT INTO t1 (a) VALUES ("c");
SELECT * FROM t1 ORDER BY a COLLATE utf8_bin DESC;
+---+
| a |
+---+
| a |
| b |
| c |
+---+
3 rows in set (0.00 sec)

Notice, the order is not descending.

Comment by Alexander Barkov [ 2014-04-25 ]

A similar script (with COLLATE moved to the table definition and with LOWER instead of COLLATE in ORDER BY):

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8 COLLATE utf8_bin;
INSERT INTO t1 (a) VALUES ("a");
INSERT INTO t1 (a) VALUES ("b");
INSERT INTO t1 (a) VALUES ("c");
SELECT * FROM t1 ORDER BY LOWER(a) DESC;
+---+
| a |
+---+
| a |
| b |
| c |
+---+

The order is not descending.

Comment by Alexander Barkov [ 2014-04-25 ]

Pushed into 10.0

Comment by Pavel Ivanov [ 2014-04-25 ]

I guess the patch in question is http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/4164.
I'm very interested to know how replication safety of statements with LIMIT is related to this? Could you please explain?

Comment by Jeremy Cole [ 2014-04-25 ]

I'd guess the replication unsafety thing was because the "CREATE TABLE t1 ENGINE=MEMORY AS SELECT REPEAT('a',5) AS a LIMIT 0" being used in the test was being marked as unsafe for replication, despite having a LIMIT 0 making it deterministic. However, fixing that bug within this fix is kind of unexpected (and I would note that the new behavior is not being specifically tested here).

Comment by Jeremy Cole [ 2014-04-25 ]

Bar: Could you comment as to why MY_CS_STRNXFRM is not needed for my_charset_utf8mb4_bin? And/or why the other changes to filesort were not necessary here? Trying to understand the differences.

Comment by Alexander Barkov [ 2014-04-28 ]

Jeremy, you're right.
"CREATE TABLE t1 AS SELECT REPEAT('a',N) AS a LIMIT 0" has been always used
in tests to create a VARCHAR(N) column using the current value of @@collation_connection.
It worked fine with all tests not using binlog with statement format.

A fix for MDEV-6170 added a new test into include/ctype_filesort.inc,
to cover as many collations as possible (not only utf8_bin).
As a result, the test binlog_stm_ctype_cp932 started to fail with the error
"Unsafe statement written to the binary log using statement...".
The easiest way was to make "SELECT .. LIMIT 0" deterministic,
which was erroneously not. So the patch actually fixes two problems.

The new behaviour IS tested in binlog_stm_ctype_cp932.

By the way, the old behaviour of "LIMIT 0" was not tested,
neither specifically, nor indirectly.

Comment by Alexander Barkov [ 2014-04-28 ]

Jeremy,

MY_CS_STRNXFRM is needed for my_charset_utf8mb4_bin,
and it is there in ctype-utf8.c, around 8158:

struct charset_info_st my_charset_utf8mb4_bin=
{
  46,0,0,             /* number       */
  MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_STRNXFRM|MY_CS_UNICODE|
  MY_CS_UNICODE_SUPPLEMENT, /* state  */

MY_CS_STRNXFRM is needed for all collations (including utf8_bin and utf8mb4_bin)
that use something more complex than just a prefix of the original string to create a binary sortable image.

  • utf8_bin transforms variable length 1-3 bytes sequences into fixed length 2 byte sequences.
  • utf8mb4_bin transforms variable length 1-4 bytes sequences into fixed length 3 byte sequences.

These transformation are done for performance purposes: to reduce the length needed to sort data.

Comment by Jeremy Cole [ 2014-04-28 ]

Bar,

Yes, I see that MY_CS_STRNXFRM is present already for utf8mb4, I was looking at a too-old version in my case.

I do however disagree that the new behavior is specifically tested; it's indirectly/incidentally tested as part of the ctype test. It should get a new test, with a decent name, to test specifically that LIMIT 0 causes the warning to not be generated while e.g. LIMIT 1 does not. It doesn't matter what was there before, it matters what you're changing and that we're improving the situation. This should get a new, specific test.

Generated at Thu Feb 08 07:09:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.