[MDEV-6170] Incorrect ordering with utf8_bin and utf8mb4_bin collations - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.10
Fix Version/s: 10.0.11
Component/s: None
Labels:
- character-set

Description

In MySQL 5.6 the utf8_bin and utf8mb4_bin collations were broken such that they ordered by only the high byte of the utf8 character (meaning ASCII sorts as though all characters were 0, resulting in random sort). This was fixed in the following commit (and maybe others):

https://github.com/darnaut/mysql-server/commit/5fe5a5b

This issue affects MariaDB 10 as well. See the attached simple test for this sorting bug.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

sort_order_funny.result
0.6 kB
2014-04-24 21:17
sort_order_funny.test
0.5 kB
2014-04-24 21:17

Issue Links

links to

Bug #69005 - Order by lower(column) produces out-of-order results

Activity

Ascending order - Click to sort in descending order

Jeremy Cole added a comment - 2014-04-24 22:13

Note that with only that patch applied, I get a failure in main.ctype_utf32:

assertion failed at sql/filesort.cc:927 in void make_sortkey(Sort_param*, uchar*, uchar*): sort_field->length >= length

Obviously some other change is required but I am not sure what.

Jeremy Cole added a comment - 2014-04-24 22:13 Note that with only that patch applied, I get a failure in main.ctype_utf32: assertion failed at sql/filesort.cc:927 in void make_sortkey(Sort_param*, uchar*, uchar*): sort_field->length >= length Obviously some other change is required but I am not sure what.

Jeremy Cole added a comment - 2014-04-25 02:31

Also note that the assert being hit seems correct; commenting it out causes utf32 to sort incorrectly and the test fails for that reason. On MySQL 5.6, the assert is never reached because `from != to` in the enclosing if(), which I think means the string is being reallocated elsewhere...

Jeremy Cole added a comment - 2014-04-25 02:31 Also note that the assert being hit seems correct; commenting it out causes utf32 to sort incorrectly and the test fails for that reason. On MySQL 5.6, the assert is never reached because `from != to` in the enclosing if(), which I think means the string is being reallocated elsewhere...

Jeremy Cole added a comment - 2014-04-25 02:37

Note that the upstream bug was this one:

http://bugs.mysql.com/bug.php?id=69005

Jeremy Cole added a comment - 2014-04-25 02:37 Note that the upstream bug was this one: http://bugs.mysql.com/bug.php?id=69005

Alexander Barkov added a comment - 2014-04-25 08:58

An SQL script demonstrating the problem:

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8;

INSERT INTO t1 (a) VALUES ("a");

INSERT INTO t1 (a) VALUES ("b");

INSERT INTO t1 (a) VALUES ("c");

SELECT * FROM t1 ORDER BY a COLLATE utf8_bin DESC;

+---+

| a |

+---+

| a |

| b |

| c |

+---+

3 rows in set (0.00 sec)

Notice, the order is not descending.

Alexander Barkov added a comment - 2014-04-25 08:58 An SQL script demonstrating the problem: DROP TABLE IF EXISTS t1; CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8; INSERT INTO t1 (a) VALUES ("a"); INSERT INTO t1 (a) VALUES ("b"); INSERT INTO t1 (a) VALUES ("c"); SELECT * FROM t1 ORDER BY a COLLATE utf8_bin DESC; +---+ | a | +---+ | a | | b | | c | +---+ 3 rows in set (0.00 sec) Notice, the order is not descending.

Alexander Barkov added a comment - 2014-04-25 08:59

A similar script (with COLLATE moved to the table definition and with LOWER instead of COLLATE in ORDER BY):

DROP TABLE IF EXISTS t1;

CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8 COLLATE utf8_bin;

INSERT INTO t1 (a) VALUES ("a");

INSERT INTO t1 (a) VALUES ("b");

INSERT INTO t1 (a) VALUES ("c");

SELECT * FROM t1 ORDER BY LOWER(a) DESC;

+---+

| a |

+---+

| a |

| b |

| c |

+---+

The order is not descending.

Alexander Barkov added a comment - 2014-04-25 08:59 A similar script (with COLLATE moved to the table definition and with LOWER instead of COLLATE in ORDER BY): DROP TABLE IF EXISTS t1; CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8 COLLATE utf8_bin; INSERT INTO t1 (a) VALUES ("a"); INSERT INTO t1 (a) VALUES ("b"); INSERT INTO t1 (a) VALUES ("c"); SELECT * FROM t1 ORDER BY LOWER(a) DESC; +---+ | a | +---+ | a | | b | | c | +---+ The order is not descending.

Alexander Barkov added a comment - 2014-04-25 16:33

Pushed into 10.0

Alexander Barkov added a comment - 2014-04-25 16:33 Pushed into 10.0

Pavel Ivanov added a comment - 2014-04-25 18:16

I guess the patch in question is http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/4164.
I'm very interested to know how replication safety of statements with LIMIT is related to this? Could you please explain?

Pavel Ivanov added a comment - 2014-04-25 18:16 I guess the patch in question is http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/4164 . I'm very interested to know how replication safety of statements with LIMIT is related to this? Could you please explain?

Jeremy Cole added a comment - 2014-04-25 19:30

I'd guess the replication unsafety thing was because the "CREATE TABLE t1 ENGINE=MEMORY AS SELECT REPEAT('a',5) AS a LIMIT 0" being used in the test was being marked as unsafe for replication, despite having a LIMIT 0 making it deterministic. However, fixing that bug within this fix is kind of unexpected (and I would note that the new behavior is not being specifically tested here).

Jeremy Cole added a comment - 2014-04-25 19:30 I'd guess the replication unsafety thing was because the "CREATE TABLE t1 ENGINE=MEMORY AS SELECT REPEAT('a',5) AS a LIMIT 0" being used in the test was being marked as unsafe for replication, despite having a LIMIT 0 making it deterministic. However, fixing that bug within this fix is kind of unexpected (and I would note that the new behavior is not being specifically tested here).

Jeremy Cole added a comment - 2014-04-25 19:33

Bar: Could you comment as to why MY_CS_STRNXFRM is not needed for my_charset_utf8mb4_bin? And/or why the other changes to filesort were not necessary here? Trying to understand the differences.

Jeremy Cole added a comment - 2014-04-25 19:33 Bar: Could you comment as to why MY_CS_STRNXFRM is not needed for my_charset_utf8mb4_bin? And/or why the other changes to filesort were not necessary here? Trying to understand the differences.

Alexander Barkov added a comment - 2014-04-28 10:00

Jeremy, you're right.
"CREATE TABLE t1 AS SELECT REPEAT('a',N) AS a LIMIT 0" has been always used
in tests to create a VARCHAR(N) column using the current value of @@collation_connection.
It worked fine with all tests not using binlog with statement format.

A fix for ~~MDEV-6170~~ added a new test into include/ctype_filesort.inc,
to cover as many collations as possible (not only utf8_bin).
As a result, the test binlog_stm_ctype_cp932 started to fail with the error
"Unsafe statement written to the binary log using statement...".
The easiest way was to make "SELECT .. LIMIT 0" deterministic,
which was erroneously not. So the patch actually fixes two problems.

The new behaviour IS tested in binlog_stm_ctype_cp932.

By the way, the old behaviour of "LIMIT 0" was not tested,
neither specifically, nor indirectly.

Alexander Barkov added a comment - 2014-04-28 10:00 Jeremy, you're right. "CREATE TABLE t1 AS SELECT REPEAT('a',N) AS a LIMIT 0" has been always used in tests to create a VARCHAR(N) column using the current value of @@collation_connection. It worked fine with all tests not using binlog with statement format. A fix for MDEV-6170 added a new test into include/ctype_filesort.inc, to cover as many collations as possible (not only utf8_bin). As a result, the test binlog_stm_ctype_cp932 started to fail with the error "Unsafe statement written to the binary log using statement...". The easiest way was to make "SELECT .. LIMIT 0" deterministic, which was erroneously not. So the patch actually fixes two problems. The new behaviour IS tested in binlog_stm_ctype_cp932. By the way, the old behaviour of "LIMIT 0" was not tested, neither specifically, nor indirectly.

Alexander Barkov added a comment - 2014-04-28 10:08

Jeremy,

MY_CS_STRNXFRM is needed for my_charset_utf8mb4_bin,
and it is there in ctype-utf8.c, around 8158:

struct charset_info_st my_charset_utf8mb4_bin=

  46,0,0,             /* number       */

  MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_STRNXFRM|MY_CS_UNICODE|

  MY_CS_UNICODE_SUPPLEMENT, /* state  */

MY_CS_STRNXFRM is needed for all collations (including utf8_bin and utf8mb4_bin)
that use something more complex than just a prefix of the original string to create a binary sortable image.

utf8_bin transforms variable length 1-3 bytes sequences into fixed length 2 byte sequences.
utf8mb4_bin transforms variable length 1-4 bytes sequences into fixed length 3 byte sequences.

These transformation are done for performance purposes: to reduce the length needed to sort data.

Alexander Barkov added a comment - 2014-04-28 10:08 Jeremy, MY_CS_STRNXFRM is needed for my_charset_utf8mb4_bin, and it is there in ctype-utf8.c, around 8158: struct charset_info_st my_charset_utf8mb4_bin= { 46,0,0, /* number */ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_STRNXFRM|MY_CS_UNICODE| MY_CS_UNICODE_SUPPLEMENT, /* state */ MY_CS_STRNXFRM is needed for all collations (including utf8_bin and utf8mb4_bin) that use something more complex than just a prefix of the original string to create a binary sortable image. utf8_bin transforms variable length 1-3 bytes sequences into fixed length 2 byte sequences. utf8mb4_bin transforms variable length 1-4 bytes sequences into fixed length 3 byte sequences. These transformation are done for performance purposes: to reduce the length needed to sort data.

Jeremy Cole added a comment - 2014-04-28 19:47

Bar,

Yes, I see that MY_CS_STRNXFRM is present already for utf8mb4, I was looking at a too-old version in my case.

I do however disagree that the new behavior is specifically tested; it's indirectly/incidentally tested as part of the ctype test. It should get a new test, with a decent name, to test specifically that LIMIT 0 causes the warning to not be generated while e.g. LIMIT 1 does not. It doesn't matter what was there before, it matters what you're changing and that we're improving the situation. This should get a new, specific test.

Jeremy Cole added a comment - 2014-04-28 19:47 Bar, Yes, I see that MY_CS_STRNXFRM is present already for utf8mb4, I was looking at a too-old version in my case. I do however disagree that the new behavior is specifically tested; it's indirectly/incidentally tested as part of the ctype test. It should get a new test, with a decent name, to test specifically that LIMIT 0 causes the warning to not be generated while e.g. LIMIT 1 does not. It doesn't matter what was there before, it matters what you're changing and that we're improving the situation. This should get a new, specific test.

People

Assignee:: Alexander Barkov

Reporter:: Jeremy Cole

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2014-04-24 21:17

Updated:: 2014-04-28 19:47

Resolved:: 2014-04-25 16:33

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration