Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6170

Incorrect ordering with utf8_bin and utf8mb4_bin collations

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.10
    • 10.0.11
    • None

    Description

      In MySQL 5.6 the utf8_bin and utf8mb4_bin collations were broken such that they ordered by only the high byte of the utf8 character (meaning ASCII sorts as though all characters were 0, resulting in random sort). This was fixed in the following commit (and maybe others):

      https://github.com/darnaut/mysql-server/commit/5fe5a5b

      This issue affects MariaDB 10 as well. See the attached simple test for this sorting bug.

      Attachments

        Activity

          jeremycole Jeremy Cole added a comment -

          Note that with only that patch applied, I get a failure in main.ctype_utf32:

          assertion failed at sql/filesort.cc:927 in void make_sortkey(Sort_param*, uchar*, uchar*): sort_field->length >= length

          Obviously some other change is required but I am not sure what.

          jeremycole Jeremy Cole added a comment - Note that with only that patch applied, I get a failure in main.ctype_utf32: assertion failed at sql/filesort.cc:927 in void make_sortkey(Sort_param*, uchar*, uchar*): sort_field->length >= length Obviously some other change is required but I am not sure what.
          jeremycole Jeremy Cole added a comment -

          Also note that the assert being hit seems correct; commenting it out causes utf32 to sort incorrectly and the test fails for that reason. On MySQL 5.6, the assert is never reached because `from != to` in the enclosing if(), which I think means the string is being reallocated elsewhere...

          jeremycole Jeremy Cole added a comment - Also note that the assert being hit seems correct; commenting it out causes utf32 to sort incorrectly and the test fails for that reason. On MySQL 5.6, the assert is never reached because `from != to` in the enclosing if(), which I think means the string is being reallocated elsewhere...
          jeremycole Jeremy Cole added a comment -

          Note that the upstream bug was this one:

          http://bugs.mysql.com/bug.php?id=69005

          jeremycole Jeremy Cole added a comment - Note that the upstream bug was this one: http://bugs.mysql.com/bug.php?id=69005

          An SQL script demonstrating the problem:

          DROP TABLE IF EXISTS t1;
          CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8;
          INSERT INTO t1 (a) VALUES ("a");
          INSERT INTO t1 (a) VALUES ("b");
          INSERT INTO t1 (a) VALUES ("c");
          SELECT * FROM t1 ORDER BY a COLLATE utf8_bin DESC;
          +---+
          | a |
          +---+
          | a |
          | b |
          | c |
          +---+
          3 rows in set (0.00 sec)

          Notice, the order is not descending.

          bar Alexander Barkov added a comment - An SQL script demonstrating the problem: DROP TABLE IF EXISTS t1; CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8; INSERT INTO t1 (a) VALUES ("a"); INSERT INTO t1 (a) VALUES ("b"); INSERT INTO t1 (a) VALUES ("c"); SELECT * FROM t1 ORDER BY a COLLATE utf8_bin DESC; +---+ | a | +---+ | a | | b | | c | +---+ 3 rows in set (0.00 sec) Notice, the order is not descending.

          A similar script (with COLLATE moved to the table definition and with LOWER instead of COLLATE in ORDER BY):

          DROP TABLE IF EXISTS t1;
          CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8 COLLATE utf8_bin;
          INSERT INTO t1 (a) VALUES ("a");
          INSERT INTO t1 (a) VALUES ("b");
          INSERT INTO t1 (a) VALUES ("c");
          SELECT * FROM t1 ORDER BY LOWER(a) DESC;
          +---+
          | a |
          +---+
          | a |
          | b |
          | c |
          +---+

          The order is not descending.

          bar Alexander Barkov added a comment - A similar script (with COLLATE moved to the table definition and with LOWER instead of COLLATE in ORDER BY): DROP TABLE IF EXISTS t1; CREATE TABLE t1 (a CHAR(5) NOT NULL) ENGINE=MEMORY CHARSET=utf8 COLLATE utf8_bin; INSERT INTO t1 (a) VALUES ("a"); INSERT INTO t1 (a) VALUES ("b"); INSERT INTO t1 (a) VALUES ("c"); SELECT * FROM t1 ORDER BY LOWER(a) DESC; +---+ | a | +---+ | a | | b | | c | +---+ The order is not descending.

          Pushed into 10.0

          bar Alexander Barkov added a comment - Pushed into 10.0
          pivanof Pavel Ivanov added a comment -

          I guess the patch in question is http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/4164.
          I'm very interested to know how replication safety of statements with LIMIT is related to this? Could you please explain?

          pivanof Pavel Ivanov added a comment - I guess the patch in question is http://bazaar.launchpad.net/~maria-captains/maria/10.0/revision/4164 . I'm very interested to know how replication safety of statements with LIMIT is related to this? Could you please explain?
          jeremycole Jeremy Cole added a comment -

          I'd guess the replication unsafety thing was because the "CREATE TABLE t1 ENGINE=MEMORY AS SELECT REPEAT('a',5) AS a LIMIT 0" being used in the test was being marked as unsafe for replication, despite having a LIMIT 0 making it deterministic. However, fixing that bug within this fix is kind of unexpected (and I would note that the new behavior is not being specifically tested here).

          jeremycole Jeremy Cole added a comment - I'd guess the replication unsafety thing was because the "CREATE TABLE t1 ENGINE=MEMORY AS SELECT REPEAT('a',5) AS a LIMIT 0" being used in the test was being marked as unsafe for replication, despite having a LIMIT 0 making it deterministic. However, fixing that bug within this fix is kind of unexpected (and I would note that the new behavior is not being specifically tested here).
          jeremycole Jeremy Cole added a comment -

          Bar: Could you comment as to why MY_CS_STRNXFRM is not needed for my_charset_utf8mb4_bin? And/or why the other changes to filesort were not necessary here? Trying to understand the differences.

          jeremycole Jeremy Cole added a comment - Bar: Could you comment as to why MY_CS_STRNXFRM is not needed for my_charset_utf8mb4_bin? And/or why the other changes to filesort were not necessary here? Trying to understand the differences.

          Jeremy, you're right.
          "CREATE TABLE t1 AS SELECT REPEAT('a',N) AS a LIMIT 0" has been always used
          in tests to create a VARCHAR(N) column using the current value of @@collation_connection.
          It worked fine with all tests not using binlog with statement format.

          A fix for MDEV-6170 added a new test into include/ctype_filesort.inc,
          to cover as many collations as possible (not only utf8_bin).
          As a result, the test binlog_stm_ctype_cp932 started to fail with the error
          "Unsafe statement written to the binary log using statement...".
          The easiest way was to make "SELECT .. LIMIT 0" deterministic,
          which was erroneously not. So the patch actually fixes two problems.

          The new behaviour IS tested in binlog_stm_ctype_cp932.

          By the way, the old behaviour of "LIMIT 0" was not tested,
          neither specifically, nor indirectly.

          bar Alexander Barkov added a comment - Jeremy, you're right. "CREATE TABLE t1 AS SELECT REPEAT('a',N) AS a LIMIT 0" has been always used in tests to create a VARCHAR(N) column using the current value of @@collation_connection. It worked fine with all tests not using binlog with statement format. A fix for MDEV-6170 added a new test into include/ctype_filesort.inc, to cover as many collations as possible (not only utf8_bin). As a result, the test binlog_stm_ctype_cp932 started to fail with the error "Unsafe statement written to the binary log using statement...". The easiest way was to make "SELECT .. LIMIT 0" deterministic, which was erroneously not. So the patch actually fixes two problems. The new behaviour IS tested in binlog_stm_ctype_cp932. By the way, the old behaviour of "LIMIT 0" was not tested, neither specifically, nor indirectly.

          Jeremy,

          MY_CS_STRNXFRM is needed for my_charset_utf8mb4_bin,
          and it is there in ctype-utf8.c, around 8158:

          struct charset_info_st my_charset_utf8mb4_bin=
          {
            46,0,0,             /* number       */
            MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_STRNXFRM|MY_CS_UNICODE|
            MY_CS_UNICODE_SUPPLEMENT, /* state  */

          MY_CS_STRNXFRM is needed for all collations (including utf8_bin and utf8mb4_bin)
          that use something more complex than just a prefix of the original string to create a binary sortable image.

          • utf8_bin transforms variable length 1-3 bytes sequences into fixed length 2 byte sequences.
          • utf8mb4_bin transforms variable length 1-4 bytes sequences into fixed length 3 byte sequences.

          These transformation are done for performance purposes: to reduce the length needed to sort data.

          bar Alexander Barkov added a comment - Jeremy, MY_CS_STRNXFRM is needed for my_charset_utf8mb4_bin, and it is there in ctype-utf8.c, around 8158: struct charset_info_st my_charset_utf8mb4_bin= { 46,0,0, /* number */ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_STRNXFRM|MY_CS_UNICODE| MY_CS_UNICODE_SUPPLEMENT, /* state */ MY_CS_STRNXFRM is needed for all collations (including utf8_bin and utf8mb4_bin) that use something more complex than just a prefix of the original string to create a binary sortable image. utf8_bin transforms variable length 1-3 bytes sequences into fixed length 2 byte sequences. utf8mb4_bin transforms variable length 1-4 bytes sequences into fixed length 3 byte sequences. These transformation are done for performance purposes: to reduce the length needed to sort data.
          jeremycole Jeremy Cole added a comment -

          Bar,

          Yes, I see that MY_CS_STRNXFRM is present already for utf8mb4, I was looking at a too-old version in my case.

          I do however disagree that the new behavior is specifically tested; it's indirectly/incidentally tested as part of the ctype test. It should get a new test, with a decent name, to test specifically that LIMIT 0 causes the warning to not be generated while e.g. LIMIT 1 does not. It doesn't matter what was there before, it matters what you're changing and that we're improving the situation. This should get a new, specific test.

          jeremycole Jeremy Cole added a comment - Bar, Yes, I see that MY_CS_STRNXFRM is present already for utf8mb4, I was looking at a too-old version in my case. I do however disagree that the new behavior is specifically tested; it's indirectly/incidentally tested as part of the ctype test. It should get a new test, with a decent name, to test specifically that LIMIT 0 causes the warning to not be generated while e.g. LIMIT 1 does not. It doesn't matter what was there before, it matters what you're changing and that we're improving the situation. This should get a new, specific test.

          People

            bar Alexander Barkov
            jeremycole Jeremy Cole
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.