Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6170

Incorrect ordering with utf8_bin and utf8mb4_bin collations

Details

    • Bug
    • Status: Closed (View Workflow)
    • Major
    • Resolution: Fixed
    • 10.0.10
    • 10.0.11
    • None

    Description

      In MySQL 5.6 the utf8_bin and utf8mb4_bin collations were broken such that they ordered by only the high byte of the utf8 character (meaning ASCII sorts as though all characters were 0, resulting in random sort). This was fixed in the following commit (and maybe others):

      https://github.com/darnaut/mysql-server/commit/5fe5a5b

      This issue affects MariaDB 10 as well. See the attached simple test for this sorting bug.

      Attachments

        Activity

          jeremycole Jeremy Cole added a comment -

          I'd guess the replication unsafety thing was because the "CREATE TABLE t1 ENGINE=MEMORY AS SELECT REPEAT('a',5) AS a LIMIT 0" being used in the test was being marked as unsafe for replication, despite having a LIMIT 0 making it deterministic. However, fixing that bug within this fix is kind of unexpected (and I would note that the new behavior is not being specifically tested here).

          jeremycole Jeremy Cole added a comment - I'd guess the replication unsafety thing was because the "CREATE TABLE t1 ENGINE=MEMORY AS SELECT REPEAT('a',5) AS a LIMIT 0" being used in the test was being marked as unsafe for replication, despite having a LIMIT 0 making it deterministic. However, fixing that bug within this fix is kind of unexpected (and I would note that the new behavior is not being specifically tested here).
          jeremycole Jeremy Cole added a comment -

          Bar: Could you comment as to why MY_CS_STRNXFRM is not needed for my_charset_utf8mb4_bin? And/or why the other changes to filesort were not necessary here? Trying to understand the differences.

          jeremycole Jeremy Cole added a comment - Bar: Could you comment as to why MY_CS_STRNXFRM is not needed for my_charset_utf8mb4_bin? And/or why the other changes to filesort were not necessary here? Trying to understand the differences.

          Jeremy, you're right.
          "CREATE TABLE t1 AS SELECT REPEAT('a',N) AS a LIMIT 0" has been always used
          in tests to create a VARCHAR(N) column using the current value of @@collation_connection.
          It worked fine with all tests not using binlog with statement format.

          A fix for MDEV-6170 added a new test into include/ctype_filesort.inc,
          to cover as many collations as possible (not only utf8_bin).
          As a result, the test binlog_stm_ctype_cp932 started to fail with the error
          "Unsafe statement written to the binary log using statement...".
          The easiest way was to make "SELECT .. LIMIT 0" deterministic,
          which was erroneously not. So the patch actually fixes two problems.

          The new behaviour IS tested in binlog_stm_ctype_cp932.

          By the way, the old behaviour of "LIMIT 0" was not tested,
          neither specifically, nor indirectly.

          bar Alexander Barkov added a comment - Jeremy, you're right. "CREATE TABLE t1 AS SELECT REPEAT('a',N) AS a LIMIT 0" has been always used in tests to create a VARCHAR(N) column using the current value of @@collation_connection. It worked fine with all tests not using binlog with statement format. A fix for MDEV-6170 added a new test into include/ctype_filesort.inc, to cover as many collations as possible (not only utf8_bin). As a result, the test binlog_stm_ctype_cp932 started to fail with the error "Unsafe statement written to the binary log using statement...". The easiest way was to make "SELECT .. LIMIT 0" deterministic, which was erroneously not. So the patch actually fixes two problems. The new behaviour IS tested in binlog_stm_ctype_cp932. By the way, the old behaviour of "LIMIT 0" was not tested, neither specifically, nor indirectly.

          Jeremy,

          MY_CS_STRNXFRM is needed for my_charset_utf8mb4_bin,
          and it is there in ctype-utf8.c, around 8158:

          struct charset_info_st my_charset_utf8mb4_bin=
          {
            46,0,0,             /* number       */
            MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_STRNXFRM|MY_CS_UNICODE|
            MY_CS_UNICODE_SUPPLEMENT, /* state  */

          MY_CS_STRNXFRM is needed for all collations (including utf8_bin and utf8mb4_bin)
          that use something more complex than just a prefix of the original string to create a binary sortable image.

          • utf8_bin transforms variable length 1-3 bytes sequences into fixed length 2 byte sequences.
          • utf8mb4_bin transforms variable length 1-4 bytes sequences into fixed length 3 byte sequences.

          These transformation are done for performance purposes: to reduce the length needed to sort data.

          bar Alexander Barkov added a comment - Jeremy, MY_CS_STRNXFRM is needed for my_charset_utf8mb4_bin, and it is there in ctype-utf8.c, around 8158: struct charset_info_st my_charset_utf8mb4_bin= { 46,0,0, /* number */ MY_CS_COMPILED|MY_CS_BINSORT|MY_CS_STRNXFRM|MY_CS_UNICODE| MY_CS_UNICODE_SUPPLEMENT, /* state */ MY_CS_STRNXFRM is needed for all collations (including utf8_bin and utf8mb4_bin) that use something more complex than just a prefix of the original string to create a binary sortable image. utf8_bin transforms variable length 1-3 bytes sequences into fixed length 2 byte sequences. utf8mb4_bin transforms variable length 1-4 bytes sequences into fixed length 3 byte sequences. These transformation are done for performance purposes: to reduce the length needed to sort data.
          jeremycole Jeremy Cole added a comment -

          Bar,

          Yes, I see that MY_CS_STRNXFRM is present already for utf8mb4, I was looking at a too-old version in my case.

          I do however disagree that the new behavior is specifically tested; it's indirectly/incidentally tested as part of the ctype test. It should get a new test, with a decent name, to test specifically that LIMIT 0 causes the warning to not be generated while e.g. LIMIT 1 does not. It doesn't matter what was there before, it matters what you're changing and that we're improving the situation. This should get a new, specific test.

          jeremycole Jeremy Cole added a comment - Bar, Yes, I see that MY_CS_STRNXFRM is present already for utf8mb4, I was looking at a too-old version in my case. I do however disagree that the new behavior is specifically tested; it's indirectly/incidentally tested as part of the ctype test. It should get a new test, with a decent name, to test specifically that LIMIT 0 causes the warning to not be generated while e.g. LIMIT 1 does not. It doesn't matter what was there before, it matters what you're changing and that we're improving the situation. This should get a new, specific test.

          People

            bar Alexander Barkov
            jeremycole Jeremy Cole
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.