Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30556

UPPER() returns an empty string for U+0251 in Unicode-5.2.0+ collations for utf8

Details

    Description

      This script erroneously returns an empty string in the column UPPER(c) on the second row:

      CREATE OR REPLACE TABLE t1
      (
        c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
      );
      INSERT INTO t1 VALUES
      (_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
      (_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
      SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;
      

      +------+--------+----------+----------+
      | c    | hex(c) | UPPER(c) | LOWER(c) |
      +------+--------+----------+----------+
      | Ɑ    | E2B1AD | Ɑ        | ɑ        |
      | ɑ    | C991   |          | ɑ        |
      +------+--------+----------+----------+
      

      So does this:

      CREATE OR REPLACE TABLE t1
      (
        c VARCHAR(32) CHARACTER SET utf8mb3 COLLATE utf8mb3_unicode_520_ci
      );
      INSERT INTO t1 VALUES
      (_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
      (_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
      SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;
      

      +------+--------+----------+----------+
      | c    | hex(c) | UPPER(c) | LOWER(c) |
      +------+--------+----------+----------+
      | Ɑ    | E2B1AD | Ɑ        | ɑ        |
      | ɑ    | C991   |          | ɑ        |
      +------+--------+----------+----------+
      

      So does this:

      CREATE OR REPLACE TABLE t1
      (
        c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE uca1400_ai_ci
      );
      INSERT INTO t1 VALUES
      (_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
      (_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
      SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;
      

      +------+--------+----------+----------+
      | c    | hex(c) | UPPER(c) | LOWER(c) |
      +------+--------+----------+----------+
      | Ɑ    | E2B1AD | Ɑ        | ɑ        |
      | ɑ    | C991   |          | ɑ        |
      +------+--------+----------+----------+
      

      So does this:

      CREATE OR REPLACE TABLE t1
      (
        c VARCHAR(32) CHARACTER SET utf8mb3 COLLATE uca1400_ai_ci
      );
      INSERT INTO t1 VALUES
      (_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
      (_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
      SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;
      

      +------+--------+----------+----------+
      | c    | hex(c) | UPPER(c) | LOWER(c) |
      +------+--------+----------+----------+
      | Ɑ    | E2B1AD | Ɑ        | ɑ        |
      | ɑ    | C991   |          | ɑ        |
      +------+--------+----------+----------+
      

      With utf16 collations it works fine. For example:

      CREATE OR REPLACE TABLE t1
      (
        c VARCHAR(32) CHARACTER SET utf16 COLLATE utf16_unicode_520_ci
      );
      INSERT INTO t1 VALUES
      (_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
      (_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
      SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;
      

      +------+--------+----------+----------+
      | c    | hex(c) | UPPER(c) | LOWER(c) |
      +------+--------+----------+----------+
      | Ɑ    | 2C6D   | Ɑ        | ɑ        |
      | ɑ    | 0251   | Ɑ        | ɑ        |
      +------+--------+----------+----------+
      

      Attachments

        Issue Links

          Activity

            Affected characters

            This script returns the full list of problematic characters:

            CREATE OR REPLACE TABLE bad_case_folding
            (
              code INT NOT NULL,
              c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL
            );
             
            DELIMITER $$
            FOR code IN 0..0x10FFFF
            DO
              BEGIN
                DECLARE str TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci DEFAULT CHAR(code USING utf32);
                IF LENGTH(LOWER(str))=0 OR LENGTH(UPPER(str))=0 THEN
                  INSERT INTO bad_case_folding VALUES (code, str);
                END IF;
              END;
            END FOR;
            $$
            DELIMITER ;
             
            SELECT HEX(code), HEX(LOWER(c)), HEX(UPPER(c)), c FROM bad_case_folding;
            

            +-----------+---------------+---------------+----+
            | HEX(code) | HEX(LOWER(c)) | HEX(UPPER(c)) | c  |
            +-----------+---------------+---------------+----+
            | 23A       |               | C8BA          | Ⱥ  |
            | 23E       |               | C8BE          | Ⱦ  |
            | 23F       | C8BF          |               | ȿ  |
            | 240       | C980          |               | ɀ  |
            | 250       | C990          |               | ɐ  |
            | 251       | C991          |               | ɑ  |
            | 252       | C992          |               | ɒ  |
            | 26B       | C9AB          |               | ɫ  |
            | 271       | C9B1          |               | ɱ  |
            | 27D       | C9BD          |               | ɽ  |
            +-----------+---------------+---------------+----+
            

            bar Alexander Barkov added a comment - Affected characters This script returns the full list of problematic characters: CREATE OR REPLACE TABLE bad_case_folding ( code INT NOT NULL , c VARCHAR (32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL );   DELIMITER $$ FOR code IN 0..0x10FFFF DO BEGIN DECLARE str TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci DEFAULT CHAR (code USING utf32); IF LENGTH( LOWER (str))=0 OR LENGTH( UPPER (str))=0 THEN INSERT INTO bad_case_folding VALUES (code, str); END IF ; END ; END FOR ; $$ DELIMITER ;   SELECT HEX(code), HEX( LOWER (c)), HEX( UPPER (c)), c FROM bad_case_folding; +-----------+---------------+---------------+----+ | HEX(code) | HEX(LOWER(c)) | HEX(UPPER(c)) | c | +-----------+---------------+---------------+----+ | 23A | | C8BA | Ⱥ | | 23E | | C8BE | Ⱦ | | 23F | C8BF | | ȿ | | 240 | C980 | | ɀ | | 250 | C990 | | ɐ | | 251 | C991 | | ɑ | | 252 | C992 | | ɒ | | 26B | C9AB | | ɫ | | 271 | C9B1 | | ɱ | | 27D | C9BD | | ɽ | +-----------+---------------+---------------+----+
            bar Alexander Barkov added a comment - - edited

            As the above script takes a few seconds, here's a shorter script proposed for an MTR test:

            CREATE OR REPLACE TABLE bad_case_folding
            (
              code INT NOT NULL,
              c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL DEFAULT ''
            );
            INSERT INTO bad_case_folding (code) VALUES (0x23A),(0x23E),(0x23F),(0x240),(0x250),(0x251),(0x252),(0x26B),(0x271),(0x27D);
            UPDATE bad_case_folding SET c=CHAR(code USING utf32);
            SELECT HEX(code), HEX(LOWER(c)), HEX(UPPER(c)), c FROM bad_case_folding ORDER BY code;
            

            +-----------+---------------+---------------+----+
            | HEX(code) | HEX(LOWER(c)) | HEX(UPPER(c)) | c  |
            +-----------+---------------+---------------+----+
            | 23A       |               | C8BA          | Ⱥ  |
            | 23E       |               | C8BE          | Ⱦ  |
            | 23F       | C8BF          |               | ȿ  |
            | 240       | C980          |               | ɀ  |
            | 250       | C990          |               | ɐ  |
            | 251       | C991          |               | ɑ  |
            | 252       | C992          |               | ɒ  |
            | 26B       | C9AB          |               | ɫ  |
            | 271       | C9B1          |               | ɱ  |
            | 27D       | C9BD          |               | ɽ  |
            +-----------+---------------+---------------+----+
            

            bar Alexander Barkov added a comment - - edited As the above script takes a few seconds, here's a shorter script proposed for an MTR test: CREATE OR REPLACE TABLE bad_case_folding ( code INT NOT NULL , c VARCHAR (32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL DEFAULT '' ); INSERT INTO bad_case_folding (code) VALUES (0x23A),(0x23E),(0x23F),(0x240),(0x250),(0x251),(0x252),(0x26B),(0x271),(0x27D); UPDATE bad_case_folding SET c= CHAR (code USING utf32); SELECT HEX(code), HEX( LOWER (c)), HEX( UPPER (c)), c FROM bad_case_folding ORDER BY code; +-----------+---------------+---------------+----+ | HEX(code) | HEX(LOWER(c)) | HEX(UPPER(c)) | c | +-----------+---------------+---------------+----+ | 23A | | C8BA | Ⱥ | | 23E | | C8BE | Ⱦ | | 23F | C8BF | | ȿ | | 240 | C980 | | ɀ | | 250 | C990 | | ɐ | | 251 | C991 | | ɑ | | 252 | C992 | | ɒ | | 26B | C9AB | | ɫ | | 271 | C9B1 | | ɱ | | 27D | C9BD | | ɽ | +-----------+---------------+---------------+----+
            rjasdfiii Rick James added a comment -

            8.0.33 has the same failure on upper/lower casing those letters.

            rjasdfiii Rick James added a comment - 8.0.33 has the same failure on upper/lower casing those letters.

            People

              bar Alexander Barkov
              bar Alexander Barkov
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.