[MDEV-30556] UPPER() returns an empty string for U+0251 in Unicode-5.2.0+ collations for utf8 Created: 2023-02-03  Updated: 2023-05-17  Resolved: 2023-02-06

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.4
Fix Version/s: 10.11.3, 10.4.29, 10.5.20, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Critical
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-19123 Change default charset from latin1 to... Open
blocks MDEV-25829 Change default collation to utf8mb4_u... In Review
blocks MDEV-27490 Allow full utf8mb4 for identifiers Stalled
Relates
relates to MDEV-30661 UPPER() returns an empty string for U... Closed

 Description   

This script erroneously returns an empty string in the column UPPER(c) on the second row:

CREATE OR REPLACE TABLE t1
(
  c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci
);
INSERT INTO t1 VALUES
(_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
(_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;

+------+--------+----------+----------+
| c    | hex(c) | UPPER(c) | LOWER(c) |
+------+--------+----------+----------+
| Ɑ    | E2B1AD | Ɑ        | ɑ        |
| ɑ    | C991   |          | ɑ        |
+------+--------+----------+----------+

So does this:

CREATE OR REPLACE TABLE t1
(
  c VARCHAR(32) CHARACTER SET utf8mb3 COLLATE utf8mb3_unicode_520_ci
);
INSERT INTO t1 VALUES
(_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
(_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;

+------+--------+----------+----------+
| c    | hex(c) | UPPER(c) | LOWER(c) |
+------+--------+----------+----------+
| Ɑ    | E2B1AD | Ɑ        | ɑ        |
| ɑ    | C991   |          | ɑ        |
+------+--------+----------+----------+

So does this:

CREATE OR REPLACE TABLE t1
(
  c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE uca1400_ai_ci
);
INSERT INTO t1 VALUES
(_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
(_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;

+------+--------+----------+----------+
| c    | hex(c) | UPPER(c) | LOWER(c) |
+------+--------+----------+----------+
| Ɑ    | E2B1AD | Ɑ        | ɑ        |
| ɑ    | C991   |          | ɑ        |
+------+--------+----------+----------+

So does this:

CREATE OR REPLACE TABLE t1
(
  c VARCHAR(32) CHARACTER SET utf8mb3 COLLATE uca1400_ai_ci
);
INSERT INTO t1 VALUES
(_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
(_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;

+------+--------+----------+----------+
| c    | hex(c) | UPPER(c) | LOWER(c) |
+------+--------+----------+----------+
| Ɑ    | E2B1AD | Ɑ        | ɑ        |
| ɑ    | C991   |          | ɑ        |
+------+--------+----------+----------+

With utf16 collations it works fine. For example:

CREATE OR REPLACE TABLE t1
(
  c VARCHAR(32) CHARACTER SET utf16 COLLATE utf16_unicode_520_ci
);
INSERT INTO t1 VALUES
(_ucs2 0x2C6D /* U+0251 LATIN SMALL LETTER ALPHA */),
(_ucs2 0x0251 /* U+2C6D LATIN CAPITAL LETTER ALPHA */);
SELECT c, hex(c), UPPER(c), LOWER(c) FROM t1;

+------+--------+----------+----------+
| c    | hex(c) | UPPER(c) | LOWER(c) |
+------+--------+----------+----------+
| Ɑ    | 2C6D   | Ɑ        | ɑ        |
| ɑ    | 0251   | Ɑ        | ɑ        |
+------+--------+----------+----------+



 Comments   
Comment by Alexander Barkov [ 2023-02-03 ]

Affected characters

This script returns the full list of problematic characters:

CREATE OR REPLACE TABLE bad_case_folding
(
  code INT NOT NULL,
  c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL
);
 
DELIMITER $$
FOR code IN 0..0x10FFFF
DO
  BEGIN
    DECLARE str TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci DEFAULT CHAR(code USING utf32);
    IF LENGTH(LOWER(str))=0 OR LENGTH(UPPER(str))=0 THEN
      INSERT INTO bad_case_folding VALUES (code, str);
    END IF;
  END;
END FOR;
$$
DELIMITER ;
 
SELECT HEX(code), HEX(LOWER(c)), HEX(UPPER(c)), c FROM bad_case_folding;

+-----------+---------------+---------------+----+
| HEX(code) | HEX(LOWER(c)) | HEX(UPPER(c)) | c  |
+-----------+---------------+---------------+----+
| 23A       |               | C8BA          | Ⱥ  |
| 23E       |               | C8BE          | Ⱦ  |
| 23F       | C8BF          |               | ȿ  |
| 240       | C980          |               | ɀ  |
| 250       | C990          |               | ɐ  |
| 251       | C991          |               | ɑ  |
| 252       | C992          |               | ɒ  |
| 26B       | C9AB          |               | ɫ  |
| 271       | C9B1          |               | ɱ  |
| 27D       | C9BD          |               | ɽ  |
+-----------+---------------+---------------+----+

Comment by Alexander Barkov [ 2023-02-03 ]

As the above script takes a few seconds, here's a shorter script proposed for an MTR test:

CREATE OR REPLACE TABLE bad_case_folding
(
  code INT NOT NULL,
  c VARCHAR(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci NOT NULL DEFAULT ''
);
INSERT INTO bad_case_folding (code) VALUES (0x23A),(0x23E),(0x23F),(0x240),(0x250),(0x251),(0x252),(0x26B),(0x271),(0x27D);
UPDATE bad_case_folding SET c=CHAR(code USING utf32);
SELECT HEX(code), HEX(LOWER(c)), HEX(UPPER(c)), c FROM bad_case_folding ORDER BY code;

+-----------+---------------+---------------+----+
| HEX(code) | HEX(LOWER(c)) | HEX(UPPER(c)) | c  |
+-----------+---------------+---------------+----+
| 23A       |               | C8BA          | Ⱥ  |
| 23E       |               | C8BE          | Ⱦ  |
| 23F       | C8BF          |               | ȿ  |
| 240       | C980          |               | ɀ  |
| 250       | C990          |               | ɐ  |
| 251       | C991          |               | ɑ  |
| 252       | C992          |               | ɒ  |
| 26B       | C9AB          |               | ɫ  |
| 271       | C9B1          |               | ɱ  |
| 27D       | C9BD          |               | ɽ  |
+-----------+---------------+---------------+----+

Comment by Rick James [ 2023-05-17 ]

8.0.33 has the same failure on upper/lower casing those letters.

Generated at Thu Feb 08 10:17:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.