Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- 10.10(EOL)
- None
Description
UCA1400 collations (added by MDEV-27009) currently use Unicode-5.2.0 case folding tables.
They should use Unicode-14.0.0 tables instead.
The difference (see attached diff-520-1400.diff) between these two files:
- https://www.unicode.org/Public/5.2.0/ucd/CaseFolding.txt
- https://www.unicode.org/Public/14.0.0/ucd/CaseFolding.txt
shows that a few hundred new case folding mapping pairs were added in these scripts:
Cyrillic, Georgian, Cherokee, Glagolitic, Coptic, Latin, Osage, Vithkuqi, Old Hungarian, Warang Citi, Medefaidrin, Adlam.
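The comparison between the two CaseFolding.txt files can be approximated with a short script. The following is a minimal sketch, not the method used to produce the attached diff: the function names `parse_simple_folding` and `added_pairs` are hypothetical, and the two inline samples stand in for the real files, which would have to be downloaded from the URLs above. It assumes the standard `code; status; mapping; # name` line layout and keeps only the simple folding entries (status `C` and `S`).

```python
def parse_simple_folding(text):
    """Return {code point: folded code point} for status C and S lines."""
    pairs = {}
    for line in text.splitlines():
        # Drop the trailing "# NAME" comment and skip blank/comment-only lines.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        # C = common, S = simple; F (full) and T (Turkic) are ignored here.
        if status in ("C", "S"):
            pairs[int(code, 16)] = int(mapping, 16)
    return pairs


def added_pairs(old_text, new_text):
    """Return the simple folding pairs present in new_text but not in old_text."""
    old = parse_simple_folding(old_text)
    new = parse_simple_folding(new_text)
    return {cp: new[cp] for cp in sorted(new) if cp not in old}


# Tiny stand-ins for the 5.2.0 and 14.0.0 files (illustrative only).
OLD_SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
"""
NEW_SAMPLE = OLD_SAMPLE + "A792; C; A793; # LATIN CAPITAL LETTER C WITH BAR\n"

for cp, folded in added_pairs(OLD_SAMPLE, NEW_SAMPLE).items():
    print(f"U+{cp:04X} -> U+{folded:04X}")  # prints "U+A792 -> U+A793"
```

Run against the full contents of the two real files, the same functions would list every simple folding pair that is new in 14.0.0.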
This SQL script demonstrates the outdated case folding:
CREATE OR REPLACE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8 COLLATE uca1400_ai_ci);
# Insert letters that first appeared in Unicode-6.1 (released in January 2012)
INSERT INTO t1 VALUES (_ucs2 0xA792) /* U+A792 LATIN CAPITAL LETTER C WITH BAR */;
INSERT INTO t1 VALUES (_ucs2 0xA793) /* U+A793 LATIN SMALL LETTER C WITH BAR */;
SELECT HEX(a), HEX(LOWER(a)), HEX(UPPER(a)), a, LOWER(a), UPPER(a) FROM t1;
+--------+---------------+---------------+------+----------+----------+
| HEX(a) | HEX(LOWER(a)) | HEX(UPPER(a)) | a    | LOWER(a) | UPPER(a) |
+--------+---------------+---------------+------+----------+----------+
| EA9E92 | EA9E92        | EA9E92        | Ꞓ    | Ꞓ        | Ꞓ        |
| EA9E93 | EA9E93        | EA9E93        | ꞓ    | ꞓ        | ꞓ        |
+--------+---------------+---------------+------+----------+----------+
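These two characters, which first appeared in Unicode-6.1, are expected to map to each other: LOWER should convert U+A792 to U+A793, and UPPER should convert U+A793 to U+A792. Instead, both functions return their input unchanged.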
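As an independent cross-check of the expected mapping, the same pair can be looked up outside the server. This is only an illustrative sketch: it relies on the assumption that the Python interpreter in use bundles a Unicode database of version 6.1 or later (true for Python 3.3 and newer), and it says nothing about how MariaDB implements the conversion.

```python
import unicodedata

capital = "\uA792"  # U+A792 LATIN CAPITAL LETTER C WITH BAR
small = "\uA793"    # U+A793 LATIN SMALL LETTER C WITH BAR


def utf8_hex(s):
    """Return the UTF-8 bytes of s as upper-case hex, like SQL HEX()."""
    return s.encode("utf-8").hex().upper()


print(unicodedata.name(capital))   # LATIN CAPITAL LETTER C WITH BAR
print(utf8_hex(capital.lower()))   # EA9E93
print(utf8_hex(small.upper()))     # EA9E92
```

With up-to-date tables, the HEX(LOWER(a)) column for the first row and the HEX(UPPER(a)) column for the second row would be expected to show these two values rather than the unchanged input.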
Attachments
Issue Links
- blocks
  - MDEV-19123 Change default charset from latin1 to utf8mb4 (Closed)
  - MDEV-25829 Change default Unicode collation to uca1400_ai_ci (Closed)
  - MDEV-27490 Allow full utf8mb4 for identifiers (Stalled)
- is blocked by
  - MDEV-30692 conf_to_src is not up to date (Closed)
  - MDEV-30695 Refactor case folding data types in Asian collation (Closed)
  - MDEV-30716 Wrong casefolding in xxx_unicode_520_ci for U+0700..U+07FF (Closed)
  - MDEV-30746 Regression in ucs2_general_mysql500_ci (Closed)
  - MDEV-31068 Reuse duplicate case conversion code in ctype-utf8.c and ctype-ucs2.c (Closed)
  - MDEV-31069 Reuse duplicate char-to-weight conversion code in ctype-utf8.c and ctype-ucs2.c (Closed)
  - MDEV-31071 Refactor case folding data types in Unicode collations (Closed)
- relates to
  - MDEV-27009 Add UCA-14.0.0 collations (Closed)
  - MDEV-30661 UPPER() returns an empty string for U+0251 in uca1400 collations for utf8 (Closed)