[MDEV-30577] Case folding for uca1400 collations is not up to date Created: 2023-02-06  Updated: 2023-10-02  Resolved: 2023-04-18

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.10
Fix Version/s: 11.1.1, 10.11.3, 11.0.2, 10.10.4

Type: Bug Priority: Critical
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None

Attachments: File diff-500-1400.diff    
Issue Links:
Blocks
blocks MDEV-19123 Change default charset from latin1 to... Open
blocks MDEV-25829 Change default collation to utf8mb4_u... In Review
blocks MDEV-27490 Allow full utf8mb4 for identifiers Stalled
is blocked by MDEV-30692 conf_to_src is not up to date Closed
is blocked by MDEV-30695 Refactor case folding data types in A... Closed
is blocked by MDEV-30716 Wrong casefolding in xxx_unicode_520_... Closed
is blocked by MDEV-30746 Regression in ucs2_general_mysql500_ci Closed
is blocked by MDEV-31068 Reuse duplicate case conversion code ... Closed
is blocked by MDEV-31069 Reuse duplicate char-to-weight conver... Closed
is blocked by MDEV-31071 Refactor case folding data types in U... Closed
Relates
relates to MDEV-27009 Add UCA-14.0.0 collations Closed
relates to MDEV-30661 UPPER() returns an empty string for U... Closed

 Description   

UCA1400 collations (added by MDEV-27009) currently use Unicode-5.2.0 case folding tables.

They should use Unicode-14.0.0 tables instead.

The difference between the Unicode-5.2.0 and Unicode-14.0.0 data files (see attached diff-520-1400.diff) shows that a few hundred new case folding mapping pairs were added in these letter scripts:

Cyrillic, Georgian, Cherokee, Glagolitic, Coptic, Latin, Osage, Vithkuqi, Old Hungarian, Warang Citi, Medefaidrin, Adlam.

This SQL script demonstrates the outdated case folding:

CREATE OR REPLACE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8 COLLATE uca1400_ai_ci);
# Insert letters that first appeared in Unicode-6.1 (released in January 2012)
INSERT INTO t1 VALUES (_ucs2 0xA792) /* U+A792 LATIN CAPITAL LETTER C WITH BAR */;
INSERT INTO t1 VALUES (_ucs2 0xA793) /* U+A793 LATIN SMALL LETTER C WITH BAR */;
SELECT HEX(a), HEX(LOWER(a)), HEX(UPPER(a)), a, LOWER(a), UPPER(a) FROM t1;

+--------+---------------+---------------+------+----------+----------+
| HEX(a) | HEX(LOWER(a)) | HEX(UPPER(a)) | a    | LOWER(a) | UPPER(a) |
+--------+---------------+---------------+------+----------+----------+
| EA9E92 | EA9E92        | EA9E92        | Ꞓ    | Ꞓ        | Ꞓ        |
| EA9E93 | EA9E93        | EA9E93        | ꞓ    | ꞓ        | ꞓ        |
+--------+---------------+---------------+------+----------+----------+

The two characters above, which first appeared in Unicode-6.1, are expected to map to each other through the UPPER and LOWER functions: LOWER(U+A792) should return U+A793, and UPPER(U+A793) should return U+A792.
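
The expectation can be turned into a quick check against the t1 table from the script above. This is only a sketch: the column aliases lower_ok and upper_ok are illustrative names, and the hex literals are the UTF-8 encodings already shown in the result table. HEX values are compared rather than the strings themselves, because a comparison under a case-insensitive collation would not distinguish the two letters.

```sql
# Sketch of a check for the expected mapping, reusing t1 from the script above.
# lower_ok / upper_ok are hypothetical alias names, not part of the original report.
SELECT
  HEX(a) AS a_hex,
  HEX(LOWER(a)) = 'EA9E93' AS lower_ok,  # U+A793 LATIN SMALL LETTER C WITH BAR
  HEX(UPPER(a)) = 'EA9E92' AS upper_ok   # U+A792 LATIN CAPITAL LETTER C WITH BAR
FROM t1;
```

With the outdated tables, the output shown above implies lower_ok is 0 for the first row and upper_ok is 0 for the second. With Unicode-14.0.0 case folding in place, both columns would be expected to be 1 for both rows.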

