Details
-
Task
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
-
None
Description
There are two parallel comparison systems in MariaDB collation library, implemented as virtual functions in MY_COLLATION_HANDLER:
- Comparison according to the collation, provided by these functions
int (*strnncoll)(CHARSET_INFO *,
const uchar *, size_t, const uchar *, size_t, my_bool);
int (*strnncollsp)(CHARSET_INFO *,
const uchar *, size_t, const uchar *, size_t);
int (*strnncollsp_nchars)(CHARSET_INFO *,
const uchar *str1, size_t len1,
const uchar *str2, size_t len2,
size_t nchars,
uint flags);
- Comparison in case insensitive (but accent sensitive) style, implemented by this function:
int (*strcasecmp)(CHARSET_INFO *, const char *, const char *);
Note, accent and case sensitivity of the collation does not matter. strcasecmp() always works using accent sensitive case insensitive comparison style.
These two parallel systems are redundant.
Note, strcasecmp() is used mostly to compare identifiers, while the functions of the first group are used to compare data.
Let's get rid of the second comparison system:
1. Remove MY_COLLATION_HANDLER::strcasecmp()
2. Introduce a new collation utf8mb4_general1400_as_ci. Note, it should work for the entire Unicode range U+0000..U+10FFFF.
3. Replace all calls for:
system_charset_info->coll->strcasecmp()
|
to calls for
my_charset_utf8mb4_general1400_as_ci->coll->strnncoll***()
|
The change would generally be quite mechanic. However, there is one small problem: strcasecmp() accepts 0-terminated strings, while the strnncoll-alike functions accept the pointer and the length. So some refactoring will be needed. Note, Monty earlier changed many MariaDB C data types to use LEX_CSTRING (instead of just a const char pointer) to store names. So this part of the current task will switch some more C data types to LEX_CSTRING.
tolower vs toupper comparison
Another option is to implement utf8mb3_general1400_as_ci which will compare upper cases (instead of lower cases).
The difference is in a few dozen BMP characters. This script finds all those characters:
CREATE OR REPLACE TABLE t1 (a CHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci) ENGINE=MyISAM; |
DELIMITER $$
|
FOR i IN 0..0xFFFF |
DO
|
INSERT INTO t1 VALUES (CHAR(i USING ucs2)); |
END FOR; |
$$
|
DELIMITER ;
|
ALTER TABLE t1 |
ADD has_casefolding INT DEFAULT (BINARY LOWER(a)<>a OR BINARY UPPER(a)<>a), |
ADD KEY(has_casefolding, a); |
|
CREATE OR REPLACE TABLE t21 AS |
SELECT
|
HEX(t1.a) AS hex_a, HEX(t2.a) AS hex_b, |
BINARY LOWER(t1.a)=LOWER(t2.a) eq_lower, |
BINARY UPPER(t1.a)=UPPER(t2.a) AS eq_upper, |
t1.a AS a, t2.a AS b |
FROM
|
t1 t1, t1 t2
|
WHERE
|
t1.has_casefolding=1
|
AND (BINARY LOWER(t1.a)=LOWER(t2.a))<>(BINARY UPPER(t1.a)=UPPER(t2.a)); |
|
SELECT
|
HEX(CONVERT(a USING ucs2)) AS unicode_a, |
HEX(CONVERT(b USING ucs2)) AS unicode_b, |
t21.* FROM t21; |
+-----------+-----------+--------+--------+----------+----------+------+------+
|
| unicode_a | unicode_b | hex_a | hex_b | eq_lower | eq_upper | a | b |
|
+-----------+-----------+--------+--------+----------+----------+------+------+
|
| 1E9E | 00DF | E1BA9E | C39F | 1 | 0 | ẞ | ß |
|
| 0399 | 0345 | CE99 | CD85 | 0 | 1 | Ι | ͅ |
|
| 03B9 | 0345 | CEB9 | CD85 | 0 | 1 | ι | ͅ |
|
| 1FBE | 0345 | E1BEBE | CD85 | 0 | 1 | ι | ͅ |
|
| 212B | 00C5 | E284AB | C385 | 1 | 0 | Å | Å |
|
| 212B | 00E5 | E284AB | C3A5 | 1 | 0 | Å | å |
|
| 00C5 | 212B | C385 | E284AB | 1 | 0 | Å | Å |
|
| 00E5 | 212B | C3A5 | E284AB | 1 | 0 | å | Å |
|
| 0130 | 0049 | C4B0 | 49 | 1 | 0 | İ | I |
|
| 0131 | 0049 | C4B1 | 49 | 0 | 1 | ı | I |
|
| 0130 | 0069 | C4B0 | 69 | 1 | 0 | İ | i |
|
| 0131 | 0069 | C4B1 | 69 | 0 | 1 | ı | i |
|
| 0049 | 0130 | 49 | C4B0 | 1 | 0 | I | İ |
|
| 0069 | 0130 | 69 | C4B0 | 1 | 0 | i | İ |
|
| 0049 | 0131 | 49 | C4B1 | 0 | 1 | I | ı |
|
| 0069 | 0131 | 69 | C4B1 | 0 | 1 | i | ı |
|
| 212A | 004B | E284AA | 4B | 1 | 0 | K | K |
|
| 212A | 006B | E284AA | 6B | 1 | 0 | K | k |
|
| 004B | 212A | 4B | E284AA | 1 | 0 | K | K |
|
| 006B | 212A | 6B | E284AA | 1 | 0 | k | K |
|
| 017F | 0053 | C5BF | 53 | 0 | 1 | ſ | S |
|
| 017F | 0073 | C5BF | 73 | 0 | 1 | ſ | s |
|
| 0053 | 017F | 53 | C5BF | 0 | 1 | S | ſ |
|
| 0073 | 017F | 73 | C5BF | 0 | 1 | s | ſ |
|
| 1E9B | 1E60 | E1BA9B | E1B9A0 | 0 | 1 | ẛ | Ṡ |
|
| 1E9B | 1E61 | E1BA9B | E1B9A1 | 0 | 1 | ẛ | ṡ |
|
| 1E60 | 1E9B | E1B9A0 | E1BA9B | 0 | 1 | Ṡ | ẛ |
|
| 1E61 | 1E9B | E1B9A1 | E1BA9B | 0 | 1 | ṡ | ẛ |
|
| 03D0 | 0392 | CF90 | CE92 | 0 | 1 | ϐ | Β |
|
| 03D0 | 03B2 | CF90 | CEB2 | 0 | 1 | ϐ | β |
|
| 0392 | 03D0 | CE92 | CF90 | 0 | 1 | Β | ϐ |
|
| 03B2 | 03D0 | CEB2 | CF90 | 0 | 1 | β | ϐ |
|
| 03F5 | 0395 | CFB5 | CE95 | 0 | 1 | ϵ | Ε |
|
| 03F5 | 03B5 | CFB5 | CEB5 | 0 | 1 | ϵ | ε |
|
| 0395 | 03F5 | CE95 | CFB5 | 0 | 1 | Ε | ϵ |
|
| 03B5 | 03F5 | CEB5 | CFB5 | 0 | 1 | ε | ϵ |
|
| 03D1 | 0398 | CF91 | CE98 | 0 | 1 | ϑ | Θ |
|
| 03F4 | 0398 | CFB4 | CE98 | 1 | 0 | ϴ | Θ |
|
| 03D1 | 03B8 | CF91 | CEB8 | 0 | 1 | ϑ | θ |
|
| 03F4 | 03B8 | CFB4 | CEB8 | 1 | 0 | ϴ | θ |
|
| 0398 | 03D1 | CE98 | CF91 | 0 | 1 | Θ | ϑ |
|
| 03B8 | 03D1 | CEB8 | CF91 | 0 | 1 | θ | ϑ |
|
| 0398 | 03F4 | CE98 | CFB4 | 1 | 0 | Θ | ϴ |
|
| 03B8 | 03F4 | CEB8 | CFB4 | 1 | 0 | θ | ϴ |
|
| 0345 | 0399 | CD85 | CE99 | 0 | 1 | ͅ | Ι |
|
| 1FBE | 0399 | E1BEBE | CE99 | 0 | 1 | ι | Ι |
|
| 0345 | 03B9 | CD85 | CEB9 | 0 | 1 | ͅ | ι |
|
| 1FBE | 03B9 | E1BEBE | CEB9 | 0 | 1 | ι | ι |
|
| 0345 | 1FBE | CD85 | E1BEBE | 0 | 1 | ͅ | ι |
|
| 0399 | 1FBE | CE99 | E1BEBE | 0 | 1 | Ι | ι |
|
| 03B9 | 1FBE | CEB9 | E1BEBE | 0 | 1 | ι | ι |
|
| 03F0 | 039A | CFB0 | CE9A | 0 | 1 | ϰ | Κ |
|
| 03F0 | 03BA | CFB0 | CEBA | 0 | 1 | ϰ | κ |
|
| 039A | 03F0 | CE9A | CFB0 | 0 | 1 | Κ | ϰ |
|
| 03BA | 03F0 | CEBA | CFB0 | 0 | 1 | κ | ϰ |
|
| 039C | 00B5 | CE9C | C2B5 | 0 | 1 | Μ | µ |
|
| 03BC | 00B5 | CEBC | C2B5 | 0 | 1 | μ | µ |
|
| 00B5 | 039C | C2B5 | CE9C | 0 | 1 | µ | Μ |
|
| 00B5 | 03BC | C2B5 | CEBC | 0 | 1 | µ | μ |
|
| 03D6 | 03A0 | CF96 | CEA0 | 0 | 1 | ϖ | Π |
|
| 03D6 | 03C0 | CF96 | CF80 | 0 | 1 | ϖ | π |
|
| 03A0 | 03D6 | CEA0 | CF96 | 0 | 1 | Π | ϖ |
|
| 03C0 | 03D6 | CF80 | CF96 | 0 | 1 | π | ϖ |
|
| 03F1 | 03A1 | CFB1 | CEA1 | 0 | 1 | ϱ | Ρ |
|
| 03F1 | 03C1 | CFB1 | CF81 | 0 | 1 | ϱ | ρ |
|
| 03A1 | 03F1 | CEA1 | CFB1 | 0 | 1 | Ρ | ϱ |
|
| 03C1 | 03F1 | CF81 | CFB1 | 0 | 1 | ρ | ϱ |
|
| 03C2 | 03A3 | CF82 | CEA3 | 0 | 1 | ς | Σ |
|
| 03A3 | 03C2 | CEA3 | CF82 | 0 | 1 | Σ | ς |
|
| 03C3 | 03C2 | CF83 | CF82 | 0 | 1 | σ | ς |
|
| 03C2 | 03C3 | CF82 | CF83 | 0 | 1 | ς | σ |
|
| 03D5 | 03A6 | CF95 | CEA6 | 0 | 1 | ϕ | Φ |
|
| 03D5 | 03C6 | CF95 | CF86 | 0 | 1 | ϕ | φ |
|
| 03A6 | 03D5 | CEA6 | CF95 | 0 | 1 | Φ | ϕ |
|
| 03C6 | 03D5 | CF86 | CF95 | 0 | 1 | φ | ϕ |
|
| 2126 | 03A9 | E284A6 | CEA9 | 1 | 0 | Ω | Ω |
|
| 2126 | 03C9 | E284A6 | CF89 | 1 | 0 | Ω | ω |
|
| 03A9 | 2126 | CEA9 | E284A6 | 1 | 0 | Ω | Ω |
|
| 03C9 | 2126 | CF89 | E284A6 | 1 | 0 | ω | Ω |
|
| 1C80 | 0412 | E1B280 | D092 | 0 | 1 | ᲀ | В |
|
| 1C80 | 0432 | E1B280 | D0B2 | 0 | 1 | ᲀ | в |
|
| 0412 | 1C80 | D092 | E1B280 | 0 | 1 | В | ᲀ |
|
| 0432 | 1C80 | D0B2 | E1B280 | 0 | 1 | в | ᲀ |
|
| 1C81 | 0414 | E1B281 | D094 | 0 | 1 | ᲁ | Д |
|
| 1C81 | 0434 | E1B281 | D0B4 | 0 | 1 | ᲁ | д |
|
| 0414 | 1C81 | D094 | E1B281 | 0 | 1 | Д | ᲁ |
|
| 0434 | 1C81 | D0B4 | E1B281 | 0 | 1 | д | ᲁ |
|
| 1C82 | 041E | E1B282 | D09E | 0 | 1 | ᲂ | О |
|
| 1C82 | 043E | E1B282 | D0BE | 0 | 1 | ᲂ | о |
|
| 041E | 1C82 | D09E | E1B282 | 0 | 1 | О | ᲂ |
|
| 043E | 1C82 | D0BE | E1B282 | 0 | 1 | о | ᲂ |
|
| 1C83 | 0421 | E1B283 | D0A1 | 0 | 1 | ᲃ | С |
|
| 1C83 | 0441 | E1B283 | D181 | 0 | 1 | ᲃ | с |
|
| 0421 | 1C83 | D0A1 | E1B283 | 0 | 1 | С | ᲃ |
|
| 0441 | 1C83 | D181 | E1B283 | 0 | 1 | с | ᲃ |
|
| 1C84 | 0422 | E1B284 | D0A2 | 0 | 1 | ᲄ | Т |
|
| 1C85 | 0422 | E1B285 | D0A2 | 0 | 1 | ᲅ | Т |
|
| 1C84 | 0442 | E1B284 | D182 | 0 | 1 | ᲄ | т |
|
| 1C85 | 0442 | E1B285 | D182 | 0 | 1 | ᲅ | т |
|
| 0422 | 1C84 | D0A2 | E1B284 | 0 | 1 | Т | ᲄ |
|
| 0442 | 1C84 | D182 | E1B284 | 0 | 1 | т | ᲄ |
|
| 1C85 | 1C84 | E1B285 | E1B284 | 0 | 1 | ᲅ | ᲄ |
|
| 0422 | 1C85 | D0A2 | E1B285 | 0 | 1 | Т | ᲅ |
|
| 0442 | 1C85 | D182 | E1B285 | 0 | 1 | т | ᲅ |
|
| 1C84 | 1C85 | E1B284 | E1B285 | 0 | 1 | ᲄ | ᲅ |
|
| A64A | 1C88 | EA998A | E1B288 | 0 | 1 | Ꙋ | ᲈ |
|
| A64B | 1C88 | EA998B | E1B288 | 0 | 1 | ꙋ | ᲈ |
|
| 1C88 | A64A | E1B288 | EA998A | 0 | 1 | ᲈ | Ꙋ |
|
| 1C88 | A64B | E1B288 | EA998B | 0 | 1 | ᲈ | ꙋ |
|
| 1C86 | 042A | E1B286 | D0AA | 0 | 1 | ᲆ | Ъ |
|
| 1C86 | 044A | E1B286 | D18A | 0 | 1 | ᲆ | ъ |
|
| 042A | 1C86 | D0AA | E1B286 | 0 | 1 | Ъ | ᲆ |
|
| 044A | 1C86 | D18A | E1B286 | 0 | 1 | ъ | ᲆ |
|
| 1C87 | 0462 | E1B287 | D1A2 | 0 | 1 | ᲇ | Ѣ |
|
| 1C87 | 0463 | E1B287 | D1A3 | 0 | 1 | ᲇ | ѣ |
|
| 0462 | 1C87 | D1A2 | E1B287 | 0 | 1 | Ѣ | ᲇ |
|
| 0463 | 1C87 | D1A3 | E1B287 | 0 | 1 | ѣ | ᲇ |
|
+-----------+-----------+--------+--------+----------+----------+------+------+
|
Note, toupper comparison considers more pairs as equal than tolower comparison:
SELECT SUM(eq_lower), SUM(eq_upper) FROM t21; |
+---------------+---------------+
|
| SUM(eq_lower) | SUM(eq_upper) |
|
+---------------+---------------+
|
| 21 | 96 |
|
+---------------+---------------+
|
A more compact table with distinct pairs (the character with a smaller code point is on the left)
SELECT
|
HEX(CONVERT(a USING ucs2)) AS unicode_a, |
HEX(CONVERT(b USING ucs2)) AS unicode_b, |
HEX(a), HEX(b),
|
BINARY LOWER(a)=LOWER(b) AS eq_lower, |
BINARY UPPER(a)=UPPER(b) AS eq_upper, |
a,b
|
FROM
|
(
|
SELECT DISTINCT IF(BINARY a<b,a,b) AS a,IF(binary a>=b,a,b) AS b from t21 |
) d1
|
ORDER BY eq_lower, unicode_a, unicode_b; |
+-----------+-----------+--------+--------+----------+----------+-----+-----+
|
| unicode_a | unicode_b | HEX(a) | HEX(b) | eq_lower | eq_upper | a | b |
|
+-----------+-----------+--------+--------+----------+----------+-----+-----+
|
| 0049 | 0131 | 49 | C4B1 | 0 | 1 | I | ı |
|
| 0053 | 017F | 53 | C5BF | 0 | 1 | S | ſ |
|
| 00B5 | 039C | C2B5 | CE9C | 0 | 1 | µ | Μ |
|
| 0345 | 0399 | CD85 | CE99 | 0 | 1 | ͅ | Ι |
|
| 0392 | 03D0 | CE92 | CF90 | 0 | 1 | Β | ϐ |
|
| 0395 | 03F5 | CE95 | CFB5 | 0 | 1 | Ε | ϵ |
|
| 0398 | 03D1 | CE98 | CF91 | 0 | 1 | Θ | ϑ |
|
| 0399 | 1FBE | CE99 | E1BEBE | 0 | 1 | Ι | ι |
|
| 039A | 03F0 | CE9A | CFB0 | 0 | 1 | Κ | ϰ |
|
| 03A0 | 03D6 | CEA0 | CF96 | 0 | 1 | Π | ϖ |
|
| 03A1 | 03F1 | CEA1 | CFB1 | 0 | 1 | Ρ | ϱ |
|
| 03A3 | 03C2 | CEA3 | CF82 | 0 | 1 | Σ | ς |
|
| 03A6 | 03D5 | CEA6 | CF95 | 0 | 1 | Φ | ϕ |
|
| 0412 | 1C80 | D092 | E1B280 | 0 | 1 | В | ᲀ |
|
| 0414 | 1C81 | D094 | E1B281 | 0 | 1 | Д | ᲁ |
|
| 041E | 1C82 | D09E | E1B282 | 0 | 1 | О | ᲂ |
|
| 0421 | 1C83 | D0A1 | E1B283 | 0 | 1 | С | ᲃ |
|
| 0422 | 1C84 | D0A2 | E1B284 | 0 | 1 | Т | ᲄ |
|
| 042A | 1C86 | D0AA | E1B286 | 0 | 1 | Ъ | ᲆ |
|
| 0462 | 1C87 | D1A2 | E1B287 | 0 | 1 | Ѣ | ᲇ |
|
| 1C88 | A64A | E1B288 | EA998A | 0 | 1 | ᲈ | Ꙋ |
|
| 0049 | 0130 | 49 | C4B0 | 1 | 0 | I | İ |
|
| 004B | 212A | 4B | E284AA | 1 | 0 | K | K |
|
| 00C5 | 212B | C385 | E284AB | 1 | 0 | Å | Å |
|
| 00DF | 1E9E | C39F | E1BA9E | 1 | 0 | ß | ẞ |
|
| 03A9 | 2126 | CEA9 | E284A6 | 1 | 0 | Ω | Ω |
|
+-----------+-----------+--------+--------+----------+----------+-----+-----+
|
26 rows in set (0.005 sec)
|
Cons of toupper vs tolower comparison
- The tolower variant will compare exactly like strcasecmp() did
- The toupper variant will compare close to how utf8mb3_general_ci works
Attachments
Issue Links
- blocks
-
MDEV-27490 Allow full utf8mb4 for identifiers
- Stalled
- causes
-
MDEV-33806 Server crashes when executing Admin SQL/DML after setting character_set_collations to utf8mb3_general1400_as_ci
- Closed
- is duplicated by
-
MDEV-23465 Implement a collation for identifiers
- Closed
- relates to
-
CONC-682 Collation IDs for utf8mbX_general_as_ci
- Open
-
MDEV-31531 Remove my_casedn_str() and my_caseup_str()
- Closed
-
MDEV-33084 LASTVAL(t1) and LASTVAL(T1) do not work well with lower-case-table-names=0
- Closed
-
MDEV-33086 SHOW OPEN TABLES IN DB1 -- is case insensitive with lower-case-table-names=0
- Closed
-
MDEV-33088 Cannot create triggers in the database `MYSQL`
- Closed
-
MDEV-33110 HANDLER commands are case insensitive with lower-case-table-names=0
- Closed
-
MDEV-33119 User is case insensitive in INFORMATION_SCHEMA.VIEWS
- Closed
-
MDEV-33120 System log table names are case insensitive with lower-cast-table-names=0
- Closed
-
MDEV-23465 Implement a collation for identifiers
- Closed
-
MDEV-30716 Wrong casefolding in xxx_unicode_520_ci for U+0700..U+07FF
- Closed
-
MDEV-33050 Build-in schemas like oracle_schema are accent insensitive
- Closed
-
MDEV-33085 Tables T1 and t1 do not work well with ENGINE=CSV and lower-case-table-names=0
- Closed
-
MDEV-33103 LOCK TABLE t1 AS t2 -- alias is not case sensitive with lower-case-table-names=0
- Closed
-
MDEV-33109 DROP DATABASE MYSQL -- does not drop SP with lower-case-table-names=0
- Closed
-
MDEV-33621 Unify duplicate code in my_wildcmp_uca_impl() and my_wildcmp_unicode_impl()
- Closed