Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-31340

Remove MY_COLLATION_HANDLER::strcasecmp()

    XMLWordPrintable

Details

    Description

      There are two parallel comparison systems in MariaDB collation library, implemented as virtual functions in MY_COLLATION_HANDLER:

      • Comparison according to the collation, provided by these functions

          int     (*strnncoll)(CHARSET_INFO *,
                               const uchar *, size_t, const uchar *, size_t, my_bool);
          int     (*strnncollsp)(CHARSET_INFO *,
                                 const uchar *, size_t, const uchar *, size_t);
         
          int     (*strnncollsp_nchars)(CHARSET_INFO *,
                                        const uchar *str1, size_t len1,
                                        const uchar *str2, size_t len2,
                                        size_t nchars,
                                        uint flags);
        

      • Comparison in case insensitive (but accent sensitive) style, implemented by this function:

          int  (*strcasecmp)(CHARSET_INFO *, const char *, const char *);
        

        Note, accent and case sensitivity of the collation does not matter. strcasecmp() always works using accent sensitive case insensitive comparison style.

      These two parallel systems are redundant.

      Note, strcasecmp() is used mostly to compare identifiers, while the functions of the first group are used to compare data.

      Let's get rid of the second comparison system:
      1. Remove MY_COLLATION_HANDLER::strcasecmp()
      2. Introduce a new collation utf8mb4_general_as_ci. Note, it should work for the entire Unicode range U+0000..U+10FFFF.
      3. Replace all calls for:

      system_charset_info->coll->strcasecmp()
      

      to calls for

      my_charset_utf8mb4_general_as_ci->coll->strnncoll***()
      

      The change would generally be quite mechanic. However, there is one small problem: strcasecmp() accepts 0-terminated strings, while the strnncoll-alike functions accept the pointer and the length. So some refactoring will be needed. Note, Monty earlier changed many MariaDB C data types to use LEX_CSTRING (instead of just a const char pointer) to store names. So this part of the current task will switch some more C data types to LEX_CSTRING.

      tolower vs toupper comparison

      Another option is to implement utf8mb3_general_as_ci which will compare upper cases (instead of lower cases).

      The difference is in a few dozen BMP characters. This script finds all those characters:

      CREATE OR REPLACE TABLE t1 (a CHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci) ENGINE=MyISAM;
      DELIMITER $$
      FOR i IN 0..0xFFFF
      DO
        INSERT INTO t1 VALUES (CHAR(i USING ucs2));
      END FOR;
      $$
      DELIMITER ;
      ALTER TABLE t1
        ADD has_casefolding INT DEFAULT (BINARY LOWER(a)<>a OR BINARY UPPER(a)<>a),
        ADD KEY(has_casefolding, a);
       
      CREATE OR REPLACE TABLE t21 AS
      SELECT
        HEX(t1.a) AS hex_a, HEX(t2.a) AS hex_b,
        BINARY LOWER(t1.a)=LOWER(t2.a) eq_lower,
        BINARY UPPER(t1.a)=UPPER(t2.a) AS eq_upper,
        t1.a AS a, t2.a AS b
      FROM
        t1 t1, t1 t2
      WHERE
        t1.has_casefolding=1
      AND (BINARY LOWER(t1.a)=LOWER(t2.a))<>(BINARY UPPER(t1.a)=UPPER(t2.a));
       
      SELECT
        HEX(CONVERT(a USING ucs2)) AS unicode_a,
        HEX(CONVERT(b USING ucs2)) AS unicode_b,
        t21.* FROM t21;
      

      +-----------+-----------+--------+--------+----------+----------+------+------+
      | unicode_a | unicode_b | hex_a  | hex_b  | eq_lower | eq_upper | a    | b    |
      +-----------+-----------+--------+--------+----------+----------+------+------+
      |      1E9E |      00DF | E1BA9E |   C39F |        1 |        0 |    ẞ |    ß |
      |      0399 |      0345 |   CE99 |   CD85 |        0 |        1 |    Ι |    ͅ |
      |      03B9 |      0345 |   CEB9 |   CD85 |        0 |        1 |    ι |    ͅ |
      |      1FBE |      0345 | E1BEBE |   CD85 |        0 |        1 |    ι |    ͅ |
      |      212B |      00C5 | E284AB |   C385 |        1 |        0 |    Å |    Å |
      |      212B |      00E5 | E284AB |   C3A5 |        1 |        0 |    Å |    å |
      |      00C5 |      212B |   C385 | E284AB |        1 |        0 |    Å |    Å |
      |      00E5 |      212B |   C3A5 | E284AB |        1 |        0 |    å |    Å |
      |      0130 |      0049 |   C4B0 |     49 |        1 |        0 |    İ |    I |
      |      0131 |      0049 |   C4B1 |     49 |        0 |        1 |    ı |    I |
      |      0130 |      0069 |   C4B0 |     69 |        1 |        0 |    İ |    i |
      |      0131 |      0069 |   C4B1 |     69 |        0 |        1 |    ı |    i |
      |      0049 |      0130 |     49 |   C4B0 |        1 |        0 |    I |    İ |
      |      0069 |      0130 |     69 |   C4B0 |        1 |        0 |    i |    İ |
      |      0049 |      0131 |     49 |   C4B1 |        0 |        1 |    I |    ı |
      |      0069 |      0131 |     69 |   C4B1 |        0 |        1 |    i |    ı |
      |      212A |      004B | E284AA |     4B |        1 |        0 |    K |    K |
      |      212A |      006B | E284AA |     6B |        1 |        0 |    K |    k |
      |      004B |      212A |     4B | E284AA |        1 |        0 |    K |    K |
      |      006B |      212A |     6B | E284AA |        1 |        0 |    k |    K |
      |      017F |      0053 |   C5BF |     53 |        0 |        1 |    ſ |    S |
      |      017F |      0073 |   C5BF |     73 |        0 |        1 |    ſ |    s |
      |      0053 |      017F |     53 |   C5BF |        0 |        1 |    S |    ſ |
      |      0073 |      017F |     73 |   C5BF |        0 |        1 |    s |    ſ |
      |      1E9B |      1E60 | E1BA9B | E1B9A0 |        0 |        1 |    ẛ |    Ṡ |
      |      1E9B |      1E61 | E1BA9B | E1B9A1 |        0 |        1 |    ẛ |    ṡ |
      |      1E60 |      1E9B | E1B9A0 | E1BA9B |        0 |        1 |    Ṡ |    ẛ |
      |      1E61 |      1E9B | E1B9A1 | E1BA9B |        0 |        1 |    ṡ |    ẛ |
      |      03D0 |      0392 |   CF90 |   CE92 |        0 |        1 |    ϐ |    Β |
      |      03D0 |      03B2 |   CF90 |   CEB2 |        0 |        1 |    ϐ |    β |
      |      0392 |      03D0 |   CE92 |   CF90 |        0 |        1 |    Β |    ϐ |
      |      03B2 |      03D0 |   CEB2 |   CF90 |        0 |        1 |    β |    ϐ |
      |      03F5 |      0395 |   CFB5 |   CE95 |        0 |        1 |    ϵ |    Ε |
      |      03F5 |      03B5 |   CFB5 |   CEB5 |        0 |        1 |    ϵ |    ε |
      |      0395 |      03F5 |   CE95 |   CFB5 |        0 |        1 |    Ε |    ϵ |
      |      03B5 |      03F5 |   CEB5 |   CFB5 |        0 |        1 |    ε |    ϵ |
      |      03D1 |      0398 |   CF91 |   CE98 |        0 |        1 |    ϑ |    Θ |
      |      03F4 |      0398 |   CFB4 |   CE98 |        1 |        0 |    ϴ |    Θ |
      |      03D1 |      03B8 |   CF91 |   CEB8 |        0 |        1 |    ϑ |    θ |
      |      03F4 |      03B8 |   CFB4 |   CEB8 |        1 |        0 |    ϴ |    θ |
      |      0398 |      03D1 |   CE98 |   CF91 |        0 |        1 |    Θ |    ϑ |
      |      03B8 |      03D1 |   CEB8 |   CF91 |        0 |        1 |    θ |    ϑ |
      |      0398 |      03F4 |   CE98 |   CFB4 |        1 |        0 |    Θ |    ϴ |
      |      03B8 |      03F4 |   CEB8 |   CFB4 |        1 |        0 |    θ |    ϴ |
      |      0345 |      0399 |   CD85 |   CE99 |        0 |        1 |    ͅ |    Ι |
      |      1FBE |      0399 | E1BEBE |   CE99 |        0 |        1 |    ι |    Ι |
      |      0345 |      03B9 |   CD85 |   CEB9 |        0 |        1 |    ͅ |    ι |
      |      1FBE |      03B9 | E1BEBE |   CEB9 |        0 |        1 |    ι |    ι |
      |      0345 |      1FBE |   CD85 | E1BEBE |        0 |        1 |    ͅ |    ι |
      |      0399 |      1FBE |   CE99 | E1BEBE |        0 |        1 |    Ι |    ι |
      |      03B9 |      1FBE |   CEB9 | E1BEBE |        0 |        1 |    ι |    ι |
      |      03F0 |      039A |   CFB0 |   CE9A |        0 |        1 |    ϰ |    Κ |
      |      03F0 |      03BA |   CFB0 |   CEBA |        0 |        1 |    ϰ |    κ |
      |      039A |      03F0 |   CE9A |   CFB0 |        0 |        1 |    Κ |    ϰ |
      |      03BA |      03F0 |   CEBA |   CFB0 |        0 |        1 |    κ |    ϰ |
      |      039C |      00B5 |   CE9C |   C2B5 |        0 |        1 |    Μ |    µ |
      |      03BC |      00B5 |   CEBC |   C2B5 |        0 |        1 |    μ |    µ |
      |      00B5 |      039C |   C2B5 |   CE9C |        0 |        1 |    µ |    Μ |
      |      00B5 |      03BC |   C2B5 |   CEBC |        0 |        1 |    µ |    μ |
      |      03D6 |      03A0 |   CF96 |   CEA0 |        0 |        1 |    ϖ |    Π |
      |      03D6 |      03C0 |   CF96 |   CF80 |        0 |        1 |    ϖ |    π |
      |      03A0 |      03D6 |   CEA0 |   CF96 |        0 |        1 |    Π |    ϖ |
      |      03C0 |      03D6 |   CF80 |   CF96 |        0 |        1 |    π |    ϖ |
      |      03F1 |      03A1 |   CFB1 |   CEA1 |        0 |        1 |    ϱ |    Ρ |
      |      03F1 |      03C1 |   CFB1 |   CF81 |        0 |        1 |    ϱ |    ρ |
      |      03A1 |      03F1 |   CEA1 |   CFB1 |        0 |        1 |    Ρ |    ϱ |
      |      03C1 |      03F1 |   CF81 |   CFB1 |        0 |        1 |    ρ |    ϱ |
      |      03C2 |      03A3 |   CF82 |   CEA3 |        0 |        1 |    ς |    Σ |
      |      03A3 |      03C2 |   CEA3 |   CF82 |        0 |        1 |    Σ |    ς |
      |      03C3 |      03C2 |   CF83 |   CF82 |        0 |        1 |    σ |    ς |
      |      03C2 |      03C3 |   CF82 |   CF83 |        0 |        1 |    ς |    σ |
      |      03D5 |      03A6 |   CF95 |   CEA6 |        0 |        1 |    ϕ |    Φ |
      |      03D5 |      03C6 |   CF95 |   CF86 |        0 |        1 |    ϕ |    φ |
      |      03A6 |      03D5 |   CEA6 |   CF95 |        0 |        1 |    Φ |    ϕ |
      |      03C6 |      03D5 |   CF86 |   CF95 |        0 |        1 |    φ |    ϕ |
      |      2126 |      03A9 | E284A6 |   CEA9 |        1 |        0 |    Ω |    Ω |
      |      2126 |      03C9 | E284A6 |   CF89 |        1 |        0 |    Ω |    ω |
      |      03A9 |      2126 |   CEA9 | E284A6 |        1 |        0 |    Ω |    Ω |
      |      03C9 |      2126 |   CF89 | E284A6 |        1 |        0 |    ω |    Ω |
      |      1C80 |      0412 | E1B280 |   D092 |        0 |        1 |    ᲀ |    В |
      |      1C80 |      0432 | E1B280 |   D0B2 |        0 |        1 |    ᲀ |    в |
      |      0412 |      1C80 |   D092 | E1B280 |        0 |        1 |    В |    ᲀ |
      |      0432 |      1C80 |   D0B2 | E1B280 |        0 |        1 |    в |    ᲀ |
      |      1C81 |      0414 | E1B281 |   D094 |        0 |        1 |    ᲁ |    Д |
      |      1C81 |      0434 | E1B281 |   D0B4 |        0 |        1 |    ᲁ |    д |
      |      0414 |      1C81 |   D094 | E1B281 |        0 |        1 |    Д |    ᲁ |
      |      0434 |      1C81 |   D0B4 | E1B281 |        0 |        1 |    д |    ᲁ |
      |      1C82 |      041E | E1B282 |   D09E |        0 |        1 |    ᲂ |    О |
      |      1C82 |      043E | E1B282 |   D0BE |        0 |        1 |    ᲂ |    о |
      |      041E |      1C82 |   D09E | E1B282 |        0 |        1 |    О |    ᲂ |
      |      043E |      1C82 |   D0BE | E1B282 |        0 |        1 |    о |    ᲂ |
      |      1C83 |      0421 | E1B283 |   D0A1 |        0 |        1 |    ᲃ |    С |
      |      1C83 |      0441 | E1B283 |   D181 |        0 |        1 |    ᲃ |    с |
      |      0421 |      1C83 |   D0A1 | E1B283 |        0 |        1 |    С |    ᲃ |
      |      0441 |      1C83 |   D181 | E1B283 |        0 |        1 |    с |    ᲃ |
      |      1C84 |      0422 | E1B284 |   D0A2 |        0 |        1 |    ᲄ |    Т |
      |      1C85 |      0422 | E1B285 |   D0A2 |        0 |        1 |    ᲅ |    Т |
      |      1C84 |      0442 | E1B284 |   D182 |        0 |        1 |    ᲄ |    т |
      |      1C85 |      0442 | E1B285 |   D182 |        0 |        1 |    ᲅ |    т |
      |      0422 |      1C84 |   D0A2 | E1B284 |        0 |        1 |    Т |    ᲄ |
      |      0442 |      1C84 |   D182 | E1B284 |        0 |        1 |    т |    ᲄ |
      |      1C85 |      1C84 | E1B285 | E1B284 |        0 |        1 |    ᲅ |    ᲄ |
      |      0422 |      1C85 |   D0A2 | E1B285 |        0 |        1 |    Т |    ᲅ |
      |      0442 |      1C85 |   D182 | E1B285 |        0 |        1 |    т |    ᲅ |
      |      1C84 |      1C85 | E1B284 | E1B285 |        0 |        1 |    ᲄ |    ᲅ |
      |      A64A |      1C88 | EA998A | E1B288 |        0 |        1 |    Ꙋ |    ᲈ |
      |      A64B |      1C88 | EA998B | E1B288 |        0 |        1 |    ꙋ |    ᲈ |
      |      1C88 |      A64A | E1B288 | EA998A |        0 |        1 |    ᲈ |    Ꙋ |
      |      1C88 |      A64B | E1B288 | EA998B |        0 |        1 |    ᲈ |    ꙋ |
      |      1C86 |      042A | E1B286 |   D0AA |        0 |        1 |    ᲆ |    Ъ |
      |      1C86 |      044A | E1B286 |   D18A |        0 |        1 |    ᲆ |    ъ |
      |      042A |      1C86 |   D0AA | E1B286 |        0 |        1 |    Ъ |    ᲆ |
      |      044A |      1C86 |   D18A | E1B286 |        0 |        1 |    ъ |    ᲆ |
      |      1C87 |      0462 | E1B287 |   D1A2 |        0 |        1 |    ᲇ |    Ѣ |
      |      1C87 |      0463 | E1B287 |   D1A3 |        0 |        1 |    ᲇ |    ѣ |
      |      0462 |      1C87 |   D1A2 | E1B287 |        0 |        1 |    Ѣ |    ᲇ |
      |      0463 |      1C87 |   D1A3 | E1B287 |        0 |        1 |    ѣ |    ᲇ |
      +-----------+-----------+--------+--------+----------+----------+------+------+
      

      Note, toupper comparison considers more pairs as equal than tolower comparison:

      SELECT SUM(eq_lower), SUM(eq_upper) FROM t21;
      

      +---------------+---------------+
      | SUM(eq_lower) | SUM(eq_upper) |
      +---------------+---------------+
      |            21 |            96 |
      +---------------+---------------+
      

      A more compact table with distinct pairs (the character with a smaller code point is on the left)

      SELECT
        HEX(CONVERT(a USING ucs2)) AS unicode_a,
        HEX(CONVERT(b USING ucs2)) AS unicode_b,
        HEX(a), HEX(b),
        BINARY LOWER(a)=LOWER(b) AS eq_lower,
        BINARY UPPER(a)=UPPER(b) AS eq_upper,
        a,b
      FROM
      (
        SELECT DISTINCT IF(BINARY a<b,a,b) AS a,IF(binary a>=b,a,b) AS b from t21
      ) d1
      ORDER BY eq_lower, unicode_a, unicode_b;
      

      +-----------+-----------+--------+--------+----------+----------+-----+-----+
      | unicode_a | unicode_b | HEX(a) | HEX(b) | eq_lower | eq_upper | a   | b   |
      +-----------+-----------+--------+--------+----------+----------+-----+-----+
      |      0049 |      0131 |     49 |   C4B1 |        0 |        1 |   I |   ı |
      |      0053 |      017F |     53 |   C5BF |        0 |        1 |   S |   ſ |
      |      00B5 |      039C |   C2B5 |   CE9C |        0 |        1 |   µ |   Μ |
      |      0345 |      0399 |   CD85 |   CE99 |        0 |        1 |   ͅ |   Ι |
      |      0392 |      03D0 |   CE92 |   CF90 |        0 |        1 |   Β |   ϐ |
      |      0395 |      03F5 |   CE95 |   CFB5 |        0 |        1 |   Ε |   ϵ |
      |      0398 |      03D1 |   CE98 |   CF91 |        0 |        1 |   Θ |   ϑ |
      |      0399 |      1FBE |   CE99 | E1BEBE |        0 |        1 |   Ι |   ι |
      |      039A |      03F0 |   CE9A |   CFB0 |        0 |        1 |   Κ |   ϰ |
      |      03A0 |      03D6 |   CEA0 |   CF96 |        0 |        1 |   Π |   ϖ |
      |      03A1 |      03F1 |   CEA1 |   CFB1 |        0 |        1 |   Ρ |   ϱ |
      |      03A3 |      03C2 |   CEA3 |   CF82 |        0 |        1 |   Σ |   ς |
      |      03A6 |      03D5 |   CEA6 |   CF95 |        0 |        1 |   Φ |   ϕ |
      |      0412 |      1C80 |   D092 | E1B280 |        0 |        1 |   В |   ᲀ |
      |      0414 |      1C81 |   D094 | E1B281 |        0 |        1 |   Д |   ᲁ |
      |      041E |      1C82 |   D09E | E1B282 |        0 |        1 |   О |   ᲂ |
      |      0421 |      1C83 |   D0A1 | E1B283 |        0 |        1 |   С |   ᲃ |
      |      0422 |      1C84 |   D0A2 | E1B284 |        0 |        1 |   Т |   ᲄ |
      |      042A |      1C86 |   D0AA | E1B286 |        0 |        1 |   Ъ |   ᲆ |
      |      0462 |      1C87 |   D1A2 | E1B287 |        0 |        1 |   Ѣ |   ᲇ |
      |      1C88 |      A64A | E1B288 | EA998A |        0 |        1 |   ᲈ |   Ꙋ |
      |      0049 |      0130 |     49 |   C4B0 |        1 |        0 |   I |   İ |
      |      004B |      212A |     4B | E284AA |        1 |        0 |   K |   K |
      |      00C5 |      212B |   C385 | E284AB |        1 |        0 |   Å |   Å |
      |      00DF |      1E9E |   C39F | E1BA9E |        1 |        0 |   ß |   ẞ |
      |      03A9 |      2126 |   CEA9 | E284A6 |        1 |        0 |   Ω |   Ω |
      +-----------+-----------+--------+--------+----------+----------+-----+-----+
      26 rows in set (0.005 sec)
      

      Cons of toupper vs tolower comparison

      • The tolower variant will compare exactly like strcasecmp() did
      • The toupper variant will compare close to how utf8mb3_general_ci works

      Attachments

        Issue Links

          Activity

            People

              bar Alexander Barkov
              bar Alexander Barkov
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.