[MDEV-23465] Implement a collation for identifiers Created: 2020-08-13  Updated: 2023-10-25

Status: Open
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: None

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Blocks
is blocked by MDEV-27009 Add UCA-14.0.0 collations Closed

 Description   

In the code we use two things to compare identifiers:

  • my_strcasecmp(system_charset_info, ...), or just system_charset_info->strcasecmp(..) when used directly without the macro. This comparison way is used when the two strings are known to be 0-terminated.
  • system_charset_info->strnncoll(...). This way is used when we know the lengths of the operands and/or the strings are not necessarily 0-terminated.

There is a problem with this choice:

  • strcasecmp() performs accent sensitive case insensitive comparison
  • strnncoll() performs accent insensitive case insensitive comparison

This mismatch causes troubles (see examples below).

To avoid this mismatch, we need a dedicated collation for identifiers, whose strnncoll() compares in the accent sensitive case insensitive way, to have both strcasecmp() and strnncoll() compare in the same style.

A tentative name is utf8_general_as_ci, or even utf8_identifier_ci.
Under terms of this task, it may not even be necessary to expose this collation to the SQL level. But later, we should revise INFORMATION_SCHEMA to use this collation for certain columns.

The following scripts demonstrate the difference:

  • Column names are accent sensitive (correct)

    CREATE OR REPLACE TABLE t1 (a INT, ä INT);
    

    SHOW CREATE TABLE t1;
    +-------+--------------------------------------------------------------------------------------------------------------------+
    | Table | Create Table                                                                                                       |
    +-------+--------------------------------------------------------------------------------------------------------------------+
    | t1    | CREATE TABLE `t1` (
      `a` int(11) DEFAULT NULL,
      `ä` int(11) DEFAULT NULL
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1  |
    +-------+--------------------------------------------------------------------------------------------------------------------+
    1 row in set (0.002 sec)
    

    Notice, the two columns co-exist.

  • Index names demonstrate both accent sensitive (correct) and accent insensitive (wrong) behavior

    CREATE OR REPLACE TABLE t1 (a INT, KEY a(a), KEY ä(a));
    SHOW WARNINGS;
    

    +-------+------+-------------------------------------------------------------------------------------+
    | Level | Code | Message                                                                             |
    +-------+------+-------------------------------------------------------------------------------------+
    | Note  | 1831 | Duplicate index `ä`. This is deprecated and will be disallowed in a future release  |
    +-------+------+-------------------------------------------------------------------------------------+
    

    Notice, index names were compared in accent insensitive way to display the warning.
    However, on ALTER, they are compared in accent sensitive way as expected:

    ALTER TABLE t1 DROP INDEX ä;
    SHOW CREATE TABLE t1;
    

    +-------+--------------------------------------------------------------------------------------------------------+
    | Table | Create Table                                                                                           |
    +-------+--------------------------------------------------------------------------------------------------------+
    | t1    | CREATE TABLE `t1` (
      `a` int(11) DEFAULT NULL,
      KEY `a` (`a`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
    +-------+--------------------------------------------------------------------------------------------------------+
    

    The second ALTER does not drop anything:

    ALTER TABLE t1 DROP INDEX ä;
    

    ERROR 1091 (42000): Can't DROP INDEX `ä`; check that it exists
    

  • SP variables are accent insensitive (wrong)

    DELIMITER $$
    BEGIN NOT ATOMIC
      DECLARE a INT;
      DECLARE ä INT;
    END;
    $$
    DELIMITER ;
    

    ERROR 1331 (42000): Duplicate variable: ä
    

  • ROW fields are accent sensitive (correct) on explicit declaration but accent insensitive (wrong) on anchored declaration

    DELIMITER $$
    BEGIN NOT ATOMIC
      DECLARE rec ROW(a INT, ä INT) DEFAULT (1,2);
      SELECT rec.a, rec.ä;
    END;
    $$
    DELIMITER ;
    

    +-------+--------+
    | rec.a | rec.ä  |
    +-------+--------+
    |     1 |      2 |
    +-------+--------+
    

    The above is correct.

    CREATE OR REPLACE TABLE t1 (a INT, ä INT);
    INSERT INTO t1 VALUES (1,2);
    DELIMITER $$
    BEGIN NOT ATOMIC
      DECLARE rec ROW TYPE OF t1 DEFAULT (1,2);
      SELECT rec.a, rec.ä;
    END;
    $$
    DELIMITER ;
    

    +-------+--------+
    | rec.a | rec.ä  |
    +-------+--------+
    |     1 |      1 |
    +-------+--------+
    

    The above is wrong. Notice, the expected value for rec.ä should be 2, like in the first example.


Generated at Thu Feb 08 09:22:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.