[MDEV-23465] Implement a collation for identifiers - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Major
Resolution: Duplicate
Fix Version/s: N/A
Component/s: Character Sets
Labels:
None

Description

In the code we use two things to compare identifiers:

my_strcasecmp(system_charset_info, ...), or just system_charset_info->strcasecmp(..) when used directly without the macro. This comparison way is used when the two strings are known to be 0-terminated.
system_charset_info->strnncoll(...). This way is used when we know the lengths of the operands and/or the strings are not necessarily 0-terminated.

There is a problem with this choice:

strcasecmp() performs accent sensitive case insensitive comparison
strnncoll() performs accent insensitive case insensitive comparison

This mismatch causes troubles (see examples below).

To avoid this mismatch, we need a dedicated collation for identifiers, whose strnncoll() compares in the accent sensitive case insensitive way, to have both strcasecmp() and strnncoll() compare in the same style.

A tentative name is utf8_general_as_ci, or even utf8_identifier_ci.
Under terms of this task, it may not even be necessary to expose this collation to the SQL level. But later, we should revise INFORMATION_SCHEMA to use this collation for certain columns.

The following scripts demonstrate the difference:

Column names are accent sensitive (correct)

CREATE OR REPLACE TABLE t1 (a INT, ä INT);

SHOW CREATE TABLE t1;

+-------+--------------------------------------------------------------------------------------------------------------------+

| Table | Create Table                                                                                                       |

+-------+--------------------------------------------------------------------------------------------------------------------+

| t1    | CREATE TABLE `t1` (

  `a` int(11) DEFAULT NULL,

  `ä` int(11) DEFAULT NULL

) ENGINE=InnoDB DEFAULT CHARSET=latin1  |

+-------+--------------------------------------------------------------------------------------------------------------------+

1 row in set (0.002 sec)

Notice, the two columns co-exist.

Index names demonstrate both accent sensitive (correct) and accent insensitive (wrong) behavior

CREATE OR REPLACE TABLE t1 (a INT, KEY a(a), KEY ä(a));

SHOW WARNINGS;

+-------+------+-------------------------------------------------------------------------------------+

| Level | Code | Message                                                                             |

+-------+------+-------------------------------------------------------------------------------------+

| Note  | 1831 | Duplicate index `ä`. This is deprecated and will be disallowed in a future release  |

+-------+------+-------------------------------------------------------------------------------------+

Notice, index names were compared in accent insensitive way to display the warning.
However, on ALTER, they are compared in accent sensitive way as expected:

ALTER TABLE t1 DROP INDEX ä;

SHOW CREATE TABLE t1;

+-------+--------------------------------------------------------------------------------------------------------+

| Table | Create Table                                                                                           |

+-------+--------------------------------------------------------------------------------------------------------+

| t1    | CREATE TABLE `t1` (

  `a` int(11) DEFAULT NULL,

  KEY `a` (`a`)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 |

+-------+--------------------------------------------------------------------------------------------------------+

The second ALTER does not drop anything:

ALTER TABLE t1 DROP INDEX ä;

ERROR 1091 (42000): Can't DROP INDEX `ä`; check that it exists

SP variables are accent insensitive (wrong)

DELIMITER $$

BEGIN NOT ATOMIC

  DECLARE a INT;

  DECLARE ä INT;

END;

$$

DELIMITER ;

ERROR 1331 (42000): Duplicate variable: ä

ROW fields are accent sensitive (correct) on explicit declaration but accent insensitive (wrong) on anchored declaration

DELIMITER $$

BEGIN NOT ATOMIC

  DECLARE rec ROW(a INT, ä INT) DEFAULT (1,2);

  SELECT rec.a, rec.ä;

END;

$$

DELIMITER ;

+-------+--------+

| rec.a | rec.ä  |

+-------+--------+

|     1 |      2 |

+-------+--------+

The above is correct.

CREATE OR REPLACE TABLE t1 (a INT, ä INT);

INSERT INTO t1 VALUES (1,2);

DELIMITER $$

BEGIN NOT ATOMIC

  DECLARE rec ROW TYPE OF t1 DEFAULT (1,2);

  SELECT rec.a, rec.ä;

END;

$$

DELIMITER ;

+-------+--------+

| rec.a | rec.ä  |

+-------+--------+

|     1 |      1 |

+-------+--------+

The above is wrong. Notice, the expected value for rec.ä should be 2, like in the first example.

Attachments

Issue Links

duplicates

MDEV-31340 Remove MY_COLLATION_HANDLER::strcasecmp()

Closed

is blocked by

MDEV-27009 Add UCA-14.0.0 collations

Closed

relates to

MDEV-31340 Remove MY_COLLATION_HANDLER::strcasecmp()

Closed

MDEV-19123 Change default charset from latin1 to utf8mb4

Closed

Activity

Transition	Time In Source Status	Execution Times

Alexander Barkov made transition - 2024-07-09 06:44

Open

Closed

1426d 4h 42m

People

Assignee:: Alexander Barkov

Reporter:: Alexander Barkov

Votes:: 1 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2020-08-13 02:01

Updated:: 2024-07-09 06:44

Resolved:: 2024-07-09 06:44

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server