[MDEV-24009] alter table ~collate utf8mb4_unicode_ci Created: 2020-10-22 Updated: 2022-06-27 Resolved: 2022-06-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Character Sets, Data Definition - Alter Table |
| Affects Version/s: | 10.3.24, 10.2, 10.3, 10.4, 10.5 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | ssauravy | Assignee: | Alexander Barkov |
| Resolution: | Won't Fix | Votes: | 1 |
| Labels: | character-set, collation, upstream-wontfix | ||
| Environment: |
CentOS Linux release 7.7.1908 (Core) |
||
| Description |
|
We have confirmed that there is a problem with the collation process of utf8mb4_unicode_ci.
|
| Comments |
| Comment by Alice Sherepa [ 2020-10-22 ] | |||||||||||||||
|
The same behavior occurs on 5.5-10.5.
| |||||||||||||||
| Comment by ssauravy [ 2020-10-28 ] | |||||||||||||||
|
From the current MySQL point of view, this has been confirmed as a bug in utf8mb4_general_ci. | |||||||||||||||
| Comment by Alexander Barkov [ 2021-01-03 ] | |||||||||||||||
|
CHAR(0), also known as U+0000, is an ignorable character in utf8mb4_unicode_ci: it does not have any weight. utf8_general_ci is a simpler collation, with a one-to-one mapping between characters and weights. It does not support ignorable characters, so it treats CHAR(0) differently. | |||||||||||||||
| Comment by ssauravy [ 2021-01-03 ] | |||||||||||||||
|
What I want to point out is the primary key (PK) aspect: the PK in question uses utf8mb4_unicode_ci. | |||||||||||||||
| Comment by Alexander Barkov [ 2021-01-03 ] | |||||||||||||||
|
Every collation has its own rules for sorting and uniqueness. In this example, utf8mb4_general_ci and utf8mb4_unicode_ci have different rules. utf8mb4_unicode_ci follows the Unicode Collation Algorithm and supports so-called ignorable characters (or ignorables). CHAR(0) is one of those ignorable characters. See unicode.org/reports/tr10/#Ignorables_Defn for details. Yes, it is intentionally designed so that '' and CHAR(0) are equal for utf8mb4_unicode_ci, because:
To avoid unique violations you can do either of the following:
To find potential duplicates, you can do something like this:
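A minimal sketch of such a check, not the query from the original comment. It assumes a hypothetical table t1 whose primary key column a currently uses utf8mb4_general_ci; both names are invented for illustration. The idea is to group the existing values under the target collation before running ALTER TABLE, so that values which utf8mb4_unicode_ci considers equal (such as '' and CHAR(0)) end up in the same group:

```sql
-- Hypothetical table for illustration; t1 and a do not come from the report.
CREATE TABLE t1 (
  a VARCHAR(10) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
  PRIMARY KEY (a)
);

-- '' and CHAR(0) are distinct keys under utf8mb4_general_ci,
-- so both rows can be inserted.
INSERT INTO t1 VALUES (''), (CHAR(0 USING utf8mb4));

-- Group the values under the target collation. Any group with more than
-- one row would violate the unique key after converting the column.
-- HEX() makes otherwise invisible characters such as U+0000 readable.
SELECT a COLLATE utf8mb4_unicode_ci AS k,
       COUNT(*) AS cnt,
       GROUP_CONCAT(CONCAT('0x', HEX(a)) SEPARATOR ', ') AS colliding_values
FROM t1
GROUP BY k
HAVING cnt > 1;
```

If the query returns no rows, the existing values should remain unique under utf8mb4_unicode_ci. Otherwise the listed values would need to be cleaned up or merged before changing the collation.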
| |||||||||||||||
| Comment by ssauravy [ 2021-01-04 ] | |||||||||||||||
|
Checking for duplicates between different rows. The part of finding a character already stored in a column of the same row is annoying. | |||||||||||||||
| Comment by Alexander Barkov [ 2022-06-27 ] | |||||||||||||||
|
There are no bugs here. Closing as won't fix. |