[MDEV-24009] alter table ~collate utf8mb4_unicode_ci Created: 2020-10-22  Updated: 2022-06-27  Resolved: 2022-06-27

Status: Closed
Project: MariaDB Server
Component/s: Character Sets, Data Definition - Alter Table
Affects Version/s: 10.3.24, 10.2, 10.3, 10.4, 10.5
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: ssauravy Assignee: Alexander Barkov
Resolution: Won't Fix Votes: 1
Labels: character-set, collation, upstream-wontfix
Environment:

CentOS Linux release 7.7.1908 (Core)



 Description   

We have confirmed that there is a problem with the collation process of utf8mb4_unicode_ci.
In general, we have seen that MariaDB manages the values of empty space('') and char(0) differently.
After that, as a result of performing the character set/collation change work, in utf8mb4_unicode_ci, the above acronyms were duplicated.
It was confirmed that a duplicate entry'' for key'PRIMARY' error occurred.
Please refer to the test results below.

 
MariaDB [test]> status;
--------------
mysql  Ver 15.1 Distrib 10.3.24-MariaDB, for Linux (x86_64) using readline 5.1
 
Connection id:		766474
Current database:	test
Current user:		root@localhost
SSL:			Not in use
Current pager:		stdout
Using outfile:		''
Using delimiter:	;
Server:			MariaDB
Server version:		10.3.24-MariaDB-log MariaDB Server
Protocol version:	10
Connection:		Localhost via UNIX socket
Server characterset:	utf8mb4
Db     characterset:	utf8mb4
Client characterset:	utf8mb4
Conn.  characterset:	utf8mb4
...
 
MariaDB [test]> create table t1( c1 varchar(100) character set utf8 not null, nm varchar(100) , primary key(c1)) engine=innodb character set utf8;
Query OK, 0 rows affected (0.176 sec)
 
MariaDB [test]> insert into t1 values('','');
Query OK, 1 row affected (0.074 sec)
 
MariaDB [test]> insert into t1 values(char(0),'');
Query OK, 1 row affected (0.088 sec)
 
MariaDB [test]> select * from t1 where c1='';
+----+------+
| c1 | nm   |
+----+------+
|    |      |
+----+------+
1 row in set (0.000 sec)
 
MariaDB [test]> select * from t1 where c1=char(0);
+----+------+
| c1 | nm   |
+----+------+
|    |      |
+----+------+
1 row in set (0.000 sec)
 
MariaDB [test]> alter table t1 character set utf8mb4;
Query OK, 0 rows affected (0.065 sec)
Records: 0  Duplicates: 0  Warnings: 0
 
MariaDB [test]> 
MariaDB [test]> alter table t1 modify c1 varchar(100) character set utf8mb4 collate utf8mb4_unicode_ci not null, modify nm varchar(100) character set utf8mb4 collate utf8mb4_unicode_ci ;
ERROR 1062 (23000): Duplicate entry '' for key 'PRIMARY' <------------------------
 
MariaDB [test]> alter table t1 modify c1 varchar(100) character set utf8mb4 collate utf8mb4_general_ci not null, modify nm varchar(100) character set utf8mb4 collate utf8mb4_general_ci ;
Query OK, 2 rows affected (0.476 sec)              
Records: 2  Duplicates: 0  Warnings: 0
 
MariaDB [test]> O.M.G.



 Comments   
Comment by Alice Sherepa [ 2020-10-22 ]

the same behavior on 5.5-10.5

MariaDB [test]> select char(0)=("" collate utf8mb4_unicode_ci);
+-----------------------------------------+
| char(0)=("" collate utf8mb4_unicode_ci) |
+-----------------------------------------+
|                                       1 |
+-----------------------------------------+
1 row in set (0.000 sec)
 
MariaDB [test]> select char(0)=("" collate utf8mb4_general_ci);
+-----------------------------------------+
| char(0)=("" collate utf8mb4_general_ci) |
+-----------------------------------------+
|                                       0 |
+-----------------------------------------+
1 row in set (0.000 sec)

Comment by ssauravy [ 2020-10-28 ]

From the current mysql point of view, it has been confirmed as a bug of utf8mb4_general_ci.
https://bugs.mysql.com/?id=101311
In MariaDB, utf8mb4_general_ci is currently the default sorted set.
As an alternative, what sort set should we use?
utf8mb4_unicode_520_ci?

Comment by Alexander Barkov [ 2021-01-03 ]

CHAR(0) also known as U+0000 is an ignorable character in utf8mb4_unicode_ci. It does not have any weight.
This makes '' equal to CHAR(0). This is by design. There are no bug here.

utf8_general_ci is a simpler collation, with one-to-one mapping between characters and weights. It does not support ignorable characters, so it treats CHAR(0) differently.

Comment by ssauravy [ 2021-01-03 ]

What I want to point out is the part about PK.
If'' is intentionally designed so that CHAR(0) is equal
Due to this encoding conversion between different charsets dealing with PK,
Serious errors and performance problems in Unique/Primary key
Happens.

This fact is a PK consisting of utf8mb4_unicode_ci
Can't change to other character set such as utf8mb4_general_ci?

Comment by Alexander Barkov [ 2021-01-03 ]

Every collation has its own rules for sorting and uniqueness. In this example, utf8mb4_general_ci and utf8mb4_unicode_ci have different rules..
Unique violation is surely possible if you change a column collation and you have a PRIMARY/UNIQUE key on this column.
This is the purpose of collations: to have different sorting and uniqueness rules.

utf8mb4_unicode_ci follows the Unicode Collation Algorithm and supports so called ignorable characters (or ignorables). CHAR(0) is one of those ignorable characters. See unicode.org/reports/tr10/#Ignorables_Defn for details.

Yes, it is intentionally designed so that '' and CHAR(0) are equal for utf8mb4_unicode_ci, because:

  • CHAR(0) is an ignorable character
  • any arbitrary string 'xxx' is equal to concatenation of this string and an ignorable character
  • in this example, empty string '' is equal to CONCAT('', CHAR(0)).

To avoid unique violations you can do either of the following:

  • Choose a proper collation from the beginning, during CREATE TABLE
  • Or remove duplicates before altering the collation.

To find potential duplicates, you can do something like this:

SELECT GROUP_CONCAT(c1), COUNT(*) FROM t1
  GROUP BY c1 COLLATE utf8mb4_unicode_ci
  HAVING COUNT(*)>1;

Comment by ssauravy [ 2021-01-04 ]

Checking for duplicates between different rows
SELECT GROUP_CONCAT(c1), COUNT FROM t1
GROUP BY c1 COLLATE utf8mb4_unicode_ci
HAVING COUNT>1;
Is possible with

The part of finding a character already stored in a column of the same row is annoying.

Comment by Alexander Barkov [ 2022-06-27 ]

There are no bugs here. Closing as won't fix.

Generated at Thu Feb 08 09:26:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.