[MDEV-15182] Incorrect compare latin extended unicode symbol with charset utf8mb4 Created: 2018-02-02  Updated: 2018-02-05  Resolved: 2018-02-05

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.1.30
Fix Version/s: N/A

Type: Bug Priority: Minor
Reporter: Stas Assignee: Alexander Barkov
Resolution: Not a Bug Votes: 0
Labels: innodb
Environment:

Linux centos 7 , windows 7



 Description   

Hello
In the table, these words are the same AKKUŞSAN = AKKUSSAN

CREATE TABLE t.table1 (
  t varchar(50) NOT NULL,
  UNIQUE INDEX UK_table1_t (t)
)
ENGINE = INNODB
CHARACTER SET utf8mb4
COLLATE utf8mb4_general_ci;

execute

INSERT INTO table1 (t) VALUES ('AKKUŞSAN');
INSERT INTO table1 (t) VALUES ('AKKUSSAN');

we get error
Duplicate entry 'AKKUSSAN' for key 'UK_table1_t'

But symbol S != Ş , it is different strings



 Comments   
Comment by Elena Stepanova [ 2018-02-05 ]

I think it's one of multiple mapping/matching/comparison rules for UTF8, I'll leave it to bar to point at the precise rule (or to say there isn't one).

MySQL [test]> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)
 
MySQL [test]> select 'AKKUŞSAN' = 'AKKUSSAN';
+--------------------------+
| 'AKKUŞSAN' = 'AKKUSSAN'  |
+--------------------------+
|                        1 |
+--------------------------+
1 row in set (0.00 sec)

Comment by Alexander Barkov [ 2018-02-05 ]

utf8mb4_general_ci is accent insensitive, so this is the expected behaviour.

Please use utf8mb4_thai_520_w2 if you need accent sensitive comparison.

This script demonstrates that Ş != S when using utf8mb4_thai_520_w2:

CREATE OR REPLACE TABLE t1 (
  t varchar(50) NOT NULL,
  UNIQUE INDEX UK_table1_t (t)
) CHARACTER SET utf8mb4 COLLATE utf8mb4_thai_520_w2;
INSERT INTO t1 (t) VALUES ('AKKUŞSAN');
INSERT INTO t1 (t) VALUES ('AKKUSSAN');
SELECT COUNT(DISTINCT t) FROM t1;

+-------------------+
| COUNT(DISTINCT t) |
+-------------------+
|                 2 |
+-------------------+

Comment by Alexander Barkov [ 2018-02-05 ]

Another option is to use utf8mb4_turkish_ci.

It implements the following rules:

C < ç = Ç
G < ğ = Ğ
H < ı = I
O < ö = Ö
S < ş = Ş
U < ü = Ü

It's accent sensitive only for these specified letters, and accent insensitive for all other letters.

Should work for Turkish (and most likely for Azerbaijani)

Generated at Thu Feb 08 08:19:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.