Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-15182

Incorrect compare latin extended unicode symbol with charset utf8mb4

Details

    • Bug
    • Status: Closed (View Workflow)
    • Minor
    • Resolution: Not a Bug
    • 10.1.30
    • N/A
    • Character Sets
    • Linux centos 7 , windows 7

    Description

      Hello
      In the table, these words are the same AKKUŞSAN = AKKUSSAN

      CREATE TABLE t.table1 (
        t varchar(50) NOT NULL,
        UNIQUE INDEX UK_table1_t (t)
      )
      ENGINE = INNODB
      CHARACTER SET utf8mb4
      COLLATE utf8mb4_general_ci;
      

      execute

      INSERT INTO table1 (t) VALUES ('AKKUŞSAN');
      INSERT INTO table1 (t) VALUES ('AKKUSSAN');
      

      we get error
      Duplicate entry 'AKKUSSAN' for key 'UK_table1_t'

      But symbol S != Ş , it is different strings

      Attachments

        Activity

          I think it's one of multiple mapping/matching/comparison rules for UTF8, I'll leave it to bar to point at the precise rule (or to say there isn't one).

          MySQL [test]> set names utf8mb4;
          Query OK, 0 rows affected (0.00 sec)
           
          MySQL [test]> select 'AKKUŞSAN' = 'AKKUSSAN';
          +--------------------------+
          | 'AKKUŞSAN' = 'AKKUSSAN'  |
          +--------------------------+
          |                        1 |
          +--------------------------+
          1 row in set (0.00 sec)
          

          elenst Elena Stepanova added a comment - I think it's one of multiple mapping/matching/comparison rules for UTF8, I'll leave it to bar to point at the precise rule (or to say there isn't one). MySQL [test]> set names utf8mb4; Query OK, 0 rows affected (0.00 sec)   MySQL [test]> select 'AKKUŞSAN' = 'AKKUSSAN' ; + --------------------------+ | 'AKKUŞSAN' = 'AKKUSSAN' | + --------------------------+ | 1 | + --------------------------+ 1 row in set (0.00 sec)
          bar Alexander Barkov added a comment - - edited

          utf8mb4_general_ci is accent insensitive, so this is the expected behaviour.

          Please use utf8mb4_thai_520_w2 if you need accent sensitive comparison.

          This script demonstrates that Ş != S when using utf8mb4_thai_520_w2:

          CREATE OR REPLACE TABLE t1 (
            t varchar(50) NOT NULL,
            UNIQUE INDEX UK_table1_t (t)
          ) CHARACTER SET utf8mb4 COLLATE utf8mb4_thai_520_w2;
          INSERT INTO t1 (t) VALUES ('AKKUŞSAN');
          INSERT INTO t1 (t) VALUES ('AKKUSSAN');
          SELECT COUNT(DISTINCT t) FROM t1;
          

          +-------------------+
          | COUNT(DISTINCT t) |
          +-------------------+
          |                 2 |
          +-------------------+
          

          bar Alexander Barkov added a comment - - edited utf8mb4_general_ci is accent insensitive, so this is the expected behaviour. Please use utf8mb4_thai_520_w2 if you need accent sensitive comparison. This script demonstrates that Ş != S when using utf8mb4_thai_520_w2 : CREATE OR REPLACE TABLE t1 ( t varchar (50) NOT NULL , UNIQUE INDEX UK_table1_t (t) ) CHARACTER SET utf8mb4 COLLATE utf8mb4_thai_520_w2; INSERT INTO t1 (t) VALUES ( 'AKKUŞSAN' ); INSERT INTO t1 (t) VALUES ( 'AKKUSSAN' ); SELECT COUNT ( DISTINCT t) FROM t1; +-------------------+ | COUNT(DISTINCT t) | +-------------------+ | 2 | +-------------------+

          Another option is to use utf8mb4_turkish_ci.

          It implements the following rules:

          C < ç = Ç
          G < ğ = Ğ
          H < ı = I
          O < ö = Ö
          S < ş = Ş
          U < ü = Ü
          

          It's accent sensitive only for these specified letters, and accent insensitive for all other letters.

          Should work for Turkish (and most likely for Azerbaijani)

          bar Alexander Barkov added a comment - Another option is to use utf8mb4_turkish_ci . It implements the following rules: C < ç = Ç G < ğ = Ğ H < ı = I O < ö = Ö S < ş = Ş U < ü = Ü It's accent sensitive only for these specified letters, and accent insensitive for all other letters. Should work for Turkish (and most likely for Azerbaijani)

          People

            bar Alexander Barkov
            workism@yandex.ru Stas
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.