MariaDB Server / MDEV-27653

long uniques don't work with unicode collations

Details

    Description

      set names utf8;
      create table t1 (a text collate utf8_general_ci unique);
      show create table t1;
      insert t1 values ('a');
      insert t1 values ('ä');
      select * from t1;
      select distinct * from t1;
      select 'a' = 'ä';
      drop table t1;
      

      shows that one can insert two strings that compare as equal into a long unique index.

      It happens because calc_hash_for_unique() hashes the string length in bytes together with the string content. So even if two utf8 strings hash to the same value, they can have different lengths in bytes, and the long unique hashes will differ.
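A minimal Python sketch of the flaw described above. This is illustrative only: the real calc_hash_for_unique() is in MariaDB's C sources, and collation_key() below is a toy accent-stripping stand-in for utf8_general_ci, not the server's actual collation code.

```python
import hashlib
import unicodedata

def collation_key(s: str) -> bytes:
    """Toy stand-in for a collation sort key: utf8_general_ci treats
    'a' and 'ä' as equal, which we mimic by stripping accents."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower().encode("utf-8")

def buggy_hash(s: str) -> str:
    # Bug: the raw byte length is mixed into the hash, so strings that
    # compare as equal under the collation can still hash differently.
    byte_len = len(s.encode("utf-8"))
    return hashlib.sha1(byte_len.to_bytes(4, "little") + collation_key(s)).hexdigest()

def fixed_hash(s: str) -> str:
    # Fix: hash only the collation key; the byte length is irrelevant.
    return hashlib.sha1(collation_key(s)).hexdigest()

assert collation_key("a") == collation_key("ä")  # equal under the collation
assert buggy_hash("a") != buggy_hash("ä")        # but different long-unique hashes
assert fixed_hash("a") == fixed_hash("ä")        # hashing only the key fixes it
```

'a' is 1 byte in UTF-8 and 'ä' is 2, so the length prefix alone makes the buggy hashes differ, and both rows pass the uniqueness check.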

      The effect of this bug is that if one has a table with
      CREATE TABLE
      ...
      xxx VARCHAR(255) COLLATE 'utf8mb3_general_ci',
      ...
      UNIQUE INDEX `xxx` (column,...) USING HASH...

      Then the following problems may occur:

      • The table will accept duplicates for UTF8 strings that are different but should compare equal.
      • If master and slave are of different versions, where the bug is fixed on one of them or the hash value differs (see MDEV-32093), then the following can also happen:
        • An insert works on the master (which does not notice the duplicate key) but fails on the slave (which notices the duplicate key).

      The fix is to drop and re-add the unique index, or to run ALTER TABLE xxx ENGINE=InnoDB (for InnoDB tables), to rebuild the hash index.


          Activity

            serg Sergei Golubchik created issue -
            serg Sergei Golubchik made changes -
            Field Original Value New Value
            Affects Version/s 10.4 [ 22408 ]
            Affects Version/s 10.5 [ 23123 ]
            Affects Version/s 10.6 [ 24028 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            serg Sergei Golubchik made changes -
            Assignee Nikita Malyavin [ nikitamalyavin ]
            rjasdfiii Rick James added a comment - - edited

            Neither `LENGTH` nor `CHAR_LENGTH` is a valid discriminator. Consider these three character sequences; they all compare as "equal":

            > SELECT 'ä'='a', 'ä'='a';
            +----------+-----------+
            | 'ä'='a'  | 'ä'='a'  |
            +----------+-----------+
            |        1 |         1 |
            +----------+-----------+
            1 row in set (0.01 sec)
            

            They are

            61           L  LATIN SMALL LETTER A
            C3A4       L  LATIN SMALL LETTER A WITH DIAERESIS
            61 plus CC88        NSM  COMBINING DIAERESIS -- 3 bytes, 2 "characters"
            

            It may be wise to pull the plug on `UNIQUE` for `TEXT` until this tough problem can be thought through thoroughly.
            (BLOB should not have this problem.)
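The three forms above can be checked directly with Python's unicodedata module: the precomposed and combining spellings of 'ä' differ in both byte length and character count, even though a Unicode collation may compare them (and plain 'a') as equal.

```python
import unicodedata

precomposed = "\u00e4"   # 'ä'  LATIN SMALL LETTER A WITH DIAERESIS (C3A4 in UTF-8)
decomposed  = "a\u0308"  # 'a' + COMBINING DIAERESIS (61 CC88 in UTF-8)

# Different code-point sequences, but NFC normalization maps one to the other.
assert precomposed != decomposed
assert unicodedata.normalize("NFC", decomposed) == precomposed

# LENGTH (bytes) disagrees: 2 bytes vs 3 bytes.
assert len(precomposed.encode("utf-8")) == 2
assert len(decomposed.encode("utf-8")) == 3

# CHAR_LENGTH (characters) also disagrees: 1 character vs 2.
assert len(precomposed) == 1
assert len(decomposed) == 2
```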

            danblack Daniel Black made changes -

            marko Marko Mäkelä added a comment -

            For the record, I got this patch that should supposedly disable MDEV-371:

            diff --git a/sql/sql_table.cc b/sql/sql_table.cc
            index 6e8a4795f21..f8f3eefc114 100644
            --- a/sql/sql_table.cc
            +++ b/sql/sql_table.cc
            @@ -2441,6 +2441,8 @@ static inline void make_long_hash_field_name(LEX_CSTRING *buf, uint num)
             static Create_field * add_hash_field(THD * thd, List<Create_field> *create_list,
                                                   KEY *key_info)
             {
            +  my_error(ER_TOO_LONG_KEY, MYF(0), 1000);
            +  return nullptr;
               List_iterator<Create_field> it(*create_list);
               Create_field *dup_field, *cf= new (thd->mem_root) Create_field();
               cf->flags|= UNSIGNED_FLAG | LONG_UNIQUE_HASH_FIELD;
            

            I agree that it could be a good idea to simply disable MDEV-371 for the affected collations. A proper fix could be implemented later.

            In addition to what rjasdfiii mentioned, there are also things like this:

            SET NAMES utf8;
            CREATE TABLE t(a TEXT UNIQUE) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;
            --error ER_DUP_ENTRY
            INSERT INTO t VALUES('ⅱ'),('II');
            --error ER_DUP_ENTRY
            INSERT INTO t VALUES('⑽'),('(10)');
            

            Contrary to what I expected, in this example the duplicates are identified correctly.

            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            serg Sergei Golubchik made changes -
            Assignee Nikita Malyavin [ nikitamalyavin ] Alexander Barkov [ bar ]
            bar Alexander Barkov added a comment - - edited

            A similar problem is repeatable with this script:

            CREATE OR REPLACE TABLE t1 (a timestamp, UNIQUE KEY(a) USING HASH) ENGINE=MyISAM;
            SET time_zone='+00:00';
            INSERT INTO t1 VALUES ('2001-01-01 10:20:30');
            SET time_zone='+01:00';
            INSERT INTO t1 SELECT MAX(a) FROM t1;
            SELECT * FROM t1;
            

            +---------------------+
            | a                   |
            +---------------------+
            | 2001-01-01 11:20:30 |
            | 2001-01-01 11:20:30 |
            +---------------------+
            

            Notice that the table contains two equal values, which is wrong.


            bar Alexander Barkov added a comment -

            One more problem:

            UNIQUE HASH keys on TIMESTAMP columns do not work in combination with CHECK TABLE if time_zone changes:

            CREATE OR REPLACE TABLE t1 (a timestamp, UNIQUE KEY(a) USING HASH) ENGINE=MyISAM;
            SET time_zone='+00:00';
            INSERT INTO t1 VALUES ('2001-01-01 10:20:30');
            SET time_zone='+01:00';
            CHECK TABLE t1;
            

            +---------+-------+----------+---------------------------------------------------------+
            | Table   | Op    | Msg_type | Msg_text                                                |
            +---------+-------+----------+---------------------------------------------------------+
            | test.t1 | check | error    | Checksum for key:  1 doesn't match checksum for records |
            | test.t1 | check | error    | Corrupt                                                 |
            +---------+-------+----------+---------------------------------------------------------+
            

            rjasdfiii Rick James added a comment -

            Alexander – Can you (or someone else) explain what is going on to even catch the time_zone difference? Is TZ stored in the table definition? Is the timestamp "converted" at the wrong phase of the flow of the data?

            BTW, I don't get the error when testing with MySQL 8.0.31. Is there an MDEV for where MariaDB diverged from Oracle on TIMESTAMPs?
            Perhaps:

            • MDEV-21249 MariaDB 10.3.10 When referring to bigint to generate timestamp data in the virtual generated column, the value of the generated column does not change when the time zone changes
            • MDEV-16246 – even though this relates to Spider.
            • MariaDB-10.4 New datatype INTERVAL
            • MDEV-16939 Move TIMESTAMP truncation code to Field_timestamp::store_TIME_with_warn
            • 10.1 Galera: --mysql56-temporal-format option to use the MySQL-5.6 low level formats to store TIME, DATETIME and TIMESTAMP types. (MDEV-5528)
            • The test you gave involved MyISAM; what about other engines?
            serg Sergei Golubchik added a comment - - edited

            TIMESTAMP is stored as a number of seconds from the epoch, in UTC. So it is time-zone independent.

            The time zone comes into effect when a timestamp is converted from a number to its YYYY-MM-DD hh:mm:ss string representation. At that point it is converted from UTC to the current time zone.

            The bug is that long unique was using the string representation of a timestamp, not its native numeric representation.

            In fact, it was using the string representation for all columns and data types, but only for TIMESTAMP did it actually cause a bug; for the other types it was just an unnecessary conversion.
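This explanation can be sketched in Python (illustrative only, not MariaDB's hashing code): hashing the rendered string of a TIMESTAMP ties the hash to the session time zone, while hashing the stored epoch value does not.

```python
import hashlib
from datetime import datetime, timedelta, timezone

# 2001-01-01 10:20:30 UTC, as the epoch-seconds value stored on disk.
epoch_seconds = 978344430

def rendered(ts: int, utc_offset_hours: int) -> str:
    """Render the stored value in a given session time zone,
    like SELECTing a TIMESTAMP after SET time_zone=..."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(ts, tz).strftime("%Y-%m-%d %H:%M:%S")

# Buggy behaviour: the same stored value renders differently per time zone,
# so a hash of the string differs between sessions.
assert rendered(epoch_seconds, 0) != rendered(epoch_seconds, 1)
hash_utc  = hashlib.sha1(rendered(epoch_seconds, 0).encode()).hexdigest()
hash_plus = hashlib.sha1(rendered(epoch_seconds, 1).encode()).hexdigest()
assert hash_utc != hash_plus

# Fixed behaviour: hash the native numeric representation, which is the
# same regardless of the session time zone.
hash_native = hashlib.sha1(str(epoch_seconds).encode()).hexdigest()
```

This matches Alexander's repro above: the same epoch value inserted under '+00:00' and '+01:00' produced different hashes, so the duplicate was not detected.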

            bar Alexander Barkov made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            bar Alexander Barkov added a comment - Hello serg , Can you please review a patch: https://github.com/MariaDB/server/commit/5dcde8f652381773b3e337e63254f402e03a1081 ? Thanks.
            bar Alexander Barkov made changes -
            Status In Progress [ 3 ] In Review [ 10002 ]
            bar Alexander Barkov made changes -
            Assignee Alexander Barkov [ bar ] Sergei Golubchik [ serg ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Alexander Barkov [ bar ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            Hello serg,

            Please have a look into a new patch version:

            https://github.com/MariaDB/server/commit/06ce67c644b92a600b842744363e1d0acd8ce7d6

            Thanks.

            bar Alexander Barkov made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            bar Alexander Barkov made changes -
            Assignee Alexander Barkov [ bar ] Sergei Golubchik [ serg ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Alexander Barkov [ bar ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            bar Alexander Barkov made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            serg Sergei Golubchik added a comment - https://github.com/MariaDB/server/commit/38d6d0dc6651c11f39344b0fe7b0cece85e5e5c8 is ok to push
            bar Alexander Barkov made changes -
            Component/s Data types [ 13906 ]
            Fix Version/s 10.4.28 [ 28509 ]
            Fix Version/s 10.5.19 [ 28511 ]
            Fix Version/s 10.6.12 [ 28513 ]
            Fix Version/s 10.7.8 [ 28515 ]
            Fix Version/s 10.8.7 [ 28517 ]
            Fix Version/s 10.9.5 [ 28519 ]
            Fix Version/s 10.10.3 [ 28521 ]
            Fix Version/s 10.11.2 [ 28523 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]

            rjasdfiii Rick James added a comment - https://dba.stackexchange.com/questions/335247/unique-key-on-varchar-is-ignored-for-maria-db complains about not using a UNIQUE index.

            serg Sergei Golubchik added a comment -

            It's a complaint that the optimizer doesn't use an index to optimize the query. This is correct: the optimizer doesn't, as this is not implemented (yet). But the UNIQUE isn't ignored; the uniqueness constraint holds, and it applies on inserts and updates. The optimizer just cannot use it for SELECT.

            rjasdfiii Rick James added a comment -

            A possible test case: https://dba.stackexchange.com/questions/339135/how-can-i-get-duplicate-entry-error-on-a-table-that-only-has-one-record This has a composite index with 2 VARCHAR(400) columns.


            serg Sergei Golubchik added a comment -

            It's a different bug; I've reported it as MDEV-34052.

            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 144370 136244

            People

              Assignee: bar Alexander Barkov
              Reporter: serg Sergei Golubchik
              Votes: 3
              Watchers: 9

