Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25829

Change default Unicode collation to uca1400_ai_ci

Details

    Description

      Under terms of this task will change the default collation for Unicode character sets (utf8mb3, utf8mb4, ucs2, utf16, utf32) from xxx_general_ci to uca1400_ai_ci.

      We won't change the server default character set from latin1 to utf8mb4 yet. It will be done separately under terms of MDEV-19123.

      Therefore, we narrowed the original scope of this MDEV, which was about both:

      1. Changing the default collation for Unicode character sets from xxx_general_ci to a modern Unicode collation, with proper support for SMP characters (including emoji)
      2. Changing the server default character set from latin1 to utf8mb4

      to just #1.

      Under terms of this task we won't change the collation of INFORMATION_SCHEMA columns. They'll remain utf8mb3_general_ci. They'll probably also change eventually, but not now. Just one big change at a time!

      See also the section "The original task description from the reporter" below.

      Implementation details

      There were two different implementation ways to change the default collation:

      1. Change the default of the server system variable @@character_set_collations, recently added to 11.2 by MDEV-30164, to:

         "utf8mb3=uca1400_ai_ci,utf8mb4=uca1400_ai_ci,ucs2=uca1400_ai_ci,utf16=uca1400_ai_ci,utf32=uca1400_ai_ci";
        

      2. Change the hard-coded defaults inside the collation library (which resides in the /strings sub-directory of MariaDB sources)

      We'll go #1.

      The hard-coded defaults inside the collation library will remain xxx_general_ci for all Unicode character sets, for the following reasons:

      1. It's easier to do this way, because uca1400_ai_ci needs some initialization at the server startup, while xxx_general_ci works just out of the box. At the very early startup stage (before the collation library initialization), the server needs a functional CHARSET_INFO pointer to handle utf8 data.
      2. Hard-coding uca1400_ai_ci won't be good for all users anyway. Some users might prefer uca1400_as_ci (for accent sensitive comparison), or prefer xxx_general_cl for backward compatibility.

      Changing collation derivation for CAST(expr AS CHAR), CONVERT(expr USING cs), user variables

      Changing the default collation will cause "Illegal mix of collations" errors in two important scenarios:

      • Comparing a column with a non-default collation to a result of CAST/CONVERT
      • Comparing a column with a non-default collation to a string user variable

      This is especially important for queries to INFORMATION_SCHEMA tables, whose columns use utf8mb3_general_ci:

      SET character_set_collations='utf8mb3=uca1400_ai_ci' /*Emulate MDEV-25829 */;
      SET NAMES utf8mb3;
      SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME=CONCAT('a',CAST(1 AS CHAR CHARACTER SET utf8));
      

      ERROR 1267 (HY000): Illegal mix of collations (utf8mb3_general_ci,IMPLICIT) and (utf8mb3_uca1400_ai_ci,IMPLICIT) for operation '='
      

      SET character_set_collations='utf8mb3=uca1400_ai_ci' /*Emulate MDEV-25829 */;
      SET NAMES utf8mb3;
      SET @a='test';
      SELECT * FROM INFORMATION_SCHEMA.COLUMNS WHERE COLUMN_NAME=@a;
      

      ERROR 1267 (HY000): Illegal mix of collations (utf8mb3_general_ci,IMPLICIT) and (utf8mb3_uca1400_ai_ci,IMPLICIT) for operation '='
      

      To avoid these conflicts let's change collation derivations as follows:

      • for CAST/CONVERT: from IMPLICIT to a new derivation level CAST between SYSCONST and COERCIBLE
      • for string user variables: from IMPLICIT to COERCIBLE

      So after the change, derivations for various expressions look as follows:

          
            DERIVATION_IGNORABLE= 7, // Explicit NULL
          
            DERIVATION_NUMERIC= 6,   // Numbers in string context,
                                     // Numeric user variables
                                     // CAST(numeric_expr AS CHAR)
          
            DERIVATION_COERCIBLE= 5, // Literals, string user variables
          
            DERIVATION_CAST= 4,      // CAST(string_expr AS CHAR),
                                     // CONVERT(string_expr USING cs)
          
            DERIVATION_SYSCONST= 3,  // utf8 metadata functions, e.g. user(), database()
            DERIVATION_IMPLICIT= 2,  // Table columns, SP variables
            DERIVATION_NONE= 1,      // A mix (e.g. CONCAT) of two differrent collations
            DERIVATION_EXPLICIT= 0   // An explicit COLLATE clause
      

      The original task description from the reporter

      Since MySQL 8.0, they changed the default collation to utf8mb4_0900_ai_ci:

      MySQL includes character set support that enables you to store data using a variety of character sets and perform comparisons according to a variety of collations. The default MySQL server character set and collation are utf8mb4 and utf8mb4_0900_ai_ci, but you can specify character sets at the server, database, table, column, and string literal levels.

      https://dev.mysql.com/doc/refman/8.0/en/charset.html

      Frankly, if anything, this change came way too late. I've personally seen hundreds of development hours be wasted on MySQL databases created with wrong (default) collations and only through bug reports ('my characters get mangled') was it found out and fixed. Over. And over. Again. Because every dev that did 'create table whatever (...)' would forget to set the collation and first (back in the day) get a Swedish (why??) collation and then by mistake switch to the (crippled/broken) utf8, thinking it solved it... And then finally got wise and changed to utf8mb4, just to have the whole story start over on the next table someone created.

      I love MySQL, but man this part was a royal mess-up and it cost the world hundreds of millions if not billions of wasted dev hours. I mean this hit everyone running a MySQL server. And now with those emoji I mean I don't even dare to guess at the number of $ wasted.

      So imagine my surprise when I read

      In MariaDB, the default character set is latin1, and the default collation is latin1_swedish_ci

      https://mariadb.com/kb/en/setting-character-sets-and-collations/

      OMG, please say it isn't so!

      Please! Please change this. Like, NOW. This is costing sooooo much dev hours. Soooo much billions of dollars wasted! Devs are stupid ok? They don't understand character sets and collations ok? Never have, never will. This whole Unicode thing has been a personal interest for me for over a decade and I still don't fully grasp it. But one thing I do know for sure and that is: `latin1_swedish_ci` makes NO SENSE whatsoever, to anyone. This is not a good setting. For no one. Not even for Swedish people. Because it fits only 255 characters! I mean really? Do databases still get created for which this default actually makes any sense?

      Please fix it. Please! For love of the world. For peace. To end poverty. Out of good citizenship. Because it's just a few lines in the default server config. For the lulz. For whatever reason, but just do it. Please!

      Attachments

        Issue Links

          Activity

            The default depends on the distribution you take MariaDB from.
            In rpm packages from SUSE, as far as I know, the default is utf8. Compiled in.
            In all deb packages the default is utf8mb4_general_ci, set it my.cnf

            In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications. Given how few complains we ever got about it, we might better take some time thinking about how to do this transition in a least disruptive way possible. Instead of flipping the switch now.

            In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how.

            serg Sergei Golubchik added a comment - The default depends on the distribution you take MariaDB from. In rpm packages from SUSE, as far as I know, the default is utf8. Compiled in. In all deb packages the default is utf8mb4_general_ci, set it my.cnf In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications. Given how few complains we ever got about it, we might better take some time thinking about how to do this transition in a least disruptive way possible. Instead of flipping the switch now. In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how.
            cybernet2u cybernet2u added a comment -

            any news ?

            cybernet2u cybernet2u added a comment - any news ?
            StijnDeWitt Stijn de Witt added a comment - - edited

            > In rpm packages from SUSE, as far as I know, the default is utf8. Compiled in.

            A default of `utf8` might actually be worse, since people are getting the 3-byte broken MySQL version of UTF-8 and they will not understand why most characters are stored correctly but some get mangled. It takes deep understanding of the history of both Unicode and MySQL to understand that in MySQL, `utf8` means something different than the rest of the world means by it.

            > In all deb packages the default is utf8mb4_general_ci, set it my.cnf

            That's a sane choice.

            > In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications.

            I have heard this argument before and I find it rather weak. It might break existing swedish applications. That never need their text content to be read in any other part of the world. And for which 255 characters suffice...

            For applications from any other place in the world, or that need more than 255 characters, this default makes no sense whatsoever. The idea that there might be many applications for which a swedish 255 character set is ok is imho laughable. Either apps stick to ASCII, in which case basically all encodings are ok for it, or they need support for more characters and then basically there is only one suitable encoding and that is utf8mb4. All others are broken out of the box.

            In the meantime, this default continues to break all new applications. The backward compatibility argument is moot There is no compatibility with these legacy encodings. They only work for a small subset of machines in a particular region of the world and they break for all other regions. They are a left over of days gone by when people simply had no internet and transferred files to each other using floppy disks. With these legacy encodings, people from France cannot read files from Germany. People from England cannot read files from Sweden etc. There simply are ZERO applications in today's world for which latin1 makes sense. And there are ZERO aplications for which utf8mb4 is not the best and in fact ONLY encoding that makes sense. All other encodings will break some characters. Only utf8mb4 works under all circumstances.

            > we might better take some time thinking about how to do this transition

            With all due respect...

            "the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things."

            I wrote this on may 5th, 2010. over 12 years ago! And it's still true. Please switch the default. Millions of devs will be better off for it.

            The way MySQL (and now by inheritance MariaDB) handles Unicode is so problematic, I wrote a post to warn devs just about that in 2015. 7 years ago!

            > In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how.

            I'm glad you agree. And sorry for my rant. But frankly I have been watching the encoding drama unfold literally for well over a decade now and I've heard the backward compatibility argument so many times. So I get triggered by it

            I suggest you have a look at what MySQL did. You just make a new major release and splatter this thing all over the release notes. Possibly you check whether an explicit encoding is set on import of data and print a warning if not. But in my experience there are only two scenarios with MySQL:

            • People set the encoding explicitly -> they are not affected
            • People forget to set the encoding -> they get latin1 -> their app is broken until they fix it

            There are no apps using latin1 that are not broken. That is a blanket statement I know, but I challenge you to show me such an app. The nature of these legacy encodings is that they are incompatible out of the box. And someone that did not address their encoding issues still after all these years does not deserve to be helped / assisted by maintaining this broken default. New devs using MariaDB in their first project deserve to be helped.

            StijnDeWitt Stijn de Witt added a comment - - edited > In rpm packages from SUSE, as far as I know, the default is utf8. Compiled in. A default of `utf8` might actually be worse, since people are getting the 3-byte broken MySQL version of UTF-8 and they will not understand why most characters are stored correctly but some get mangled. It takes deep understanding of the history of both Unicode and MySQL to understand that in MySQL, `utf8` means something different than the rest of the world means by it. > In all deb packages the default is utf8mb4_general_ci, set it my.cnf That's a sane choice. > In binary tarballs and rpms from mariadb.org the default is, indeed, latin1_swedish_ci. Changing it is a rather drastic incompatible change that will break existing applications. I have heard this argument before and I find it rather weak. It might break existing swedish applications. That never need their text content to be read in any other part of the world. And for which 255 characters suffice... For applications from any other place in the world, or that need more than 255 characters, this default makes no sense whatsoever. The idea that there might be many applications for which a swedish 255 character set is ok is imho laughable. Either apps stick to ASCII, in which case basically all encodings are ok for it, or they need support for more characters and then basically there is only one suitable encoding and that is utf8mb4. All others are broken out of the box. In the meantime, this default continues to break all new applications. The backward compatibility argument is moot There is no compatibility with these legacy encodings. They only work for a small subset of machines in a particular region of the world and they break for all other regions. They are a left over of days gone by when people simply had no internet and transferred files to each other using floppy disks. With these legacy encodings, people from France cannot read files from Germany. People from England cannot read files from Sweden etc. There simply are ZERO applications in today's world for which latin1 makes sense. And there are ZERO aplications for which utf8mb4 is not the best and in fact ONLY encoding that makes sense. All other encodings will break some characters. Only utf8mb4 works under all circumstances. > we might better take some time thinking about how to do this transition With all due respect... "the hardest thing about Unicode is not Unicode itself, but all the legacy encodings used in other software and files creeping into your project and breaking things." I wrote this on may 5th, 2010. over 12 years ago! And it's still true. Please switch the default. Millions of devs will be better off for it. The way MySQL (and now by inheritance MariaDB) handles Unicode is so problematic, I wrote a post to warn devs just about that in 2015. 7 years ago! > In general, we agree it has to be done. It's not 1995 anymore. The question only is when to do it and how. I'm glad you agree. And sorry for my rant. But frankly I have been watching the encoding drama unfold literally for well over a decade now and I've heard the backward compatibility argument so many times. So I get triggered by it I suggest you have a look at what MySQL did. You just make a new major release and splatter this thing all over the release notes. Possibly you check whether an explicit encoding is set on import of data and print a warning if not. But in my experience there are only two scenarios with MySQL: People set the encoding explicitly -> they are not affected People forget to set the encoding -> they get latin1 -> their app is broken until they fix it There are no apps using latin1 that are not broken. That is a blanket statement I know, but I challenge you to show me such an app. The nature of these legacy encodings is that they are incompatible out of the box. And someone that did not address their encoding issues still after all these years does not deserve to be helped / assisted by maintaining this broken default. New devs using MariaDB in their first project deserve to be helped.
            StijnDeWitt Stijn de Witt added a comment -

            I want to add some more. Sorry about bothering you but maybe it is ok because you might learn something?

            There basically are 3 major problems that in MariaDB compound together to create a real drama that costs the world billions of dollars:

            1: Very few developers understand character encodings
            2. The default in one of the world's most popular data storage platforms is broken
            3. The `utf8` encoding in that platform is also broken!

            So what happens is this:

            • A developer with little understanding of encodings installs MariaDB and connects his app to it
            • He delivers the app and soon, customers complain about mangled characters
            • He discovers that the default encoding is latin1_swedish and reads up on Unicode
            • He switches the encoding to `utf8`, oblivious of the fact that that is in fact ALSO BROKEN!
            • He spends valuable time trying to figure out how to fix the existing data with mangled text
            • He delivers the fix to the customer
            • For a while, all seems well
            • Reports start to come in about emoji's being mangled
            • The dev reads up on the Basic Multilingual Plane and the history of MySQL/MariaDB
            • To his dismay, the dev realizes that `utf8` in MariaDB is something different altogether than what he assumed it was
            • The dev finally changes the encoding to `utf8mb4`.
            • A new attempt at fixing the broken data is made
            • The second fix is delivered to the customer
            • Finally the app works!

            These are some of the most expensive changes that are possible. And every app that is made has to go through this process if the dev was not an encoding expert. These issues involve schema migrations, data migrations and the devs to read up a lot and do a lot of research. They demand of these devs that they become Unicode and MariaDB experts just to be able to understand what is happening. And every single dev has to go through this again and again. I have issues on my current project (logged by me) saying 'we have to change the encoding for these tables' and then a long list of tables. In many occasions, the server defaults can not be easily changed by the dev (managed databases). So we just have to tell every dev: 'if you create a table, set the encoding'. All because of this one default setting that is wrong.

            People don't understand this. And that's ok. From your comments on how changing the default might break applications, I realize that you don't really understand this as well. And that's also ok. No one should be forced to learn about this stuff. There are better things we can do with our time. But maybe you can trust me and my experience and just change the default and be done with it. I promise you it will not break any applications that were not already broken and it will prevent the drama I described above for all new applications. And it will allow devs to not understand encodings and focus on their business logic instead of requiring them to become encoding/MariaDB experts. The world will be better off for it.

            StijnDeWitt Stijn de Witt added a comment - I want to add some more. Sorry about bothering you but maybe it is ok because you might learn something? There basically are 3 major problems that in MariaDB compound together to create a real drama that costs the world billions of dollars: 1: Very few developers understand character encodings 2. The default in one of the world's most popular data storage platforms is broken 3. The `utf8` encoding in that platform is also broken! So what happens is this: A developer with little understanding of encodings installs MariaDB and connects his app to it He delivers the app and soon, customers complain about mangled characters He discovers that the default encoding is latin1_swedish and reads up on Unicode He switches the encoding to `utf8`, oblivious of the fact that that is in fact ALSO BROKEN! He spends valuable time trying to figure out how to fix the existing data with mangled text He delivers the fix to the customer For a while, all seems well Reports start to come in about emoji's being mangled The dev reads up on the Basic Multilingual Plane and the history of MySQL/MariaDB To his dismay, the dev realizes that `utf8` in MariaDB is something different altogether than what he assumed it was The dev finally changes the encoding to `utf8mb4`. A new attempt at fixing the broken data is made The second fix is delivered to the customer Finally the app works! These are some of the most expensive changes that are possible. And every app that is made has to go through this process if the dev was not an encoding expert. These issues involve schema migrations, data migrations and the devs to read up a lot and do a lot of research. They demand of these devs that they become Unicode and MariaDB experts just to be able to understand what is happening. And every single dev has to go through this again and again. I have issues on my current project (logged by me) saying 'we have to change the encoding for these tables' and then a long list of tables. In many occasions, the server defaults can not be easily changed by the dev (managed databases). So we just have to tell every dev: 'if you create a table, set the encoding'. All because of this one default setting that is wrong. People don't understand this. And that's ok. From your comments on how changing the default might break applications, I realize that you don't really understand this as well. And that's also ok. No one should be forced to learn about this stuff. There are better things we can do with our time. But maybe you can trust me and my experience and just change the default and be done with it. I promise you it will not break any applications that were not already broken and it will prevent the drama I described above for all new applications. And it will allow devs to not understand encodings and focus on their business logic instead of requiring them to become encoding/MariaDB experts. The world will be better off for it.
            serg Sergei Golubchik added a comment - - edited

            as a status update — this issue is related or depends on a bunch of other issues, notably:

            this is work in progress at the moment

            serg Sergei Golubchik added a comment - - edited as a status update — this issue is related or depends on a bunch of other issues, notably: MDEV-19123 Change default charset to utf8mb4 MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4 MDEV-27490 Allow full utf8mb4 for identifiers MDEV-27009 Add UCA-14.0.0 collations this is work in progress at the moment
            StijnDeWitt Stijn de Witt added a comment -

            If I read this correctly, you guys have fixed it!

            That deserves a 'Hurray!!' and congratulations to all involved.
            Thank you for making the world a better place!

            StijnDeWitt Stijn de Witt added a comment - If I read this correctly, you guys have fixed it! That deserves a 'Hurray!!' and congratulations to all involved. Thank you for making the world a better place!
            bar Alexander Barkov added a comment - - edited

            StijnDeWitt, your original guess in 2022 was not correct.
            Changing default collations for character sets is possible only starting from MDEV-30164 (11.2.1).
            So one can set the default collation for utf8mb4 to say utf8mb4_uca1400_ai_ci

            This task is still open. It's about changing the hard coded default to utf8mb4_uca1400_ai_ci.

            We need to make sure:

            • to do all possible optimization for utf8mb4_uca1400_ai_ci, to make sure it does not degrade much in performance comparing to utf8mb4_general_ci
            • we change defaults for all Unicode character sets at the same time
            • to add uca1400 collations for utf16be
            bar Alexander Barkov added a comment - - edited StijnDeWitt , your original guess in 2022 was not correct. Changing default collations for character sets is possible only starting from MDEV-30164 (11.2.1). So one can set the default collation for utf8mb4 to say utf8mb4_uca1400_ai_ci This task is still open. It's about changing the hard coded default to utf8mb4_uca1400_ai_ci. We need to make sure: to do all possible optimization for utf8mb4_uca1400_ai_ci, to make sure it does not degrade much in performance comparing to utf8mb4_general_ci we change defaults for all Unicode character sets at the same time to add uca1400 collations for utf16be
            bar Alexander Barkov added a comment - - edited

            The easiest solution to implement this task would be to change the default value for GLOBAL.character_set_collations from empty to:

            utf8mb3=utf8mb3_uca1400_ai_ci,ucs2=ucs2_uca1400_ai_ci,utf8mb4=utf8mb4_uca1400_ai_ci,utf16=utf16_uca1400_ai_ci,utf32=utf32_uca1400_ai_ci
            

            If we go this way, we should make sure that the implementation of @@character_set_collations does not have any performance problems with a non-empty value.

            bar Alexander Barkov added a comment - - edited The easiest solution to implement this task would be to change the default value for GLOBAL.character_set_collations from empty to: utf8mb3=utf8mb3_uca1400_ai_ci,ucs2=ucs2_uca1400_ai_ci,utf8mb4=utf8mb4_uca1400_ai_ci,utf16=utf16_uca1400_ai_ci,utf32=utf32_uca1400_ai_ci If we go this way, we should make sure that the implementation of @@character_set_collations does not have any performance problems with a non-empty value.
            cybernet2u cybernet2u added a comment -

            I would like to thank developers for listening

            cybernet2u cybernet2u added a comment - I would like to thank developers for listening

            Hello serg,

            Can you please review patches for this MDEV?

            There are three patches. They are in this branch:

            https://github.com/MariaDB/server/commits/bb-11.4-bar-MDEV-25829

            bar Alexander Barkov added a comment - Hello serg , Can you please review patches for this MDEV? There are three patches. They are in this branch: https://github.com/MariaDB/server/commits/bb-11.4-bar-MDEV-25829
            bar Alexander Barkov added a comment - Hello serg . Please find a set of new patches for review here: Step#1: https://github.com/MariaDB/server/commit/58c1cccc7c8b40ba6d86d80760c6aa3aa965f1d3 Step#2: https://github.com/MariaDB/server/commit/85c659860c0e4ea2408f1a47c71ee590f32ddc9f Step#3: https://github.com/MariaDB/server/commit/1413de15df01602a5fe799f8e136cf9c047da14e Also, replied by email. Thanks.
            serg Sergei Golubchik added a comment - https://github.com/MariaDB/server/commit/58c1cccc7c8b40ba6d86d80760c6aa3aa965f1d3 https://github.com/MariaDB/server/commit/85c659860c0e4ea2408f1a47c71ee590f32ddc9f https://github.com/MariaDB/server/commit/1413de15df01602a5fe799f8e136cf9c047da14e are ok to push

            bar, I've pushed a fixup commit into your branch, it's clearly named "fixup", so please rename it or squash with the main commit, whatever you prefer

            serg Sergei Golubchik added a comment - bar , I've pushed a fixup commit into your branch, it's clearly named "fixup", so please rename it or squash with the main commit, whatever you prefer
            alice Alice Sherepa added a comment -

            bb-11.5-bar-MDEV-25829 af1418118e0462e0f5173b621ab8b42acab78446 is ok to push into 11.5

            alice Alice Sherepa added a comment - bb-11.5-bar- MDEV-25829 af1418118e0462e0f5173b621ab8b42acab78446 is ok to push into 11.5
            thomaslauria Thomas Lauria added a comment -

            Hi, due this task we got some problems after updating MariaDB.
            I could reduce the problem to the following sets of commands.
            The question is now: Is that by purpose or is that a bug?

            SET NAMES utf8mb4;
            SET character_set_collations='utf8mb4=uca1400_ai_ci';
             
            CREATE TEMPORARY TABLE `test` (
              `value` text DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
             
            INSERT INTO test VALUES ('');
             
            SELECT @a:=value FROM test;
             
            SELECT COLLATION(@a), COERCIBILITY(CONCAT('a', @a, 'b')), COLLATION(CONCAT('a', @a, 'b'));
            -- results see below
             
            -- now change to utf8mb3
            SET NAMES utf8mb3;
            SET character_set_collations='utf8mb3=uca1400_ai_ci';
            SELECT COLLATION(@a), COERCIBILITY(CONCAT('a', @a, 'b')), COLLATION(CONCAT('a', @a, 'b'));
            -- results see below
            

            Results:

            MariaDB [translate5]> SELECT COLLATION(@a), COERCIBILITY(CONCAT('a', @a, 'b')), COLLATION(CONCAT('a', @a, 'b'));
            +--------------------+------------------------------------+---------------------------------+
            | COLLATION(@a)      | COERCIBILITY(CONCAT('a', @a, 'b')) | COLLATION(CONCAT('a', @a, 'b')) |
            +--------------------+------------------------------------+---------------------------------+
            | utf8mb4_unicode_ci |                                  1 | utf8mb4_bin                     |
            +--------------------+------------------------------------+---------------------------------+
            and
            MariaDB [translate5]> SELECT COLLATION(@a), COERCIBILITY(CONCAT('a', @a, 'b')), COLLATION(CONCAT('a', @a, 'b'));
            +--------------------+------------------------------------+---------------------------------+
            | COLLATION(@a)      | COERCIBILITY(CONCAT('a', @a, 'b')) | COLLATION(CONCAT('a', @a, 'b')) |
            +--------------------+------------------------------------+---------------------------------+
            | utf8mb4_unicode_ci |                                  5 | utf8mb4_unicode_ci              |
            +--------------------+------------------------------------+---------------------------------+
            
            

            My main question is:
            Why is in case of NAMES = utf8mb4 (same charset as in table) the COERCIBILITY of the CONCAT 1 and therefore the COLLATION utf8mb4_bin and when using utf8mb3 (a different charset) COERCIBILITY is 5 and there fore the COLLATION utf8mb4_unicode_ci.

            Is that somehow by purpose or a bug?

            Thanks.

            thomaslauria Thomas Lauria added a comment - Hi, due this task we got some problems after updating MariaDB. I could reduce the problem to the following sets of commands. The question is now: Is that by purpose or is that a bug? SET NAMES utf8mb4; SET character_set_collations= 'utf8mb4=uca1400_ai_ci' ;   CREATE TEMPORARY TABLE `test` ( `value` text DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE =utf8mb4_unicode_ci;   INSERT INTO test VALUES ( '' );   SELECT @a:=value FROM test;   SELECT COLLATION(@a), COERCIBILITY(CONCAT( 'a' , @a, 'b' )), COLLATION(CONCAT( 'a' , @a, 'b' )); -- results see below   -- now change to utf8mb3 SET NAMES utf8mb3; SET character_set_collations= 'utf8mb3=uca1400_ai_ci' ; SELECT COLLATION(@a), COERCIBILITY(CONCAT( 'a' , @a, 'b' )), COLLATION(CONCAT( 'a' , @a, 'b' )); -- results see below Results: MariaDB [translate5]> SELECT COLLATION(@a), COERCIBILITY(CONCAT( 'a' , @a, 'b' )), COLLATION(CONCAT( 'a' , @a, 'b' )); + --------------------+------------------------------------+---------------------------------+ | COLLATION(@a) | COERCIBILITY(CONCAT( 'a' , @a, 'b' )) | COLLATION(CONCAT( 'a' , @a, 'b' )) | + --------------------+------------------------------------+---------------------------------+ | utf8mb4_unicode_ci | 1 | utf8mb4_bin | + --------------------+------------------------------------+---------------------------------+ and MariaDB [translate5]> SELECT COLLATION(@a), COERCIBILITY(CONCAT( 'a' , @a, 'b' )), COLLATION(CONCAT( 'a' , @a, 'b' )); + --------------------+------------------------------------+---------------------------------+ | COLLATION(@a) | COERCIBILITY(CONCAT( 'a' , @a, 'b' )) | COLLATION(CONCAT( 'a' , @a, 'b' )) | + --------------------+------------------------------------+---------------------------------+ | utf8mb4_unicode_ci | 5 | utf8mb4_unicode_ci | + --------------------+------------------------------------+---------------------------------+ My main question is: Why is in case of NAMES = utf8mb4 (same charset as in table) the COERCIBILITY of the CONCAT 1 and therefore the COLLATION utf8mb4_bin and when using utf8mb3 (a different charset) COERCIBILITY is 5 and there fore the COLLATION utf8mb4_unicode_ci. Is that somehow by purpose or a bug? Thanks.
            bar Alexander Barkov added a comment - - edited

            For simplicity I removed the third argument to CONCAT, as it has the same collation characteristics as the first one and therefore does not affect the result collation resolution.

            In case of utf8mb3:

            SET NAMES utf8mb3;
            SET character_set_collations='utf8mb3=uca1400_ai_ci';
            CREATE OR REPLACE TEMPORARY TABLE t1 (
              `value` text DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
            INSERT INTO test VALUES ('');
            SELECT @a:=value FROM t1;
            SELECT
              COLLATION(@a), COERCIBILITY(@a),
              COLLATION('a'), COERCIBILITY('a'),
              COLLATION(CONCAT('a', @a)), COERCIBILITY(CONCAT('a', @a))\G
            

            *************************** 1. row ***************************
                            COLLATION(@a): utf8mb4_unicode_ci
                         COERCIBILITY(@a): 5
                           COLLATION('a'): utf8mb3_uca1400_ai_ci
                        COERCIBILITY('a'): 5
               COLLATION(CONCAT('a', @a)): utf8mb4_unicode_ci
            COERCIBILITY(CONCAT('a', @a)): 5
            

            The result collation resolution is done as follows:

            • There are two argument to CONCAT with different character sets (and collations), with equal collation coercibility levels 5 (COERCIBLE).
            • The utf8mb4 side wins, because utf8mb4 is a superset for utf8mb3.
            • The collation and the coercibility of the CONCAT are copied from the first argument `@a`

            In case of utf8mb4:

            SET NAMES utf8mb4;
            SET character_set_collations='utf8mb4=uca1400_ai_ci';
            CREATE OR REPLACE TEMPORARY TABLE t1 (
              `value` text DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
            INSERT INTO test VALUES ('');
            SELECT @a:=value FROM t1;
            SELECT
              COLLATION(@a), COERCIBILITY(@a),
              COLLATION('a'), COERCIBILITY('a'),
              COLLATION(CONCAT('a', @a)), COERCIBILITY(CONCAT('a', @a))\G
            

            *************************** 1. row ***************************
                            COLLATION(@a): utf8mb4_unicode_ci
                         COERCIBILITY(@a): 5
                           COLLATION('a'): utf8mb4_uca1400_ai_ci
                        COERCIBILITY('a'): 5
               COLLATION(CONCAT('a', @a)): utf8mb4_bin
            COERCIBILITY(CONCAT('a', @a)): 1
            

            The result collation resolution is done as follows:

            • There are two arguments with equal character sets, different collations, equal collation coercibility levels.
            • None of the side wins.
            • The result collation is chosen as a neutral collation utf8mb4_bin, with a special collation coercibility level 1, which means "NONE" (or "CONFLICT"). Note, this conflicting state can be resolved to a non-conflicting state by adding an explicit COLLATE clause after the CONCAT.

            Observations

            The result collation resolution itself seems to work as expected in both cases (with utf8mb3 and utf8mb4).

            However this CONCAT() should rather return an error when used in the SELECT list without an explicit COLLATE clause following the CONCAT (which would resolve the conflict).

            bar Alexander Barkov added a comment - - edited For simplicity I removed the third argument to CONCAT, as it has the same collation characteristics as the first one and therefore does not affect the result collation resolution. In case of utf8mb3: SET NAMES utf8mb3; SET character_set_collations= 'utf8mb3=uca1400_ai_ci' ; CREATE OR REPLACE TEMPORARY TABLE t1 ( `value` text DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE =utf8mb4_unicode_ci; INSERT INTO test VALUES ( '' ); SELECT @a:=value FROM t1; SELECT COLLATION(@a), COERCIBILITY(@a), COLLATION( 'a' ), COERCIBILITY( 'a' ), COLLATION(CONCAT( 'a' , @a)), COERCIBILITY(CONCAT( 'a' , @a))\G *************************** 1. row *************************** COLLATION(@a): utf8mb4_unicode_ci COERCIBILITY(@a): 5 COLLATION('a'): utf8mb3_uca1400_ai_ci COERCIBILITY('a'): 5 COLLATION(CONCAT('a', @a)): utf8mb4_unicode_ci COERCIBILITY(CONCAT('a', @a)): 5 The result collation resolution is done as follows: There are two argument to CONCAT with different character sets (and collations), with equal collation coercibility levels 5 (COERCIBLE). The utf8mb4 side wins, because utf8mb4 is a superset for utf8mb3. The collation and the coercibility of the CONCAT are copied from the first argument `@a` In case of utf8mb4: SET NAMES utf8mb4; SET character_set_collations= 'utf8mb4=uca1400_ai_ci' ; CREATE OR REPLACE TEMPORARY TABLE t1 ( `value` text DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE =utf8mb4_unicode_ci; INSERT INTO test VALUES ( '' ); SELECT @a:=value FROM t1; SELECT COLLATION(@a), COERCIBILITY(@a), COLLATION( 'a' ), COERCIBILITY( 'a' ), COLLATION(CONCAT( 'a' , @a)), COERCIBILITY(CONCAT( 'a' , @a))\G *************************** 1. row *************************** COLLATION(@a): utf8mb4_unicode_ci COERCIBILITY(@a): 5 COLLATION('a'): utf8mb4_uca1400_ai_ci COERCIBILITY('a'): 5 COLLATION(CONCAT('a', @a)): utf8mb4_bin COERCIBILITY(CONCAT('a', @a)): 1 The result collation resolution is done as follows: There are two arguments with equal character sets, different collations, equal collation coercibility levels. None of the side wins. The result collation is chosen as a neutral collation utf8mb4_bin, with a special collation coercibility level 1, which means "NONE" (or "CONFLICT"). Note, this conflicting state can be resolved to a non-conflicting state by adding an explicit COLLATE clause after the CONCAT. Observations The result collation resolution itself seems to work as expected in both cases (with utf8mb3 and utf8mb4). However this CONCAT() should rather return an error when used in the SELECT list without an explicit COLLATE clause following the CONCAT (which would resolve the conflict).
            bar Alexander Barkov added a comment - - edited

            Just checked with MySQL-8.3.0. It also allows to use an expression with the coercibility level of 1 (NONE) in the SELECT list:

            drop table if exists t1, t2;
            create  table t1 (
              a varchar(10) character set utf8mb4 collate utf8mb4_general_ci,
              b varchar(10) character set utf8mb4 collate utf8mb4_unicode_ci
            );
            create table t2 as select concat(a,b) from t1;
            show create table t2;
            

            +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | Table | Create Table                                                                                                                                                              |
            +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            | t2    | CREATE TABLE `t2` (
              `concat(a,b)` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL
            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci |
            +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
            

            Which I'd say is rather a bug. But this behavior has been this way for many years in MySQL and then MariaDB.

            Not sure if we should change this behavior in MariaDB at this point (i.e. raise an error instead of returning an utf8mb4_bin result).

            bar Alexander Barkov added a comment - - edited Just checked with MySQL-8.3.0. It also allows to use an expression with the coercibility level of 1 (NONE) in the SELECT list: drop table if exists t1, t2; create table t1 ( a varchar (10) character set utf8mb4 collate utf8mb4_general_ci, b varchar (10) character set utf8mb4 collate utf8mb4_unicode_ci ); create table t2 as select concat(a,b) from t1; show create table t2; +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | Table | Create Table | +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ | t2 | CREATE TABLE `t2` ( `concat(a,b)` varchar(20) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci | +-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ Which I'd say is rather a bug. But this behavior has been this way for many years in MySQL and then MariaDB. Not sure if we should change this behavior in MariaDB at this point (i.e. raise an error instead of returning an utf8mb4_bin result).
            bar Alexander Barkov added a comment - - edited

            Possible workarounds to avoid having such a conflicting state in a SELECT list expression:

            • Use an explicit COLLATE with one of the argument:

              SELECT CONCAT('a' COLLATE utf8mb4_unicode_ci, @a);
              SELECT CONCAT('a', @a  COLLATE utf8mb4_unicode_ci);
              

            • Use an explicit COLLATE following the CONCAT, to resolve the conflicting state:

              SELECT CONCAT('a', @a)  COLLATE utf8mb4_unicode_ci;
              

            bar Alexander Barkov added a comment - - edited Possible workarounds to avoid having such a conflicting state in a SELECT list expression: Use an explicit COLLATE with one of the argument: SELECT CONCAT( 'a' COLLATE utf8mb4_unicode_ci, @a); SELECT CONCAT( 'a' , @a COLLATE utf8mb4_unicode_ci); Use an explicit COLLATE following the CONCAT, to resolve the conflicting state: SELECT CONCAT( 'a' , @a) COLLATE utf8mb4_unicode_ci;
            ycp Yuchen Pei added a comment - - edited

            greenman
            It looks like the kb documentation on COERCIBILITY needs an update since with this MDEV a new derivation DERIVATION_CAST is added between DERIVATION_SYSCONST and DERIVATION_COERCIBLE, and COERCIBLE applies to both literal strings and user variables now, and possible other changes.

            https://mariadb.com/kb/en/coercibility/

            modified   sql/sql_type.h
            @@ -3019,13 +3019,46 @@ static inline my_repertoire_t &operator|=(my_repertoire_t &a,
             
             enum Derivation
             {
            -  DERIVATION_IGNORABLE= 6,
            -  DERIVATION_NUMERIC= 5,
            -  DERIVATION_COERCIBLE= 4,
            +  DERIVATION_IGNORABLE= 7, // Explicit NULL
            +
            +  /*
            +    Explicit or implicit conversion from numeric/temporal data to string:
            +    - Numbers/temporals in string context
            +    - Numeric user variables
            +    - CAST(numeric_or_temporal_expr AS CHAR)
            +  */
            +  DERIVATION_NUMERIC= 6,
            +
            +  /*
            +    - String literals
            +    - String user variables
            +  */
            +  DERIVATION_COERCIBLE= 5,
            +
            +  /*
            +    String cast and conversion functions:
            +    - BINARY(expr)
            +    - CAST(string_expr AS CHAR)
            +    - CONVERT(expr USING cs)
            +  */
            +  DERIVATION_CAST= 4,
            +
            +  /*
            +    utf8 metadata functions:
            +    - DATABASE()
            +    - CURRENT_ROLE()
            +    - USER()
            +  */
               DERIVATION_SYSCONST= 3,
            +
            +  /*
            +    - Table columns
            +    - SP variables
            +    - BINARY(expr) and CAST(expr AS BINARY)
            +  */
               DERIVATION_IMPLICIT= 2,
            -  DERIVATION_NONE= 1,
            -  DERIVATION_EXPLICIT= 0
            +  DERIVATION_NONE= 1,      // A mix (e.g. CONCAT) of two differrent collations
            +  DERIVATION_EXPLICIT= 0   // An explicit COLLATE clause
             };
            

            UPDATE: also MDEV-35041 added another derivation

            ycp Yuchen Pei added a comment - - edited greenman It looks like the kb documentation on COERCIBILITY needs an update since with this MDEV a new derivation DERIVATION_CAST is added between DERIVATION_SYSCONST and DERIVATION_COERCIBLE, and COERCIBLE applies to both literal strings and user variables now, and possible other changes. https://mariadb.com/kb/en/coercibility/ modified sql/sql_type.h @@ -3019,13 +3019,46 @@ static inline my_repertoire_t &operator|=(my_repertoire_t &a, enum Derivation { - DERIVATION_IGNORABLE= 6, - DERIVATION_NUMERIC= 5, - DERIVATION_COERCIBLE= 4, + DERIVATION_IGNORABLE= 7, // Explicit NULL + + /* + Explicit or implicit conversion from numeric/temporal data to string: + - Numbers/temporals in string context + - Numeric user variables + - CAST(numeric_or_temporal_expr AS CHAR) + */ + DERIVATION_NUMERIC= 6, + + /* + - String literals + - String user variables + */ + DERIVATION_COERCIBLE= 5, + + /* + String cast and conversion functions: + - BINARY(expr) + - CAST(string_expr AS CHAR) + - CONVERT(expr USING cs) + */ + DERIVATION_CAST= 4, + + /* + utf8 metadata functions: + - DATABASE() + - CURRENT_ROLE() + - USER() + */ DERIVATION_SYSCONST= 3, + + /* + - Table columns + - SP variables + - BINARY(expr) and CAST(expr AS BINARY) + */ DERIVATION_IMPLICIT= 2, - DERIVATION_NONE= 1, - DERIVATION_EXPLICIT= 0 + DERIVATION_NONE= 1, // A mix (e.g. CONCAT) of two differrent collations + DERIVATION_EXPLICIT= 0 // An explicit COLLATE clause }; UPDATE: also MDEV-35041 added another derivation

            People

              bar Alexander Barkov
              StijnDeWitt Stijn de Witt
              Votes:
              1 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.