MariaDB Server / MDEV-7128

Configuring charsets or collations as utf8 yields surprising result and leads to data loss


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0
    • Fix Version/s: 10.0
    • Component/s: None
    • Labels:
      None

      Description

      Configuring databases and collations to be utf8 and utf8_unicode_ci respectively leads to the very surprising result that real UTF-8 is NOT used. Instead the server uses a custom variant of UTF-8 (utf8mb3, at most 3 bytes per character) that cannot represent the full range of Unicode characters: anything above U+FFFF, i.e. outside the Basic Multilingual Plane, does not fit.

      This of course leads to hard-to-debug problems, as one doesn't even suspect that such a problem exists.
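
      To make the limit concrete, here is a minimal Python sketch that lists which characters of a string need 4 bytes in real UTF-8 and therefore cannot be stored under a 3-byte limit. It only models the limit in plain Python and does not talk to a server; the helper names `utf8_byte_length` and `chars_needing_utf8mb4` are made up for this illustration.

```python
# Illustrative only: models the 3-byte limit in plain Python.
# The helper names are hypothetical, not part of MariaDB or any library.

BMP_MAX = 0xFFFF  # highest code point that fits in 3 UTF-8 bytes


def utf8_byte_length(ch):
    """Number of bytes real UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


def chars_needing_utf8mb4(text):
    """Return the characters in text that lie above U+FFFF."""
    return [ch for ch in text if ord(ch) > BMP_MAX]


if __name__ == "__main__":
    sample = "Häcker € \U0001F600"
    for ch in sample:
        print(f"U+{ord(ch):04X} needs {utf8_byte_length(ch)} byte(s)")
    print("needs 4-byte UTF-8:", chars_needing_utf8mb4(sample))
```

      Running such a check over user input before it reaches the database is one way to notice affected data early.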

      The problem is that the name utf8 is reused to mean something that is not UTF-8 according to the relevant standards (RFC 3629, the Unicode Consortium).

      Instead one has to work around the problem by a) actually knowing about it and b) configuring something like this:

      -- snip --
      [mysqld]
      # switch to 4 byte utf-8 as default
      # See: https://mathiasbynens.be/notes/mysql-utf8mb4
      init_connect  = "SET NAMES utf8mb4"
      collation_server = utf8mb4_unicode_ci
      character_set_server = utf8mb4
       
      [client]
      default-character-set = utf8mb4
       
      [mysql]
      default-character-set = utf8mb4
      -- snap --
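
      A rough way to check an existing option file against the settings above is sketched below, using Python's standard `configparser`. This is an assumption-laden sketch: the function name `find_charset_problems` and the `EXPECTED` table are made up here, the input must not have indented option lines, and directives such as `!includedir` are not handled.

```python
import configparser

# Expected values, taken from the snippet above. Option names are normalized
# to use underscores so that "character-set-server" and
# "character_set_server" are treated as the same option.
EXPECTED = {
    ("mysqld", "character_set_server"): "utf8mb4",
    ("mysqld", "collation_server"): "utf8mb4_unicode_ci",
    ("client", "default_character_set"): "utf8mb4",
    ("mysql", "default_character_set"): "utf8mb4",
}


def find_charset_problems(cnf_text):
    """Return a list of human-readable problems found in an option file."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # bare options without "= value"
        interpolation=None,    # do not treat "%" specially
        strict=False,          # tolerate repeated sections and options
    )
    parser.optionxform = lambda name: name.lower().replace("-", "_")
    parser.read_string(cnf_text)

    problems = []
    for (section, key), wanted in EXPECTED.items():
        actual = parser.get(section, key, fallback=None)
        if actual is None:
            problems.append(f"[{section}] {key} is not set (want {wanted})")
            continue
        actual = actual.strip().strip("\"'")
        if actual != wanted:
            problems.append(f"[{section}] {key} = {actual} (want {wanted})")
    return problems


if __name__ == "__main__":
    sample = "[mysqld]\ncharacter-set-server = utf8\n"
    for problem in find_charset_problems(sample):
        print(problem)
```

      The check only reads the file; what the running server actually uses still has to be verified on the server itself.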

      What do I expect as a user: if I configure MariaDB to use utf8, I expect it to do so, and not to use some custom subset of Unicode that automatically discards data my users entered as UTF-8, trusting it to be handled correctly.

      Options to handle this:
      a) Just actually use UTF-8 if utf8 is specified. Every valid utf8mb3 byte sequence is also valid utf8mb4 (actual UTF-8) with the same meaning, so existing data would just work. People who need utf8mb3 can configure their system accordingly and likely know what it is anyway if they actually require it.
      b) If the name of the encoding cannot simply be redefined, a lengthy deprecation cycle can be started right now. Deprecate all the utf8 names that don't specify whether they mean utf8mb3 or utf8mb4, and emit warnings for the next several years (I'd think something like 4, or whatever floats your boat) asking users to fully specify the wanted encoding. After that, make the utf8 name invalid for the same number of years to get it out of configurations. Only then (now we're 8 years into the future) can the name utf8 be re-enabled as a new encoding that actually is UTF-8 and doesn't surprise users.
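
      The compatibility claim behind option a) can be illustrated with a small hand-written decoder that accepts only 1- to 3-byte sequences. This is a simplified model, not MariaDB's implementation: the name `decode_utf8mb3` is made up here, and overlong-encoding and surrogate checks are left out. Whatever this restricted decoder accepts, a full UTF-8 decoder reads identically; the reverse fails at the first 4-byte sequence.

```python
def decode_utf8mb3(data):
    """Decode bytes with a UTF-8 variant limited to 3 bytes per character.

    Simplified model for illustration (hypothetical helper): it skips
    overlong-encoding and surrogate checks.
    """
    chars = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                 # 0xxxxxxx: 1 byte
            size, code = 1, lead
        elif 0xC0 <= lead < 0xE0:       # 110xxxxx: 2 bytes
            size, code = 2, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:       # 1110xxxx: 3 bytes
            size, code = 3, lead & 0x0F
        else:                           # 4-byte lead or stray continuation
            raise ValueError(
                f"byte 0x{lead:02X} at offset {i} is not a valid lead byte "
                "in a 3-byte variant"
            )
        tail = data[i + 1:i + size]
        if len(tail) != size - 1 or any(b & 0xC0 != 0x80 for b in tail):
            raise ValueError(f"truncated or malformed sequence at offset {i}")
        for b in tail:
            code = (code << 6) | (b & 0x3F)
        chars.append(chr(code))
        i += size
    return "".join(chars)


if __name__ == "__main__":
    stored = "Häcker €".encode("utf-8")
    # Same bytes, same text, whichever decoder reads them.
    print(decode_utf8mb3(stored) == stored.decode("utf-8"))
```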

      Well, a) and b) are certainly extremes, so I suspect that you will choose something in between. But right now the situation is very bad, as nobody who configures utf8 or a utf8 *_ci collation actually suspects that they're not getting what they want. Just switching utf8 to mean utf8mb4 will, I suspect, do the right thing for a lot of people, so I would like to suggest choosing an option that is closer to a) than to b).

              People

              Assignee:
              bar Alexander Barkov
              Reporter:
              dwt Martin Häcker
              Votes:
              0
              Watchers:
              5
