  1. MariaDB Server
  2. MDEV-38743

Document the implications of changing the default collation

Details

    • Can result in unexpected behaviour

    Description

      The documenting task is now handled by:
      https://mariadbcorp.atlassian.net/browse/DOCS-6015

      In MDEV-25829 and MDEV-19123, the default character set and collation were changed from the long-time default latin1 and latin1_swedish_ci to utf8mb4 and utf8mb4_uca1400_ai_ci.

      As far as I understand, this has the following implications:

      • Because the collation utf8mb4_uca1400_ai_ci was not implemented before MDEV-27009, replication from MariaDB Server 11.8 or later versions to MariaDB Server 10.6 will be impacted.
      • The storage overhead of CHAR and VARCHAR columns may increase significantly. This includes persistent storage when the columns contain latin1 characters outside ASCII, each of which takes two bytes in utf8mb4 instead of one.
      • Some comparison operations are significantly slower with utf8mb4_uca1400_ai_ci than with the old default collation latin1_swedish_ci, which worked on fixed-width character encoding and was accent-sensitive. MDEV-34427 is just one example.
      • Some applications that may have relied on the old default collation could be broken; see MDEV-36286 for an example.
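
The storage point above can be illustrated outside the server by comparing encoded byte lengths. This is only a sketch: the sample strings and the helper name `encoded_sizes` are made up for illustration, and it measures raw encoding size only. How much space MariaDB actually reserves or writes for a CHAR or VARCHAR column also depends on the storage engine and row format.

```python
# Compare the byte length of the same text in latin1 and in UTF-8
# (UTF-8 is the encoding behind utf8mb4).
samples = ["hello", "Mäkelä", "Straße", "crème brûlée"]  # illustrative strings only


def encoded_sizes(text):
    """Return (latin1_bytes, utf8_bytes) for a string that latin1 can represent."""
    return len(text.encode("latin-1")), len(text.encode("utf-8"))


for s in samples:
    latin1_len, utf8_len = encoded_sizes(s)
    print(f"{s!r}: latin1={latin1_len} bytes, utf8={utf8_len} bytes")
```

Pure ASCII text has the same size in both encodings; every non-ASCII latin1 character adds one extra byte in UTF-8.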

      It would be useful if https://mariadb.com/docs/release-notes/community-server/11.8/what-is-mariadb-118 documented how to configure MariaDB Server 11.8 or later to use the same default character set and collation as 11.4 or older releases. It would also be useful to include a warning that such configuration is advisable when attempting replication to MariaDB Server 10.6.

      Why we changed the default character set from latin1 to utf8mb4

      • To help people all around the world use MariaDB Server out of the box, without additional configuration of the character set and collation. The old defaults worked well only for Western European languages.
      • People all around the world use supplementary characters such as emoji. The old default character set, latin1, cannot store emoji.
      • For better compatibility with MySQL 8.0.

      Why we changed the default collations for Unicode character sets from xxx_general_ci to xxx_uca1400_ai_ci

      • The old default collation xxx_general_ci (e.g. utf8mb4_general_ci) considered all supplementary characters (those with a Unicode code point >= U+10000) equal to each other. The new default collation xxx_uca1400_ai_ci (e.g. utf8mb4_uca1400_ai_ci) handles supplementary characters correctly.
      • The old default collation xxx_general_ci is a simplified collation. It does not support things like character expansions and character contractions. The new default collation xxx_uca1400_ai_ci provides a better comparison and sorting order because it supports expansions and contractions from DUCET (Default Unicode Collation Element Table). For example, the German character ß (U+00DF LATIN SMALL LETTER SHARP S) is correctly compared as equal to the combination of the two letters "ss".
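
To make the idea of an expansion concrete, here is a toy model in Python. It is not the Unicode Collation Algorithm and not how MariaDB implements utf8mb4_uca1400_ai_ci: the one-entry `EXPANSIONS` table and the function `toy_primary_key` are hypothetical, written only to show how mapping ß to two elements, and ignoring accents and case, makes strings compare as equal.

```python
import unicodedata

# Hypothetical, hand-written expansion table. The real DUCET data is far larger.
EXPANSIONS = {"ß": "ss"}


def toy_primary_key(text):
    """Build a simplified accent- and case-insensitive comparison key."""
    parts = []
    for ch in text:
        # Expansion: one character becomes several comparison elements.
        ch = EXPANSIONS.get(ch, ch)
        # Accent-insensitive: decompose, then drop combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        base = "".join(c for c in decomposed if not unicodedata.combining(c))
        # Case-insensitive: fold case.
        parts.append(base.casefold())
    return "".join(parts)


print(toy_primary_key("Straße") == toy_primary_key("STRASSE"))
print(toy_primary_key("café") == toy_primary_key("CAFE"))
# Supplementary characters (code point >= U+10000) keep distinct keys here.
print(ord("😀") >= 0x10000, toy_primary_key("😀") != toy_primary_key("😁"))
```

In this sketch the two emoji stay distinct, which is the behaviour the new collation is described as providing; a collation that treats all supplementary characters as equal would report them as the same.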

      Restoring the old defaults

      This change of the defaults will make the data files incompatible with 10.6 (because 10.6 is missing MDEV-27009) and potentially slightly increase the storage and CPU consumption.

      To return to the old defaults, edit your my.cnf file as follows:

      [mysqld]
      character-set-server=latin1
      collation-server=latin1_swedish_ci
      character-set-collations=''
      

            People

              maxmether Max Mether
              marko Marko Mäkelä