  1. MariaDB Server
  2. MDEV-19123

Change default charset from latin1 to utf8mb4

Details

    Description

      The goal of this task is to change the default global variables to the 4-byte UTF-8 character set, namely:

      • character_set_client: from utf8 to utf8mb4
      • character_set_database: from latin1 to utf8mb4
      • character_set_server: from latin1 to utf8mb4
      • character_set_results: from utf8 to utf8mb4
      • character_set_connection: from utf8 to utf8mb4
      • collation_database: from latin1_swedish_ci to utf8mb4_uca1400_ai_ci
      • collation_server: from latin1_swedish_ci to utf8mb4_uca1400_ai_ci

      MySQL changed this default in 8.0.1.
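
      The variable changes listed above can be written down as data and checked against a server. This is a minimal sketch in Python, assuming the `actual` mapping was obtained separately (for example from the output of `SHOW GLOBAL VARIABLES`). The names `NEW_DEFAULTS` and `diff_defaults` are hypothetical and not part of MariaDB.

```python
# New defaults proposed in this task, copied from the list above.
NEW_DEFAULTS = {
    "character_set_client": "utf8mb4",
    "character_set_database": "utf8mb4",
    "character_set_server": "utf8mb4",
    "character_set_results": "utf8mb4",
    "character_set_connection": "utf8mb4",
    "collation_database": "utf8mb4_uca1400_ai_ci",
    "collation_server": "utf8mb4_uca1400_ai_ci",
}


def diff_defaults(actual):
    """Return {name: (actual_value, expected_value)} for every mismatch."""
    return {
        name: (actual.get(name), expected)
        for name, expected in NEW_DEFAULTS.items()
        if actual.get(name) != expected
    }


# Hypothetical values, as a server still using the old server-level
# defaults might report them.
old_server = dict(
    NEW_DEFAULTS,
    character_set_server="latin1",
    collation_server="latin1_swedish_ci",
)
print(diff_defaults(old_server))
```

      An empty result means the reported values already match the proposed defaults.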

      There are some questions which should be discussed before/while working on this task:

      • YES: Should we change the default collation for utf8mb4 from utf8mb4_general_ci to uca1400_ai_ci? The problem is that utf8mb4_general_ci is very bad for non-BMP characters - it considers all non-BMP characters as equal to each other. See MDEV-25829
      • YES: Should we reassign the UTF8 Linux locale in the client from utf8mb3 to utf8mb4, or to whatever the server side uses as the alias for "utf8"? See MDEV-19123
      • YES: Should we change system_charset_info from utf8mb3 to utf8mb4 and allow non-BMP characters in identifiers? See MDEV-27490.
        • If so, table name to file name encoding should be extended to support non-BMP characters. See MDEV-27490
        • system charset cannot be utf8mb4 until we fix the collation as above
      • YES: Should we change numerous INFORMATION_SCHEMA columns from utf8mb3 to utf8mb4?
        • they should be in the system_charset_info, as they store identifiers
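
      The non-BMP problem raised in the first question can be illustrated with a toy model. This is only a sketch, not MariaDB's implementation: it assumes that every supplementary (non-BMP) character receives one shared weight (0xFFFD here), and it approximates BMP weights with Python's `str.upper()`, which is a simplification. The function name `general_ci_key` is hypothetical.

```python
def general_ci_key(text):
    """Toy sort key: one weight per character, non-BMP collapsed to 0xFFFD."""
    weights = []
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Assumption: all supplementary characters share one weight.
            weights.append(0xFFFD)
        else:
            # Simplification: case-insensitive weight via uppercase mapping.
            weights.extend(ord(c) for c in ch.upper())
    return tuple(weights)


pile_of_poo = "\U0001F4A9"
grinning_face = "\U0001F600"

# Two different emoji compare as equal under this model.
print(general_ci_key(pile_of_poo) == general_ci_key(grinning_face))
# BMP characters compare case-insensitively but remain distinct letters.
print(general_ci_key("a") == general_ci_key("A"))
print(general_ci_key("a") == general_ci_key("b"))
```

      Under such a scheme a unique index or an equality comparison cannot tell two different emoji apart, which is why a UCA-based collation is proposed as the default.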

          Activity

            otto Otto Kekäläinen added a comment -

            Roger that, full UTF-8 (=utf8mb4) is indeed needed to support emojis (e.g. U+1F4A9 PILE OF POO 💩) and other characters in the full UTF-8 spec.

            For followers of this Jira, the post https://mathiasbynens.be/notes/mysql-utf8mb4 is a good explanation of the topic (though it does not mention utf8mb3).
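
            To make the point about emojis concrete, here is a small Python sketch that counts UTF-8 bytes per character. It relies only on the fact that utf8mb3 stores at most three bytes per character, so it cannot hold code points above U+FFFF. The helper name `fits_in_utf8mb3` is hypothetical.

```python
def fits_in_utf8mb3(text):
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# One sample each for 1-, 2-, 3- and 4-byte UTF-8 sequences.
for ch in ("a", "\u00e4", "\u20ac", "\U0001F4A9"):
    print(
        f"U+{ord(ch):04X}: {len(ch.encode('utf-8'))} byte(s), "
        f"fits in utf8mb3: {fits_in_utf8mb3(ch)}"
    )
```

            Only the last sample, U+1F4A9, needs four bytes and therefore requires utf8mb4.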
            darioseidl Dario Seidl added a comment -

            There are some questions which should be discussed before/while working on this task:

            Should we introduce a new collation (as a replacement for utf8mb4_general_ci) and make it default for utf8mb4? The problem is that utf8mb4_general_ci is very bad for non-BMP characters - it considers all non-BMP characters as equal to each other.

            I agree, it should really not be `utf8mb4_general_ci`, that one sacrifices correctness for performance and that shouldn't be the default. There's `utf8mb4_unicode_ci` which correctly implements Unicode sorting, and the newer `utf8mb4_unicode_520_ci` with updated weight keys. MySQL also has the even newer `utf8mb4_0900_ai_ci` which doesn't exist in MariaDB yet, I think.


            Roel Roel Van de Paar added a comment -

            bar Hi! During testing, I am seeing a lot of these:

            bb-11.6-bar-MDEV-19123 11.6.0 98ebe0a3afc432bb903fd10dbfb0a68572df0f67 (Debug, UBASAN)

            2024-06-12 22:46:22 0 [Note] InnoDB: Shutdown completed; log sequence number 1009586; transaction id 1395
            2024-06-12 22:46:22 0 [Note] /test/MDEV19123_UBASAN_MD120624-mariadb-11.6.0-linux-x86_64-dbg/bin/mariadbd: Shutdown complete

            =================================================================
            ==1260209==ERROR: LeakSanitizer: detected memory leaks

            Direct leak of 48 byte(s) in 1 object(s) allocated from:
                #0 0x560841c37407 in __interceptor_malloc (/test/MDEV19123_UBASAN_MD120624-mariadb-11.6.0-linux-x86_64-dbg/bin/mariadbd+0x89c4407)
                #1 0x14e663f7e53c  (<unknown module>)

            Indirect leak of 53 byte(s) in 1 object(s) allocated from:
                #0 0x560841bde547 in __interceptor_strdup (/test/MDEV19123_UBASAN_MD120624-mariadb-11.6.0-linux-x86_64-dbg/bin/mariadbd+0x896b547)
                #1 0x14e6640a71b6  (<unknown module>)

            Indirect leak of 24 byte(s) in 1 object(s) allocated from:
                #0 0x560841c37407 in __interceptor_malloc (/test/MDEV19123_UBASAN_MD120624-mariadb-11.6.0-linux-x86_64-dbg/bin/mariadbd+0x89c4407)
                #1 0x14e6640a71f4  (<unknown module>)

            Indirect leak of 8 byte(s) in 1 object(s) allocated from:
                #0 0x560841c37407 in __interceptor_malloc (/test/MDEV19123_UBASAN_MD120624-mariadb-11.6.0-linux-x86_64-dbg/bin/mariadbd+0x89c4407)
                #1 0x14e6640a7249  (<unknown module>)

            SUMMARY: AddressSanitizer: 133 byte(s) leaked in 4 allocation(s).
            240612 22:46:22 [ERROR] mysqld got signal 6 ;

            i.e. a memory leak observed after shutdown, but with a broken stack. The output is the same in all cases (observed quite a few times).

            None of them has proved reproducible yet, even when using many instances to counter the sporadic nature of the issue. My impression thus far is that it is caused by the patch - I am continuing to research.

            Otherwise, I have observed no feature-related bugs during my part of testing (I am assisting Ramesh with testing) and things look very stable from my end.

            ramesh Ramesh Sivaraman added a comment - ok to push

            bar Alexander Barkov added a comment - Rebasing to the latest 11.6

            People

              bar Alexander Barkov
              diego dupin Diego Dupin
