Currently MariaDB has two utf8 character sets:
- utf8 that can store 1 to 3 byte characters and implements Unicode BMP range U+0000..U+FFFF
This character set is also available under the name "utf8mb3".
- utf8mb4 that can store 1 to 4 byte characters and implements the full Unicode standard range U+0000..U+10FFFF.
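The difference between the two character sets can be illustrated with a small Python sketch that measures UTF-8 encoded lengths. This is only an illustration outside the server; the helper names utf8_len() and fits_utf8mb3() are hypothetical and not part of MariaDB.

```python
# Illustration only: UTF-8 encoded length per character.
# utf8_len() and fits_utf8mb3() are hypothetical helper names.

def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def fits_utf8mb3(ch: str) -> bool:
    """True if the character is in the BMP, i.e. needs at most 3 bytes."""
    return utf8_len(ch) <= 3

# One sample per encoded length: U+0041, U+00E9, U+20AC, U+1F600
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}: {utf8_len(ch)} byte(s), "
          f"fits utf8mb3: {fits_utf8mb3(ch)}")
```

Only the last sample (U+1F600, outside the BMP) needs 4 bytes and therefore cannot be stored in the 3-byte character set.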
In the long term we want the name utf8 to mean the full-featured UTF-8.
We'll do a few preparatory steps:
1. Change the main name of the 3-byte character set from "utf8" to "utf8mb3" and make "utf8" an alias for "utf8mb3". This will change all SHOW and INFORMATION_SCHEMA output to display utf8mb3 instead of utf8, and will make mysqldump dump utf8mb3 instead of just utf8.
2. Add a new server option, say --utf8-is-utf8mb3, which will be true by default, but the DBA will be able to change it to false and thus make "utf8" mean "utf8mb4".
3. A few releases later we'll change --utf8-is-utf8mb3 to be "false" by default.
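The intended name resolution in these steps can be modelled roughly as follows. This is a Python sketch of the behaviour, not server code: resolve_charset() is a hypothetical name, and its utf8_is_utf8mb3 parameter stands in for the proposed --utf8-is-utf8mb3 option.

```python
# Sketch of the proposed alias resolution; not actual server code.
# resolve_charset() is a hypothetical name; utf8_is_utf8mb3 models the
# proposed --utf8-is-utf8mb3 option (true by default in steps 1 and 2).

def resolve_charset(name: str, utf8_is_utf8mb3: bool = True) -> str:
    """Map a user-supplied character set name to its main name."""
    name = name.lower()
    if name == "utf8":
        # "utf8" is only an alias; its target depends on the option.
        return "utf8mb3" if utf8_is_utf8mb3 else "utf8mb4"
    return name

print(resolve_charset("utf8"))                         # default behaviour
print(resolve_charset("utf8", utf8_is_utf8mb3=False))  # after the DBA flips it
```

Step 3 then amounts to changing the default value of the option, without touching the resolution logic.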
Another option is to implement this SQL standard statement:
Originally, there were two reasons to have two utf8 implementations:
- The CHAR column needs less space in case of utf8mb3. InnoDB can store CHAR in a packed format, so the space needed is the same for utf8mb3 and utf8mb4 on the same data. Other engines could probably do the same trick to save space: store CHAR in a packed format with trailing spaces removed.
- Before 10.5, filesort was faster for utf8mb3 than for utf8mb4, because utf8mb3 needs to reserve fewer bytes for one weight. Now with Varun's improvements in filesort (e.g. MDEV-21580), the sort buffer can store the original string instead of its weight array, so filesort should be equally fast for utf8mb3 and utf8mb4 on equal data sets.
So we could have just one "utf8", with the following aliases:
- utf8mb4 is just a simple alias for the "new utf8"
- utf8mb3 is also an alias for the "new utf8", but with an automatic constraint added
After the upgrade, SHOW output for old tables with the 3-byte utf8 could look something like this:
where is_bmp_only() is a new built-in function that tests whether a string contains only Basic Multilingual Plane (BMP) characters. It returns:
- TRUE if the string contains only BMP characters U+0000..U+FFFF, which fit into 3-byte utf8 sequences
- FALSE if the string has characters outside of the BMP, i.e. U+10000..U+10FFFF, which require 4 bytes in utf8 encoding.
The exact API for the constraint function may be different, e.g. it could test for an arbitrary Unicode character range (not only BMP vs non-BMP). This could be useful for other purposes as well.
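The semantics described for is_bmp_only(), and the more general range test, can be sketched in Python as follows. This only models the intended result of the proposed SQL function. The name in_unicode_range() is hypothetical, and treating an empty string as BMP-only is an assumption.

```python
# Model of the proposed constraint function; not the server implementation.
# in_unicode_range() is a hypothetical generalization of is_bmp_only().

def in_unicode_range(s: str, lo: int, hi: int) -> bool:
    """True if every code point of s lies in lo..hi (inclusive)."""
    return all(lo <= ord(ch) <= hi for ch in s)

def is_bmp_only(s: str) -> bool:
    """True if s contains only BMP characters U+0000..U+FFFF."""
    # Assumption: an empty string satisfies the constraint.
    return in_unicode_range(s, 0x0000, 0xFFFF)

print(is_bmp_only("caf\u00e9 \u20ac"))    # BMP characters only
print(is_bmp_only("smile \U0001F600"))    # contains U+1F600, outside the BMP
```

With the range form, the BMP test becomes just one instance of a general check, which matches the idea of supporting arbitrary Unicode ranges.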
- It's not clear how to handle the database and the table level clause CHARACTER SET utf8mb3:
The table level CHARACTER SET for "t1" could probably automatically add the constraint to all columns that would have been implicitly created as utf8mb3.
- TODO: add upgrade details
- TODO: add replication details