[MDEV-30164] System variable for default collations Created: 2022-12-06  Updated: 2024-01-19  Resolved: 2023-07-17

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 11.2.1

Type: Task Priority: Critical
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-25829 Change default collation to utf8mb4_u... In Review
Problem/Incident
is caused by MDEV-33182 Server assertion fails when trying to... Closed
Relates
relates to MDEV-31748 Reuse Simple_tokenizer Open
relates to MDEV-31739 Document @@character_set_collations (... Closed
relates to MDEV-32192 @@character_set_collations to resolve... Closed

 Description   

We're planning to change default collation for Unicode character sets to uca1400_ai_ci soon.

A hard coded change in the collation library is not desirable, and is hardly even possible.

It should be implemented on the server (rather than collation library) level, to set default collation overrides per-session:

  • For old-to-new replication
  • Some users may still want to use the legacy xxx_general_ci instead of uca1400_ai_ci.
  • We'll likely add more UCA versions in the future, so users will also want to choose the desired default UCA version.
  • Some users will want to have accent sensitive comparison by default, i.e. uca140_as_ci instead of uca1400_ai_ci.

A new system variable @@character_set_collations

The easiest way to solve all these problems is introducing a new system variable, in this format:
'charset1=collation1,charset2=collation2,charset3=collation3'.
That is, a comma separated list of charset/collation pairs.

For example:

SET @@character_set_collations='utf8mb4=utf8mb4_uca1400_ai_ci, utf8mb3=utf8mb4_uca1400_ai_ci';

The new variable will understand contextually typed collations:

# No needs for the charset prefix in the collation name:
SET @@character_set_collations='utf8mb4=uca1400_ai_ci, utf8mb3=uca1400_ai_ci';

The new variable will be global and session. The global value should be settable in the server command line and in my.cnf.

Convenience syntax (postponed)

Originally there was a proposal to add a convenience syntax, however we won't add it under terms of this task:

Let's additionally introduce a nicer syntax to set this variable in a more readable way:

SET CHARACTER SET COLLATION uca1400_ai_ci FOR utf8mb3, utf8mb4, ucs2, utf16, utf32;

Usage example:

SET @@character_set_collations='utf8mb4=uca1400_ai_ci';
CREATE OR REPLACE TABLE t1 (a TEXT CHARACTER SET utf8mb4);
SHOW CREATE TABLE t1;

+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table                                                                                                                                                     |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| t1    | CREATE TABLE `t1` (
  `a` text CHARACTER SET utf8mb4 COLLATE utf8mb4_uca1400_ai_ci DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci |
+-------+------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Notice, the column was created with the collation utf8mb4_uca1400_ai_ci, which is the current session default collation for utf8mb4 as stated in @@character_set_collations.

The new variable will take effect in all cases where a character set is explicitly or implicitly specified without an explicit COLLATE clause, including but not limited to:

  • Column collation
  • Table collation
  • Database collation
  • CHAR(expr USING csname)
  • CONVERT(expr USING csname)
  • CAST(expr AS CHAR CHARACTER SET csname)
  • '' - character string literal
  • _utf8mb3'text' - a character string literal with an introducer
  • _utf8mb3 X'61' - a character string literal with an introducer with hex notation
  • _utf8mb3 0x61 - a character string literal with an introducer with hex hybrid notation
  • @@collation_connection after a SET NAMES without COLLATE


 Comments   
Comment by Alexander Barkov [ 2022-12-15 ]

Here's a new version of the patch:

https://github.com/MariaDB/server/commit/940d028521f02845eda315fc03dc3434920a56a3

Comment by Alexander Barkov [ 2023-03-21 ]

serg, please review a new patch version:

https://github.com/MariaDB/server/commit/8d51c6d234b1730d4ff3b2c1fe7828eeca81998b

Comment by Alexander Barkov [ 2023-07-07 ]

Hello serg,
Please find a new patch version here:
https://github.com/MariaDB/server/commit/d8519e13fafce94447da11b5b895798744491955
and my reply by email.
Thanks.

Comment by Alexander Barkov [ 2023-07-10 ]

serg, a new patch version is here:
https://github.com/MariaDB/server/commit/5102d86beae6b7464b1271081f8003ae81850a5b
Please also find my reply by email.
Thanks.

Generated at Thu Feb 08 10:14:08 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.