[MCOL-4684] Collation aggregation Created: 2021-04-20  Updated: 2023-07-01

Status: Open
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 5.6.1, 6.1.1
Fix Version/s: 23.10

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

When some operation arguments have different collations, MariaDB uses so called collation aggregation. For example:

SELECT CONCAT(a,b) FROM t1;
SELECT * FROM t1 WHERE a=b;

Suppose that collations of "a" and "b" are different. Collation aggregation allows to determine:

  • which collation to use for an operation result data type, like in CONCAT(a,b)
  • which collation to use to perform a comparison, like in WHERE a=b

Collation aggregation is based on the collation strength of the arguments. The collation strength is called "collation derivation" (in early SQL standard version it was called "collation coercibility").

For more information about collation strength, see here:
https://mariadb.com/kb/en/coercibility/

Some details on how collation aggregation works can be found in MySQL user manual:
https://dev.mysql.com/doc/refman/8.0/en/charset-collation-coercibility.html

Collation aggregation is used at least in:

  • UNION
  • CASE (for WHEN and THEN elements) and its abberiations and variations: IF(), COALESCE(), IFNULL(), NULLIF()
  • Comparison operators <,<=,<>,=,<=>,=>,>
  • IN
  • BETWEEN
  • LIKE
  • RLIKE
  • MATCH ... AGAINST
  • STRCMP()
  • REGEX_INSTR()
  • FIELD()
  • FIND_IN_SET()
  • LOCATE()
  • JSON_ARRAY()
  • JSON_OBJECTAGG()
  • CONCAT()
  • CONCAT_WS()
  • REPLACE()
  • REGEXP_REPLACE()
  • REGEXP_SUBSTR()
  • INSERT()
  • SUBSTR_INDEX()
  • ELT()
  • MAKE_SET()
  • PAD()
  • EXPORT_SET()
  • GROUP_CONCAT()
  • A number of XML functions inside XPath implementation used in EXTRACT_VALUE() and UpdateXML()

For MariaDB compatibility, we should implement collation aggregation in MCS.



 Comments   
Comment by Alexander Barkov [ 2021-04-23 ]

MariaDB also performs collation aggregation for a number of functions with a single formal argument of a string data type. That's needed to convert actual numeric arguments (passed to formal string arguments) to character_set_connection. For compatibility, ColumnStore should probably do the same thing:

  • REVERSE()
  • LOWER()
  • UPPER()
  • LEFT()
  • RIGHT()
  • SUBSTR()
  • TRIM()
  • SOUNDEX()
  • REPEAT()
  • STR_TO_DATE()

Example:

SET character_set_connection=ucs2;
SELECT HEX(LEFT(10,1)), CHARSET(LEFT(10,1));

+-----------------+---------------------+
| HEX(LEFT(10,1)) | CHARSET(LEFT(10,1)) |
+-----------------+---------------------+
| 0031            | ucs2                |
+-----------------+---------------------+

Generated at Thu Feb 08 02:52:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.