[MDEV-10759] Fix Aria to support 2-byte collation IDs Created: 2016-09-07  Updated: 2016-12-02  Resolved: 2016-12-02

Status: Closed
Project: MariaDB Server
Component/s: Character Sets, Storage Engine - Aria
Fix Version/s: 10.1.20

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None


 Description   

The Aria engine currently supports only 8-bit collation IDs.

As of version 10.2.2 we have 95 collations with 2-byte IDs, according to this query:

SELECT COUNT(*) FROM INFORMATION_SCHEMA.COLLATIONS WHERE id>255;

A big number of collations with 2-byte IDs was added by MDEV-9711 NO PAD collations.

It would be nice to fix before the 10.2.2 release.



 Comments   
Comment by Sergei Golubchik [ 2016-10-09 ]

What's the effect of Aria not supporting these collations?

Comment by Alexander Barkov [ 2016-10-10 ]

Aria internally compares according to a collation with ID equal to id%256.

Btw, In some cases (e.g in case of NO PAD collations) a query result can even be correct:

  • NO PAD collations IDs are assigned as ID of the corresponding PAD collation plus 0x400
  • Instead of NO PAD collation, Aria erroneously uses the corresponding PAD collation
  • False positives are filtered out in the SQL level using the real NO PAD collation
Comment by Alexander Barkov [ 2016-12-02 ]

This is the correct result using MyISAM:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_croatian_ci, KEY(a)) ENGINE=MyISAM;
INSERT INTO t1 VALUES ('na'),('nj'),('nz'),('Z');
SELECT * FROM t1 ORDER BY a;

+------+
| a    |
+------+
| na   |
| nz   |
| nj   |
| Z    |
+------+

Notice:

  • nj is sorted after nz, as nj is a special letter in Croatian
  • Z is sorted at the end, because it's a case insensitive collation

If I do the same in ARIA, instead of utf8_croatian_ci (ID=576) it uses armscii8_bin (ID=576%256=64)

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a VARCHAR(10) CHARACTER SET utf8 COLLATE utf8_croatian_ci, KEY(a)) ENGINE=ARIA;
INSERT INTO t1 VALUES ('na'),('nj'),('nz'),('Z');
SELECT * FROM t1 ORDER BY a;

+------+
| a    |
+------+
| Z    |
| na   |
| nj   |
| nz   |
+------+

Notice:

  • nj moved before nz
  • Z moved before na
Comment by Alexander Barkov [ 2016-12-02 ]

Here's the mapping that happens when the high byte disappears:

SELECT t1.ID,t1.COLLATION_NAME,t2.ID,t2.COLLATION_NAME
FROM INFORMATION_SCHEMA.COLLATIONS t1
  JOIN INFORMATION_SCHEMA.COLLATIONS t2
  ON (t1.ID%256=t2.ID)
WHERE t1.ID>255;

+-----+---------------------+-----+----------------------+
| ID  | COLLATION_NAME      | ID  | COLLATION_NAME       |
+-----+---------------------+-----+----------------------+
| 577 | utf8_myanmar_ci     |  65 | ascii_bin            |
| 578 | utf8_thai_520_w2    |  66 | cp1250_bin           |
| 576 | utf8_croatian_ci    |  64 | armscii8_bin         |  -- This mapping is used in the test above
| 640 | ucs2_croatian_ci    | 128 | ucs2_unicode_ci      |
| 641 | ucs2_myanmar_ci     | 129 | ucs2_icelandic_ci    |
| 642 | ucs2_thai_520_w2    | 130 | ucs2_latvian_ci      |
| 736 | utf32_croatian_ci   | 224 | utf8mb4_unicode_ci   |
| 737 | utf32_myanmar_ci    | 225 | utf8mb4_icelandic_ci |
| 738 | utf32_thai_520_w2   | 226 | utf8mb4_latvian_ci   |
| 672 | utf16_croatian_ci   | 160 | utf32_unicode_ci     |
| 673 | utf16_myanmar_ci    | 161 | utf32_icelandic_ci   |
| 674 | utf16_thai_520_w2   | 162 | utf32_latvian_ci     |
| 608 | utf8mb4_croatian_ci |  96 | cp932_bin            |
| 609 | utf8mb4_myanmar_ci  |  97 | eucjpms_japanese_ci  |
| 610 | utf8mb4_thai_520_w2 |  98 | eucjpms_bin          |
+-----+---------------------+-----+----------------------+

Comment by Michael Widenius [ 2016-12-02 ]

Fixed by using same approach that was done some time ago in MyISAM

Generated at Thu Feb 08 07:44:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.