[MDEV-27154] allkeys.txt based tests for Unicode-4.0.0 and 5.2.0 Created: 2021-12-02  Updated: 2021-12-20  Resolved: 2021-12-02

Status: Closed
Project: MariaDB Server
Component/s: Character Sets, Tests
Fix Version/s: 10.8.0

Type: Task Priority: Major
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Blocks
blocks MDEV-27009 Add UCA-14.0.0 collations Closed
Relates
relates to MDEV-21263 Allow packed values of non-sorted fie... Closed
relates to MDEV-27307 main.ctype_utf8mb4_uca_allkeys tests ... Closed

Description

Let's add MTR tests which will load the default weight table allkeys.txt from Unicode-4.0.0 and Unicode-5.2.0 to check that the collations utf8mb4_unicode_ci and utf8mb4_unicode_520_ci work as expected.

These new tests will cover all characters in the range U+0000..U+10FFFF and will help make sure that nothing breaks with the upcoming changes.

The idea is to calculate the weights for every Unicode character in two ways:

1. Using WEIGHT_STRING() - this is the weight that the MariaDB collation returns for the character.
2. Parsing the explicit weights from the corresponding line in allkeys.txt (or calculating the implicit weight if the character has no line there) - this is the weight that the collation is supposed to return according to the Unicode standard.

The two values must be equal for every character.
If the weights calculated in the two ways differ for some character, it means the collation works incorrectly.
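
A minimal Python sketch of the second calculation follows; it is not the code used by the MTR tests. The function names are hypothetical. It assumes that these collations compare primary weights only, so WEIGHT_STRING() is expected to contain each non-zero primary weight as two bytes. The implicit weight formula is the one given in the older versions of UTS #10; the caller chooses the base (FB40, FB80 or FBC0), because the exact character ranges depend on the Unicode version.

```python
import re

# Captures the first (primary) field of each collation element,
# e.g. "0E33" from "[.0E33.0020.0008.0041]" or "0209" from "[*0209.0020.0002.0020]".
_PRIMARY = re.compile(r"\[[.*]([0-9A-Fa-f]+)\.")


def parse_allkeys_line(line):
    """Return (code_points, primary_weights), or None if the line has no entry."""
    body = line.split("#", 1)[0].strip()      # drop the trailing comment
    if not body or body.startswith("@") or ";" not in body:
        return None                           # blank, comment or @version line
    chars, elements = body.split(";", 1)
    code_points = [int(tok, 16) for tok in chars.split()]
    primaries = [int(w, 16) for w in _PRIMARY.findall(elements)]
    return code_points, primaries


def implicit_primary_weights(code_point, base):
    """Implicit weights for a character that is not listed in allkeys.txt.

    base is an assumption supplied by the caller: 0xFB40, 0xFB80 or 0xFBC0.
    """
    aaaa = base + (code_point >> 15)
    bbbb = (code_point & 0x7FFF) | 0x8000
    return [aaaa, bbbb]


def expected_weight_hex(primaries):
    """Hex string to compare with HEX(WEIGHT_STRING(...)); zero weights are skipped."""
    return "".join(f"{w:04X}" for w in primaries if w != 0)


# The line below is illustrative, written in the allkeys.txt layout.
line = "0041 ; [.0E33.0020.0008.0041] # LATIN CAPITAL LETTER A"
code_points, primaries = parse_allkeys_line(line)
print(expected_weight_hex(primaries))  # prints 0E33
```

The string returned by expected_weight_hex() would then be compared with HEX(WEIGHT_STRING(...)) for the same character; any mismatch points at a character to investigate.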

The only exception is the character "FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM". It has 18 weights in allkeys.txt, while MariaDB has a limit of 8 weights per character.



Comments
Comment by Alexander Barkov [ 2021-12-20 ]

It's also repeatable in an mtr --valgrind run with this smaller script:

--source include/have_utf32.inc
--source include/have_utf8mb4.inc
 
SET NAMES latin1;
 
CREATE TABLE t1 (
  code INT NOT NULL,
  str VARCHAR(1) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL
) ENGINE=MyISAM;
 
DELIMITER $$;
FOR i IN 0x0000..0x2FFF
DO
  INSERT INTO t1 VALUES (i, CHAR(i USING utf32));
END FOR;
$$
DELIMITER ;$$
SELECT COUNT(*) FROM t1;
 
SELECT HEX(code), HEX(str) FROM t1 ORDER BY HEX(str);
 
DROP TABLE t1;
