[MDEV-12110] Inavlid polish collation Created: 2017-02-22  Updated: 2017-10-18  Resolved: 2017-10-18

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.0.27-galera
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Marek Assignee: Alexander Barkov
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

openSUSE 42.2 (x86_64)



 Description   

Sample table:

CREATE TABLE `product` (
	`id` INT(11) NOT NULL AUTO_INCREMENT,
	`productName` VARCHAR(255) NULL DEFAULT NULL COLLATE 'utf8_polish_ci',
	PRIMARY KEY (`id`)
)
COLLATE='utf8_polish_ci'
ENGINE=InnoDB
 
INSERT INTO `product` (`id`, `productName`) VALUES (1, 'maka');
INSERT INTO `product` (`id`, `productName`) VALUES (2, 'mąka');
INSERT INTO `product` (`id`, `productName`) VALUES (3, 'Mąka');
INSERT INTO `product` (`id`, `productName`) VALUES (4, 'lodz');
INSERT INTO `product` (`id`, `productName`) VALUES (5, 'łódź');
INSERT INTO `product` (`id`, `productName`) VALUES (6, 'Łódź');

Now I don't understand why query:

SELECT * FROM `product` WHERE productName collate utf8_polish_ci LIKE '%maka%';

returns only 1 row. Forcing utf8_unicode_ci shows 3 results, but for:

SELECT * FROM `product` WHERE productName collate utf8_unicode_ci LIKE '%lodz%';

there is again 1 result. Ł and ł letters are not mapped to l. This should work according to http://unicode.org/repos/cldr/tags/latest/common/collation/pl.xml



 Comments   
Comment by Alexander Barkov [ 2017-02-27 ]

This query returns only one row:

SELECT * FROM `product` WHERE productName collate utf8_polish_ci LIKE '%maka%';

because 'a' and 'ą' are different letters. 'ą' is greater than 'a' on the primary level.

This query returns only one row:

SELECT * FROM `product` WHERE productName collate utf8_polish_ci LIKE '%lodz%';

because:

  • 'l' and 'ł' are different letters: 'ł' is greater than 'l' on the primary level.
  • 'o' and 'ó' are different letters: 'ó' is greater than 'o' on the primary level.
  • 'z' and 'ź' are different letters: 'ź' is greater than 'z' on the primary level.

So utf8_polish_ci works in full accordance with:
http://unicode.org/repos/cldr/tags/latest/common/collation/pl.xml
I don't see any bugs here.

Now let's check utf8_unicode_ci:

This query returns three rows:

SELECT * FROM `product` WHERE productName collate utf8_unicode_ci LIKE '%maka%';

because in the default weight table, 'a' and 'ą' are equal on the primary level.

This query returns one one:

SELECT * FROM `product` WHERE productName collate utf8_unicode_ci LIKE '%lodz%';

because 'l' and 'ł' were different letters in the default weight table for Unicode-4.x:
ftp://ftp.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt

Notice, the two letters have a different primary weight:

006C  ; [.0F2E.0020.0002.006C] # LATIN SMALL LETTER L
0142  ; [.0F36.0020.0002.0142] # LATIN SMALL LETTER L WITH STROKE

Notice, 0F2E vs 0F36.

utf8_unicode_ci is based on Unicode-4.x, so it inherits all Unicode-4.x features.
So utf8_unicode_ci works according to Unicode-4.x, as it's supposed to. There are no bugs here.

In later versions, Unicode changed the default weight table. In Unicode-5.x, 'l' and 'ł' are already primary equal letters:
ftp://ftp.unicode.org/Public/UCA/5.1.0/allkeys.txt

006C  ; [.1262.0020.0002.006C] # LATIN SMALL LETTER L
0142  ; [.1262.0020.0002.006C][.0000.008D.0002.0335] # LATIN SMALL LETTER L WITH STROKE; QQCM

Notice, they have the same primary weight of 1262.

So you can use this query to get the desired result:

SELECT * FROM `product` WHERE productName collate utf8_unicode_520_ci LIKE '%lodz%'

It returns three rows.

Generated at Thu Feb 08 07:55:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.