[MDEV-11199] Czech Unicode collations are wrong Created: 2016-11-01  Updated: 2016-11-15  Resolved: 2016-11-15

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.2
Fix Version/s: 10.2.3

Type: Bug Priority: Major
Reporter: Tomáš Pecina Assignee: Alexander Barkov
Resolution: Not a Bug Votes: 0
Labels: Czech, collation, contribution
Environment:

all


Attachments: File mariadb-utf8-czech-collation.patch     File test1.sql     File utf8_czech_520_w2.diff    
Issue Links:
Relates
relates to MDEV-10132 utf8_thai_520_w2 collation Closed
relates to MDEV-11256 Unicode 9.0.0 Open
relates to MDEV-11255 LDML: allow defining 2-level UCA coll... Closed

 Description   

Collation rules for the Czech language do not comply to the standard rules as defined by the relevant standardization document (for an explanation in Czech see the Czech Wikipedia). A simple fix to strings/ctype-uca.c solved the issue; the patch is attached.



 Comments   
Comment by Alexander Barkov [ 2016-11-01 ]

We're using the Unicode Common Locale Data Repository as an authority for collations.
In particular, the utf8_czech_ci collation is implemented according to these rules:
http://unicode.org/repos/cldr/tags/latest/common/collation/cs.xml

Can you past an example SQL script demonstrating that utf8_czech_ci returns a wrong result?

Thanks!

Comment by Tomáš Pecina [ 2016-11-01 ]

The XML file, besides being based on a source predating the new Czech rules (which were issued in 1994), is completely off the mark: it implements only Level One collation rules, ignoring Level Two.

Apart from wrong ordering in some cases, the current collation makes different characters appear to be equal, so that, e.g., SELECT 'a' COLLATE utf8_czech_ci = 'á' COLLATE utf8_czech_ci yields 1 instead of 0.

My patch fixes these issues. For fairness' sake, it should be noted that full implementation of the rules would require a longer list as there are rules for foreign diacritical marks as well, which I ignore (except for the German umlaut, which is very common in Czech surnames).

Comment by Alexander Barkov [ 2016-11-01 ]

It seems that you need a 2-level collation which would take into account secondary difference on comparison and therefore compare letters with different diacritic marks as non-equal.

The good news is that we recently added support for 2-level collations and added utf16_thai_520_w2 as an example.

Please check the attached patch utf8_czech_520_w2.diff. It compares 'a' and 'á' as non-equal:

MariaDB [test]> SELECT 'a' COLLATE utf8_czech_520_w2 = 'á';
+--------------------------------------+
| 'a' COLLATE utf8_czech_520_w2 = 'á'  |
+--------------------------------------+
|                                    0 |
+--------------------------------------+

Does it work as expected otherwise?

Comment by Tomáš Pecina [ 2016-11-01 ]

Yes, this is a valid solution. Thanks!

Comment by Alexander Barkov [ 2016-11-02 ]

Great! Thanks for testing!
I have two things to discuss before it can be added to the main distribution:
1. Is it important for the new Czech collation to support Thai contractions?
There are two choices:
a. Support Thai contractions. In this case the new collation will be closer to the Unicode canonical sorting order.
b. Ignore Thai contractions. In this case it will be more consistent with utf8_czech_ci.
I'm inclined toward "b". What do you think?

2. Can you send the scripts that you used for testing, so we can include them into our
standard test suite?

Thanks!

Comment by Tomáš Pecina [ 2016-11-02 ]

1. No, there are no Czech contractions. However, it is imperative that either the two-level algorithm is used (the tailoring being unchanged), or my patch is applied to the existing one-level code. What it actually does is it mimics Level Two rules using Level One. The nuances are so subtle at least 99,9% of native Czech speakers will never tell the difference, but to achieve full conformity with the standard, one level is not enough.

2. It is attached. I have already deleted my test build so I cannot test it against your patch without a lot of compilation, all I can say it tests both L1 and L2 rules and it passes on PostgreSQL, while failing rather misearbly on MariaDB.

Comment by Alexander Barkov [ 2016-11-08 ]

Hi Tomáš,

Unfortunately, we cannot add this new collation into 10.2 right now.
The problem is that Unicode version 5.2.0 is quite out of the date. It would be nice to add support for the latest Unicode-9.0.0 first, and then create the new collation based on Unicode-9.0.0. Wasting collations IDs for the 5.2.0 based collations is not a good idea.
Adding Unicode-9.0.0 support will need some efforts to see the latest changes in MySQL (as it adds Unicode-9.0.0 support in the latest 8.0 release).
So we'll most likely add Unicode-9.0.0 only into MariaDB-10.3.

So for now I suggest a simple workaround. I just pushed a patch into 10.2 which makes it possible to define 2-level collations in the collation file Index.xml,
which resides in the $prefix/share/charsets directory (usually in /usr/share/mariadb/charsets).

You just need to compile the latest 10.2 sources (or wait for the next 10.2.3 release), install, and add an XML fragment into your Index.xml file, as described in MDEV-11255.

Thanks for the file with tests. I used it (with a small modification) in the patch:
https://github.com/MariaDB/server/commit/0259b3cbbe47448beb385a33348af56300004aa6

Comment by Alexander Barkov [ 2016-11-15 ]

I'm closing this issue as "not a bug".

Generated at Thu Feb 08 07:48:04 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.