[MDEV-11199] Czech Unicode collations are wrong Created: 2016-11-01 Updated: 2016-11-15 Resolved: 2016-11-15 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Character Sets |
| Affects Version/s: | 10.2 |
| Fix Version/s: | 10.2.3 |
| Type: | Bug | Priority: | Major |
| Reporter: | Tomáš Pecina | Assignee: | Alexander Barkov |
| Resolution: | Not a Bug | Votes: | 0 |
| Labels: | Czech, collation, contribution | ||
| Environment: |
all |
||
| Description |
|
Collation rules for the Czech language do not comply with the standard rules defined by the relevant standardization document (for an explanation in Czech, see the Czech Wikipedia). A simple fix to strings/ctype-uca.c solved the issue; the patch is attached. |
| Comments |
| Comment by Alexander Barkov [ 2016-11-01 ] | ||||||
|
We're using the Unicode Common Locale Data Repository as an authority for collations. Can you paste an example SQL script demonstrating that utf8_czech_ci returns a wrong result? Thanks! | ||||||
| Comment by Tomáš Pecina [ 2016-11-01 ] | ||||||
|
The XML file, besides being based on a source predating the new Czech rules (which were issued in 1994), is completely off the mark: it implements only Level One collation rules, ignoring Level Two. Apart from wrong ordering in some cases, the current collation makes different characters appear to be equal, so that, e.g., SELECT 'a' COLLATE utf8_czech_ci = 'á' COLLATE utf8_czech_ci yields 1 instead of 0. My patch fixes these issues. For fairness' sake, it should be noted that full implementation of the rules would require a longer list as there are rules for foreign diacritical marks as well, which I ignore (except for the German umlaut, which is very common in Czech surnames). | ||||||
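To make the Level One versus Level Two distinction above concrete, here is a minimal Python sketch. It is not the UCA implementation in strings/ctype-uca.c and not the Czech tailoring (for instance, it does not treat č, ř, š, ž or ch as separate letters). The function names are hypothetical, and the rule that "primary means base letter, secondary means combining marks" is a simplifying assumption based on Unicode canonical decomposition.

```python
import unicodedata


def primary_key(s):
    """Level One (assumed simplification): base letters only, case and diacritics ignored."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return tuple(ch for ch in decomposed if not unicodedata.combining(ch))


def secondary_key(s):
    """Level Two (assumed simplification): the combining marks attached to each base letter."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    marks = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            if not marks:
                marks.append(())  # stray leading mark with no base letter
            marks[-1] = marks[-1] + (ord(ch),)
        else:
            marks.append(())
    return tuple(marks)


def equal_one_level(a, b):
    """Compare on primary weights only, as a one-level collation does."""
    return primary_key(a) == primary_key(b)


def equal_two_level(a, b):
    """Compare on primary weights, then on secondary weights."""
    return (primary_key(a), secondary_key(a)) == (primary_key(b), secondary_key(b))


print(equal_one_level("a", "á"))  # True: the accent is invisible at Level One
print(equal_two_level("a", "á"))  # False: Level Two sees the acute accent
```

Under these assumptions, the one-level comparison reproduces the reported behaviour ('a' equal to 'á'), while the two-level comparison distinguishes them.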
| Comment by Alexander Barkov [ 2016-11-01 ] | ||||||
|
It seems that you need a 2-level collation, which takes secondary differences into account during comparison and therefore treats letters with different diacritical marks as non-equal. The good news is that we recently added support for 2-level collations and added utf16_thai_520_w2 as an example. Please check the attached patch utf8_czech_520_w2.diff. It compares 'a' and 'á' as non-equal:
Does it work as expected otherwise? | ||||||
| Comment by Tomáš Pecina [ 2016-11-01 ] | ||||||
|
Yes, this is a valid solution. Thanks! | ||||||
| Comment by Alexander Barkov [ 2016-11-02 ] | ||||||
|
Great! Thanks for testing! 2. Can you send the scripts that you used for testing, so we can include them into our Thanks! | ||||||
| Comment by Tomáš Pecina [ 2016-11-02 ] | ||||||
|
1. No, there are no Czech contractions. However, it is imperative that either the two-level algorithm is used (the tailoring being unchanged), or my patch is applied to the existing one-level code. What the patch actually does is mimic Level Two rules using Level One. The nuances are so subtle that at least 99.9% of native Czech speakers will never tell the difference, but to achieve full conformity with the standard, one level is not enough. 2. It is attached. I have already deleted my test build, so I cannot test it against your patch without a lot of compilation; all I can say is that it tests both L1 and L2 rules and that it passes on PostgreSQL, while failing rather miserably on MariaDB. | ||||||
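The point that one level is not enough can be illustrated with a toy example. The sketch below assumes one plausible reading of "mimicking Level Two using Level One": the accented letter gets its own primary weight just after its base letter. The three-letter alphabet and all weights are hypothetical and are not taken from the attached patch or from the Czech standard.

```python
# Hypothetical toy weights; not the real Czech tailoring.
ONE_LEVEL = {"a": 10, "á": 11, "b": 20}  # accent folded into a distinct primary weight
PRIMARY = {"a": 10, "á": 10, "b": 20}    # same primary weight for a and á
SECONDARY = {"a": 0, "á": 1, "b": 0}     # accent recorded only at the second level


def one_level_key(word):
    """Single-level key: accents change the primary weight itself."""
    return [ONE_LEVEL[ch] for ch in word]


def two_level_key(word):
    """Two-level key: all primary weights first, secondary weights break ties."""
    return ([PRIMARY[ch] for ch in word], [SECONDARY[ch] for ch in word])


words = ["áa", "ab", "aa"]
print(sorted(words, key=one_level_key))  # ['aa', 'ab', 'áa']
print(sorted(words, key=two_level_key))  # ['aa', 'áa', 'ab']
```

Both keys treat 'a' and 'á' as different, but only the two-level key keeps "áa" next to "aa". In the one-level emulation the accent on the first letter outweighs the b in "ab", which is the kind of subtle ordering difference described above.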
| Comment by Alexander Barkov [ 2016-11-08 ] | ||||||
|
Hi Tomáš, Unfortunately, we cannot add this new collation into 10.2 right now. So for now I suggest a simple workaround. I just pushed a patch into 10.2 which makes it possible to define 2-level collations in the collation file Index.xml. You just need to compile the latest 10.2 sources (or wait for the next 10.2.3 release), install, and add an XML fragment into your Index.xml file, as described in Thanks for the file with tests. I used it (with a small modification) in the patch: | ||||||
| Comment by Alexander Barkov [ 2016-11-15 ] | ||||||
|
I'm closing this issue as "not a bug". |