Details

    Description

      Collation rules for the Czech language do not comply to the standard rules as defined by the relevant standardization document (for an explanation in Czech see the Czech Wikipedia). A simple fix to strings/ctype-uca.c solved the issue; the patch is attached.

      Attachments

        Issue Links

          Activity

            We're using the Unicode Common Locale Data Repository as an authority for collations.
            In particular, the utf8_czech_ci collation is implemented according to these rules:
            http://unicode.org/repos/cldr/tags/latest/common/collation/cs.xml

            Can you past an example SQL script demonstrating that utf8_czech_ci returns a wrong result?

            Thanks!

            bar Alexander Barkov added a comment - We're using the Unicode Common Locale Data Repository as an authority for collations. In particular, the utf8_czech_ci collation is implemented according to these rules: http://unicode.org/repos/cldr/tags/latest/common/collation/cs.xml Can you past an example SQL script demonstrating that utf8_czech_ci returns a wrong result? Thanks!

            The XML file, besides being based on a source predating the new Czech rules (which were issued in 1994), is completely off the mark: it implements only Level One collation rules, ignoring Level Two.

            Apart from wrong ordering in some cases, the current collation makes different characters appear to be equal, so that, e.g., SELECT 'a' COLLATE utf8_czech_ci = 'á' COLLATE utf8_czech_ci yields 1 instead of 0.

            My patch fixes these issues. For fairness' sake, it should be noted that full implementation of the rules would require a longer list as there are rules for foreign diacritical marks as well, which I ignore (except for the German umlaut, which is very common in Czech surnames).

            tompecina Tomáš Pecina added a comment - The XML file, besides being based on a source predating the new Czech rules (which were issued in 1994), is completely off the mark: it implements only Level One collation rules, ignoring Level Two. Apart from wrong ordering in some cases, the current collation makes different characters appear to be equal, so that, e.g., SELECT 'a' COLLATE utf8_czech_ci = 'á' COLLATE utf8_czech_ci yields 1 instead of 0. My patch fixes these issues. For fairness' sake, it should be noted that full implementation of the rules would require a longer list as there are rules for foreign diacritical marks as well, which I ignore (except for the German umlaut, which is very common in Czech surnames).

            It seems that you need a 2-level collation which would take into account secondary difference on comparison and therefore compare letters with different diacritic marks as non-equal.

            The good news is that we recently added support for 2-level collations and added utf16_thai_520_w2 as an example.

            Please check the attached patch utf8_czech_520_w2.diff. It compares 'a' and 'á' as non-equal:

            MariaDB [test]> SELECT 'a' COLLATE utf8_czech_520_w2 = 'á';
            +--------------------------------------+
            | 'a' COLLATE utf8_czech_520_w2 = 'á'  |
            +--------------------------------------+
            |                                    0 |
            +--------------------------------------+
            

            Does it work as expected otherwise?

            bar Alexander Barkov added a comment - It seems that you need a 2-level collation which would take into account secondary difference on comparison and therefore compare letters with different diacritic marks as non-equal. The good news is that we recently added support for 2-level collations and added utf16_thai_520_w2 as an example. Please check the attached patch utf8_czech_520_w2.diff. It compares 'a' and 'á' as non-equal: MariaDB [test]> SELECT 'a' COLLATE utf8_czech_520_w2 = 'á'; +--------------------------------------+ | 'a' COLLATE utf8_czech_520_w2 = 'á' | +--------------------------------------+ | 0 | +--------------------------------------+ Does it work as expected otherwise?

            Yes, this is a valid solution. Thanks!

            tompecina Tomáš Pecina added a comment - Yes, this is a valid solution. Thanks!

            Great! Thanks for testing!
            I have two things to discuss before it can be added to the main distribution:
            1. Is it important for the new Czech collation to support Thai contractions?
            There are two choices:
            a. Support Thai contractions. In this case the new collation will be closer to the Unicode canonical sorting order.
            b. Ignore Thai contractions. In this case it will be more consistent with utf8_czech_ci.
            I'm inclined toward "b". What do you think?

            2. Can you send the scripts that you used for testing, so we can include them into our
            standard test suite?

            Thanks!

            bar Alexander Barkov added a comment - Great! Thanks for testing! I have two things to discuss before it can be added to the main distribution: 1. Is it important for the new Czech collation to support Thai contractions? There are two choices: a. Support Thai contractions. In this case the new collation will be closer to the Unicode canonical sorting order. b. Ignore Thai contractions. In this case it will be more consistent with utf8_czech_ci. I'm inclined toward "b". What do you think? 2. Can you send the scripts that you used for testing, so we can include them into our standard test suite? Thanks!

            1. No, there are no Czech contractions. However, it is imperative that either the two-level algorithm is used (the tailoring being unchanged), or my patch is applied to the existing one-level code. What it actually does is it mimics Level Two rules using Level One. The nuances are so subtle at least 99,9% of native Czech speakers will never tell the difference, but to achieve full conformity with the standard, one level is not enough.

            2. It is attached. I have already deleted my test build so I cannot test it against your patch without a lot of compilation, all I can say it tests both L1 and L2 rules and it passes on PostgreSQL, while failing rather misearbly on MariaDB.

            tompecina Tomáš Pecina added a comment - 1. No, there are no Czech contractions. However, it is imperative that either the two-level algorithm is used (the tailoring being unchanged), or my patch is applied to the existing one-level code. What it actually does is it mimics Level Two rules using Level One. The nuances are so subtle at least 99,9% of native Czech speakers will never tell the difference, but to achieve full conformity with the standard, one level is not enough. 2. It is attached. I have already deleted my test build so I cannot test it against your patch without a lot of compilation, all I can say it tests both L1 and L2 rules and it passes on PostgreSQL, while failing rather misearbly on MariaDB.
            bar Alexander Barkov added a comment - - edited

            Hi Tomáš,

            Unfortunately, we cannot add this new collation into 10.2 right now.
            The problem is that Unicode version 5.2.0 is quite out of the date. It would be nice to add support for the latest Unicode-9.0.0 first, and then create the new collation based on Unicode-9.0.0. Wasting collations IDs for the 5.2.0 based collations is not a good idea.
            Adding Unicode-9.0.0 support will need some efforts to see the latest changes in MySQL (as it adds Unicode-9.0.0 support in the latest 8.0 release).
            So we'll most likely add Unicode-9.0.0 only into MariaDB-10.3.

            So for now I suggest a simple workaround. I just pushed a patch into 10.2 which makes it possible to define 2-level collations in the collation file Index.xml,
            which resides in the $prefix/share/charsets directory (usually in /usr/share/mariadb/charsets).

            You just need to compile the latest 10.2 sources (or wait for the next 10.2.3 release), install, and add an XML fragment into your Index.xml file, as described in MDEV-11255.

            Thanks for the file with tests. I used it (with a small modification) in the patch:
            https://github.com/MariaDB/server/commit/0259b3cbbe47448beb385a33348af56300004aa6

            bar Alexander Barkov added a comment - - edited Hi Tomáš, Unfortunately, we cannot add this new collation into 10.2 right now. The problem is that Unicode version 5.2.0 is quite out of the date. It would be nice to add support for the latest Unicode-9.0.0 first, and then create the new collation based on Unicode-9.0.0. Wasting collations IDs for the 5.2.0 based collations is not a good idea. Adding Unicode-9.0.0 support will need some efforts to see the latest changes in MySQL (as it adds Unicode-9.0.0 support in the latest 8.0 release). So we'll most likely add Unicode-9.0.0 only into MariaDB-10.3. So for now I suggest a simple workaround. I just pushed a patch into 10.2 which makes it possible to define 2-level collations in the collation file Index.xml, which resides in the $prefix/share/charsets directory (usually in /usr/share/mariadb/charsets). You just need to compile the latest 10.2 sources (or wait for the next 10.2.3 release), install, and add an XML fragment into your Index.xml file, as described in MDEV-11255 . Thanks for the file with tests. I used it (with a small modification) in the patch: https://github.com/MariaDB/server/commit/0259b3cbbe47448beb385a33348af56300004aa6

            I'm closing this issue as "not a bug".

            bar Alexander Barkov added a comment - I'm closing this issue as "not a bug".

            People

              bar Alexander Barkov
              tompecina Tomáš Pecina
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.