  1. MariaDB Server
  2. MDEV-32127

direct comparison of strings in different charsets

Details

    • Task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • Character Sets
    • None

    Description

      Since 10.10.1 (MDEV-27009) MariaDB has detached collations from character sets.

      A character set defines a set of characters, which sequence of bytes represents each character, which properties each character has, and so on. Examples of character sets are utf8mb3, ucs2, latin1, sjis.
      A collation defines how sequences of characters are compared. Examples of collations are uca1400_latvian_ai_ci, latin1_german2_ci, sjis_japanese_ci.
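
      The distinction can be illustrated with a small sketch. It is not MariaDB code: the Python codecs only stand in for the MariaDB character sets (utf-16-be is used in place of ucs2, which matches for BMP characters), and toy_ci_key is a hypothetical case-insensitive ordering, not a real collation.

```python
# Character set: the same character is stored as different byte sequences.
ch = "ä"
codecs_for = {"latin1": "latin-1", "utf8mb3": "utf-8", "ucs2": "utf-16-be"}
encoded = {name: ch.encode(codec) for name, codec in codecs_for.items()}
# encoded == {"latin1": b"\xe4", "utf8mb3": b"\xc3\xa4", "ucs2": b"\x00\xe4"}

# Collation: the same characters are ordered differently by different rules.
def binary_key(s: str) -> bytes:
    """Order by raw UTF-8 bytes (stands in for a binary collation)."""
    return s.encode("utf-8")

def toy_ci_key(s: str) -> str:
    """Hypothetical case-insensitive ordering (stands in for a _ci collation)."""
    return s.casefold()

words = ["b", "A", "a", "B"]
binary_order = sorted(words, key=binary_key)  # ['A', 'B', 'a', 'b']
ci_order = sorted(words, key=toy_ci_key)      # ['A', 'a', 'b', 'B']
```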

      Some collations apply only to one character set and have the character set as a part of the collation name (e.g. latin1_german2_ci and sjis_japanese_ci), others (like uca1400_latvian_ai_ci) apply to many different character sets.

      When MariaDB internally needs to compare two strings, it first needs to have them both in the same character set. Thus, when two expressions (items) are compared, MariaDB first determines which character set and collation they should be compared in, then wraps them, as needed, in a CONVERT(expr USING charset) function. During execution it then receives both expression results in the same character set and compares them.
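
      A minimal sketch of this convert-then-compare flow, written in Python rather than server code. All names here are hypothetical: convert only loosely mimics CONVERT(expr USING charset), toy_weights is a stand-in for a real collation, and the common character set is simply passed in as a parameter because the rules the server uses to pick it are not described here.

```python
def convert(data: bytes, from_cs: str, to_cs: str) -> bytes:
    """Rough analogue of CONVERT(expr USING charset): re-encode the whole string."""
    return data.decode(from_cs).encode(to_cs)

def toy_weights(data: bytes, charset: str) -> list[int]:
    """Hypothetical collation: one weight per case-folded character."""
    return [ord(c) for c in data.decode(charset).casefold()]

def compare_same_charset(a: bytes, b: bytes, charset: str) -> int:
    """Compare two strings that are already in the same character set."""
    wa, wb = toy_weights(a, charset), toy_weights(b, charset)
    return (wa > wb) - (wa < wb)

def compare_via_conversion(a: bytes, a_cs: str, b: bytes, b_cs: str,
                           common_cs: str = "utf-8") -> int:
    """Convert both operands to common_cs first, then compare them."""
    a_conv = convert(a, a_cs, common_cs)  # full copy of the first operand
    b_conv = convert(b, b_cs, common_cs)  # full copy of the second operand
    return compare_same_charset(a_conv, b_conv, common_cs)

left = "Äpfel".encode("latin-1")
right = "äpfel".encode("utf-16-be")
result = compare_via_conversion(left, "latin-1", right, "utf-16-be")  # 0 (equal)
```

      Both operands are fully re-encoded before a single weight is compared, which is the conversion cost this task is concerned with.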

      Note that for this to succeed, the server must determine one single collation that can be used to compare the results of both expressions; this is logically unavoidable. Strictly speaking, however, there is no need to convert both results to the same character set. If they are in different character sets but can be compared according to one specific collation, then that collation must apply to both character sets. This is only true for the UCA collations. Comparison in a UCA collation generally works like this:

      • read the next weight from the first string (read characters as needed, convert to the weight)
      • read the next weight from the second string
      • compare weights
      • repeat

      Converting characters to weights is a character-set-dependent operation, but it is very easy to modify the above loop to use different character sets for the first and the second string; doing so does not affect how the weights are compared.
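
      The loop above can be sketched in Python as follows, assuming a drastically simplified collation. The functions weights and compare_direct are hypothetical, and the "weight" is just the code point of the case-folded character; real UCA collations use multi-level weights, contractions and expansions. The point is only that each string is decoded from its own character set while the weight comparison stays unchanged.

```python
import codecs
from itertools import zip_longest
from typing import Iterator

def weights(data: bytes, charset: str) -> Iterator[int]:
    """Lazily yield toy weights, reading bytes of `data` only as needed."""
    decoder = codecs.getincrementaldecoder(charset)()
    for i in range(len(data)):
        # Feed one byte at a time; the decoder returns "" until a full
        # character in this character set has been read.
        for ch in decoder.decode(data[i:i + 1]):
            for folded in ch.casefold():
                yield ord(folded)  # toy weight, not a real UCA weight
    decoder.decode(b"", final=True)  # raises if the input ends mid-character

def compare_direct(a: bytes, a_cs: str, b: bytes, b_cs: str) -> int:
    """Compare weight by weight; each operand stays in its own character set."""
    # fillvalue=-1 is below every real weight, so a proper prefix sorts first.
    for wa, wb in zip_longest(weights(a, a_cs), weights(b, b_cs), fillvalue=-1):
        if wa != wb:
            return -1 if wa < wb else 1
    return 0

left = "Äpfel".encode("latin-1")
right = "äpfel".encode("utf-16-be")
result = compare_direct(left, "latin-1", right, "utf-16-be")  # 0 (equal)
```

      No converted copy of either string is built, and the comparison can stop at the first differing weight.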

      Comparing strings directly in this way will eliminate fragile expression tree rewrites and expensive character set conversions.

            People

              bar Alexander Barkov
              serg Sergei Golubchik
              Votes: 0
              Watchers: 1