Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25904

New collation functions to compare InnoDB style trimmed NO PAD strings




      How InnoDB trailing space trimming causes problems with NO PAD collations

      Unless ROW_FORMAT=REDUNDANT, a CHAR(N) column of a variable-length encoding such as UTF-8 or UTF-16 is actually stored as variable length in InnoDB, occupying N*mbminlen to N*mbmaxlen bytes. For example, in a CHAR(3) CHARACTER SET utf8mb3 column the value 'äh' is stored in exactly 3 bytes, using 2 bytes for the first character and 1 byte for the next one. Trailing spaces are trimmed in order to reach the minimum column value of 3*1 bytes: it removes as many trailing spaces as possible, but it keeps the minimum of N*mbminlen bytes.

      The reason why the CHAR(N) value length is between N*mbminlen and N*mbmaxlen is to maximize the opportunity for update-in-place such that no size changes will occur.

      In other engines (and in InnoDB with no trimming formats) values are stored as N*mbmaxlen bytes. For example, in case of CHAR(3), 'äh' is stored as 'äh ', notice six trailing spaces. This makes 9 bytes total:

      • 2 bytes for 'ä'
      • 1 bytes for 'h'
      • 6 trailing spaces

      This InnoDB trailing space trimming is not taken into account when calling the collation library comparison routines such as strnncollsp(), which expects to get values with a proper amount of trailing spaces padded. See MDEV-25440 for a wrong behaviour example. At the low level, such as cmp_data_data(), InnoDB only has the column lengths in bytes, not in characters at all. If, after the trimming, one of the CHAR(3) strings is 'äh' and another is 'ah ' (note one trailing space), than a regular strnncollsp() call returns a wrong result in case of NOPAD collations:

      SET NAMES utf8 COLLATE utf8_general_nopad_ci;
      SELECT strcmp('äh','ah ');

      | strcmp('äh','ah ')  |
      |                  -1 |

      In case of other engines (or not trimming InnoDB formats), which pad up to N*mbmaxlen, the comparison code goes through Field_string::cmp() which gets values padded up to N*mbmaxlen, then it trims redundant trailing spaces (using a charpos() call), then calls strnncollsp(). So latter gets the following values (with CHAR_LENGTH equal to 3 in both arguments):

      SET NAMES utf8 COLLATE utf8_general_nopad_ci;
      SELECT strcmp('äh ','ah '); -- Notice one trailing space in both values

      | strcmp('äh ','ah ')  |
      |                    0 |

      Some theory on not trimmed strings NO PAD

      Suppose we need to compare two not trimmed CHAR(N) strings ustr1 and ustr2, both padded in the way that CHAR_LENGTH is equal to N. Note, they can have different octet lengths.

      For NO PAD collations the following should be true:

      • Axiom 1a. We can append the same amount of spaces to ustr1 and ustr2. The result of strnncollsp() won't change.
      • Axiom 1b. We can trim the same amount of spaces from ustr1 and ustr2. The result of strnncollsp() won't change.
      • Axiom 1c. We can NOT trim or append different amount of spaces. The result will change. The above example demonstrates this.

      Some theory on InnoDB trimmed strings

      Inside the InnoDB, we have two trimmed strings str1 and str2 both represented as a pointer and its length.

      We know that the original non-trimmed representations ustr1 and ustr2 were of the same CHAR_LENGTH initially, which was equal to N for both strings.

      But we trimmed some spaces from both strings. Note, in general case we removed different amount of spaces from the two strings!

      We don't know the exact number of spaces we removed from each string, because:

      • We don't have access to N in the relevant part of the InnoDB code.
      • Even if we knew the extact N, we don't want to calculate CHAR_LENGTH - it would mean an extra loop other the two strings, which would be too slow.

      Given the above, let's iterate over the two trimmed strings:

      • scanning and comparing weights, and
      • counting number of characters at the same time.

      The following cases are possibles.

      Case 1. Different weights found

      If at some point we find two different weights - we just return the result - the two strings are different.

      Case 2. Weights ended at the same time.

      If the loop comparing weights reached the end of the two strings at the same time. We know CHAR_LENGTH(str1) and CHAR_LENGTH(str2) of the two trimmed strings str1 and str2 at this point.

      • If CHAR_LENGTH(str1) == CHAR_LENGTH(str2), then we removed the same amount of trailing spaces from ustr1 and ustr2. That means they were equal.
      • If CHAR_LENGTH(str1) < CHAR_LENGTH(str2), then ustr1 was smaller than ustr2.
      • If CHAR_LENGTH(str1) > CHAR_LENGTH(str2), then ustr1 was greater than ustr2.

      Case 3. One of the strings has more weights.

      Suppose the loop comparing weights reached the end of the two strings, but str2 still has some characters left. We know OCTET_LENGTH(str1). We don't know CHAR_LENGTH(str2) yet - we only know it still has more characters. That means we trimmed more trailing spaces from ustr1 than from ustr2. So we now need to compare the tail of str2 against REPEAT(' ', CHAR_LENGTH(str2)-CHAR_LENGTH(str1)) and return its result.

      Suppose the opposite happened: str1 still has some characters, but str2 has ended. We compare the tail of str1 against REPEAT(' ', CHAR_LENGTH(str1)-CHAR_LENGTH(str2))

      New functions proposal

      We need two new things in the collation library:

      • A loop scanning and comparing weights, and counting CHAR_LENGTHs of the two strings at the same time.
      • A loop comparing a string to REPEAT(' ',NSPACES) to handle Case 3. Or just doing "scan weights and compare against the weight for the space character", so we don't have to know and pass the exact NSPACES.

      Something like this should do the job:

      typeef struct
        size_t octet_length; /* the number of bytes that we have scanned */
        size_t char_length;  /* the number of characters that we have scanned */
      int strnncollsp_ex(CHARSET_INFO *cs,
                         const uchar *str1, size_t len1,
                         const uchar *str2, size_t len2,
                         STRCOLL_STATUS *st1,
                         STRCOLL_STATUS *st2);
      int compare_to_spaces(CHARSET_INFO *cs,
                            const uchar *str, size_t len);

      Then, InnoDB can perform trimmed NO PAD string comparison using this functions:

      int innodb_compare_packed_char(CHARSET_INFO *cs,
                                     const uchar *str1, size_t len1,
                                     const uchar *str2, size_t len2)
        STRCOLL_STATUS st1, st2;
        int rc= cs->coll->strnncollsp_ex(cs, str1, len1, str2, len2, &st1, &st2);
        if (rc !=0)
          return rc; /* Case 1: a weight difference was found *
        if (st1.octet_length == len1 &&
            st2.octet_length == len2)
          /* Case2: Both strings ended at the same time *
          return st1.char_length == st2.char_length ? 0 :
                 st1.char_length < st2.char_length  ? -1 : 1;
        /* Case 3: One of the strings appears to be longer (in terms of characters) */
        return st1.octet_length < len1 ?
          +cs->coll->compare_to_spaces(cs, str1 + st1.octet_length, len1 - st1.octet_length) :
          -cs->coll->compare_to_spaces(cs, str2 + st2.octet_length, len2 - st2.octet_length);

      Note, the function innodb_compare_packed_char() could actually reside inside the collation library itself.


          Issue Links



              bar Alexander Barkov
              bar Alexander Barkov
              0 Vote for this issue
              5 Start watching this issue



                  Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.