[MDEV-25904] New collation functions to compare InnoDB style trimmed NO PAD strings Created: 2021-06-13 Updated: 2023-03-14 Resolved: 2022-01-21 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Character Sets, Storage Engine - InnoDB |
| Affects Version/s: | None |
| Fix Version/s: | 10.4.23, 10.5.14, 10.6.6, 10.7.2, 10.8.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Alexander Barkov | Assignee: | Alexander Barkov |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
How InnoDB trailing space trimming causes problems with NO PAD collationsUnless ROW_FORMAT=REDUNDANT, a CHAR(N) column of a variable-length encoding such as UTF-8 or UTF-16 is actually stored as variable length in InnoDB, occupying N*mbminlen to N*mbmaxlen bytes. For example, in a CHAR(3) CHARACTER SET utf8mb3 column the value 'äh' is stored in exactly 3 bytes, using 2 bytes for the first character and 1 byte for the next one. Trailing spaces are trimmed in order to reach the minimum column value of 3*1 bytes: it removes as many trailing spaces as possible, but it keeps the minimum of N*mbminlen bytes. The reason why the CHAR(N) value length is between N*mbminlen and N*mbmaxlen is to maximize the opportunity for update-in-place such that no size changes will occur. In other engines (and in InnoDB with no trimming formats) values are stored as N*mbmaxlen bytes. For example, in case of CHAR(3), 'äh' is stored as 'äh ', notice six trailing spaces. This makes 9 bytes total:
This InnoDB trailing space trimming is not taken into account when calling the collation library comparison routines such as strnncollsp(), which expects to get values with a proper amount of trailing spaces padded. See
In case of other engines (or not trimming InnoDB formats), which pad up to N*mbmaxlen, the comparison code goes through Field_string::cmp() which gets values padded up to N*mbmaxlen, then it trims redundant trailing spaces (using a charpos() call), then calls strnncollsp(). So latter gets the following values (with CHAR_LENGTH equal to 3 in both arguments):
Some theory on not trimmed strings NO PADSuppose we need to compare two not trimmed CHAR(N) strings ustr1 and ustr2, both padded in the way that CHAR_LENGTH is equal to N. Note, they can have different octet lengths. For NO PAD collations the following should be true:
Some theory on InnoDB trimmed stringsInside the InnoDB, we have two trimmed strings str1 and str2 both represented as a pointer and its length. We know that the original non-trimmed representations ustr1 and ustr2 were of the same CHAR_LENGTH initially, which was equal to N for both strings. But we trimmed some spaces from both strings. Note, in general case we removed different amount of spaces from the two strings! We don't know the exact number of spaces we removed from each string, because:
Given the above, let's iterate over the two trimmed strings:
The following cases are possibles. Case 1. Different weights found If at some point we find two different weights - we just return the result - the two strings are different. Case 2. Weights ended at the same time. If the loop comparing weights reached the end of the two strings at the same time. We know CHAR_LENGTH(str1) and CHAR_LENGTH(str2) of the two trimmed strings str1 and str2 at this point.
Case 3. One of the strings has more weights. Suppose the loop comparing weights reached the end of the two strings, but str2 still has some characters left. We know OCTET_LENGTH(str1). We don't know CHAR_LENGTH(str2) yet - we only know it still has more characters. That means we trimmed more trailing spaces from ustr1 than from ustr2. So we now need to compare the tail of str2 against REPEAT(' ', CHAR_LENGTH(str2)-CHAR_LENGTH(str1)) and return its result. Suppose the opposite happened: str1 still has some characters, but str2 has ended. We compare the tail of str1 against REPEAT(' ', CHAR_LENGTH(str1)-CHAR_LENGTH(str2)) New functions proposalWe need two new things in the collation library:
Something like this should do the job:
Then, InnoDB can perform trimmed NO PAD string comparison using this functions:
Note, the function innodb_compare_packed_char() could actually reside inside the collation library itself. |
| Comments |
| Comment by Alexander Barkov [ 2021-10-15 ] |
|
A gentle reminder, a set of patches for https://github.com/MariaDB/server/commits/bb-10.4-MDEV-25440 Thanks. |
| Comment by Sergei Golubchik [ 2022-01-10 ] |
|
6f9f19a3c79e is ok to push |