Unless ROW_FORMAT=REDUNDANT, a CHAR(N) column of a variable-length encoding such as UTF-8 or UTF-16 is actually stored as variable length in InnoDB, occupying N*mbminlen to N*mbmaxlen bytes. For example, in a CHAR(3) CHARACTER SET utf8mb3 column the value 'äh' is stored in exactly 3 bytes, using 2 bytes for the first character and 1 byte for the next one. Trailing spaces are trimmed in order to reach the minimum column value of 3*1 bytes: it removes as many trailing spaces as possible, but it keeps the minimum of N*mbminlen bytes.
The reason why the CHAR(N) value length is between N*mbminlen and N*mbmaxlen is to maximize the opportunity for update-in-place such that no size changes will occur.
In other engines (and in InnoDB with no trimming formats) values are stored as N*mbmaxlen bytes. For example, in case of CHAR(3), 'äh' is stored as 'äh ', notice six trailing spaces. This makes 9 bytes total:
- 2 bytes for 'ä'
- 1 bytes for 'h'
- 6 trailing spaces
This InnoDB trailing space trimming is not taken into account when calling the collation library comparison routines such as strnncollsp(), which expects to get values with a proper amount of trailing spaces padded. See
MDEV-25440 for a wrong behaviour example. At the low level, such as cmp_data_data(), InnoDB only has the column lengths in bytes, not in characters at all. If, after the trimming, one of the CHAR(3) strings is 'äh' and another is 'ah ' (note one trailing space), than a regular strnncollsp() call returns a wrong result in case of NOPAD collations:
In case of other engines (or not trimming InnoDB formats), which pad up to N*mbmaxlen, the comparison code goes through Field_string::cmp() which gets values padded up to N*mbmaxlen, then it trims redundant trailing spaces (using a charpos() call), then calls strnncollsp(). So latter gets the following values (with CHAR_LENGTH equal to 3 in both arguments):
Suppose we need to compare two not trimmed CHAR(N) strings ustr1 and ustr2, both padded in the way that CHAR_LENGTH is equal to N. Note, they can have different octet lengths.
For NO PAD collations the following should be true:
- Axiom 1a. We can append the same amount of spaces to ustr1 and ustr2. The result of strnncollsp() won't change.
- Axiom 1b. We can trim the same amount of spaces from ustr1 and ustr2. The result of strnncollsp() won't change.
- Axiom 1c. We can NOT trim or append different amount of spaces. The result will change. The above example demonstrates this.
Inside the InnoDB, we have two trimmed strings str1 and str2 both represented as a pointer and its length.
We know that the original non-trimmed representations ustr1 and ustr2 were of the same CHAR_LENGTH initially, which was equal to N for both strings.
But we trimmed some spaces from both strings. Note, in general case we removed different amount of spaces from the two strings!
We don't know the exact number of spaces we removed from each string, because:
- We don't have access to N in the relevant part of the InnoDB code.
- Even if we knew the extact N, we don't want to calculate CHAR_LENGTH - it would mean an extra loop other the two strings, which would be too slow.
Given the above, let's iterate over the two trimmed strings:
- scanning and comparing weights, and
- counting number of characters at the same time.
The following cases are possibles.
Case 1. Different weights found
If at some point we find two different weights - we just return the result - the two strings are different.
Case 2. Weights ended at the same time.
If the loop comparing weights reached the end of the two strings at the same time. We know CHAR_LENGTH(str1) and CHAR_LENGTH(str2) of the two trimmed strings str1 and str2 at this point.
- If CHAR_LENGTH(str1) == CHAR_LENGTH(str2), then we removed the same amount of trailing spaces from ustr1 and ustr2. That means they were equal.
- If CHAR_LENGTH(str1) < CHAR_LENGTH(str2), then ustr1 was smaller than ustr2.
- If CHAR_LENGTH(str1) > CHAR_LENGTH(str2), then ustr1 was greater than ustr2.
Case 3. One of the strings has more weights.
Suppose the loop comparing weights reached the end of the two strings, but str2 still has some characters left. We know OCTET_LENGTH(str1). We don't know CHAR_LENGTH(str2) yet - we only know it still has more characters. That means we trimmed more trailing spaces from ustr1 than from ustr2. So we now need to compare the tail of str2 against REPEAT(' ', CHAR_LENGTH(str2)-CHAR_LENGTH(str1)) and return its result.
Suppose the opposite happened: str1 still has some characters, but str2 has ended. We compare the tail of str1 against REPEAT(' ', CHAR_LENGTH(str1)-CHAR_LENGTH(str2))
We need two new things in the collation library:
- A loop scanning and comparing weights, and counting CHAR_LENGTHs of the two strings at the same time.
- A loop comparing a string to REPEAT(' ',NSPACES) to handle Case 3. Or just doing "scan weights and compare against the weight for the space character", so we don't have to know and pass the exact NSPACES.
Something like this should do the job:
Then, InnoDB can perform trimmed NO PAD string comparison using this functions:
Note, the function innodb_compare_packed_char() could actually reside inside the collation library itself.