[MDEV-25904] New collation functions to compare InnoDB style trimmed NO PAD strings - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 10.4.23, 10.5.14, 10.6.6, 10.7.2, 10.8.1
Component/s: Character Sets, Storage Engine - InnoDB
Labels:
None

Description

How InnoDB trailing space trimming causes problems with NO PAD collations

Unless ROW_FORMAT=REDUNDANT, a CHAR(N) column of a variable-length encoding such as UTF-8 or UTF-16 is actually stored as variable length in InnoDB, occupying N*mbminlen to N*mbmaxlen bytes. For example, in a CHAR(3) CHARACTER SET utf8mb3 column the value 'äh' is stored in exactly 3 bytes, using 2 bytes for the first character and 1 byte for the next one. Trailing spaces are trimmed in order to reach the minimum column value of 3*1 bytes: it removes as many trailing spaces as possible, but it keeps the minimum of N*mbminlen bytes.

The reason why the CHAR(N) value length is between N*mbminlen and N*mbmaxlen is to maximize the opportunity for update-in-place such that no size changes will occur.

In other engines (and in InnoDB with no trimming formats) values are stored as N*mbmaxlen bytes. For example, in case of CHAR(3), 'äh' is stored as 'äh ', notice six trailing spaces. This makes 9 bytes total:

2 bytes for 'ä'
1 bytes for 'h'
6 trailing spaces

This InnoDB trailing space trimming is not taken into account when calling the collation library comparison routines such as strnncollsp(), which expects to get values with a proper amount of trailing spaces padded. See ~~MDEV-25440~~ for a wrong behaviour example. At the low level, such as cmp_data_data(), InnoDB only has the column lengths in bytes, not in characters at all. If, after the trimming, one of the CHAR(3) strings is 'äh' and another is 'ah ' (note one trailing space), than a regular strnncollsp() call returns a wrong result in case of NOPAD collations:

SET NAMES utf8 COLLATE utf8_general_nopad_ci;

SELECT strcmp('äh','ah ');

+---------------------+

| strcmp('äh','ah ')  |

+---------------------+

|                  -1 |

+---------------------+

In case of other engines (or not trimming InnoDB formats), which pad up to N*mbmaxlen, the comparison code goes through Field_string::cmp() which gets values padded up to N*mbmaxlen, then it trims redundant trailing spaces (using a charpos() call), then calls strnncollsp(). So latter gets the following values (with CHAR_LENGTH equal to 3 in both arguments):

SET NAMES utf8 COLLATE utf8_general_nopad_ci;

SELECT strcmp('äh ','ah '); -- Notice one trailing space in both values

+----------------------+

| strcmp('äh ','ah ')  |

+----------------------+

|                    0 |

+----------------------+

Some theory on not trimmed strings NO PAD

Suppose we need to compare two not trimmed CHAR(N) strings ustr1 and ustr2, both padded in the way that CHAR_LENGTH is equal to N. Note, they can have different octet lengths.

For NO PAD collations the following should be true:

Axiom 1a. We can append the same amount of spaces to ustr1 and ustr2. The result of strnncollsp() won't change.
Axiom 1b. We can trim the same amount of spaces from ustr1 and ustr2. The result of strnncollsp() won't change.
Axiom 1c. We can NOT trim or append different amount of spaces. The result will change. The above example demonstrates this.

Some theory on InnoDB trimmed strings

Inside the InnoDB, we have two trimmed strings str1 and str2 both represented as a pointer and its length.

We know that the original non-trimmed representations ustr1 and ustr2 were of the same CHAR_LENGTH initially, which was equal to N for both strings.

But we trimmed some spaces from both strings. Note, in general case we removed different amount of spaces from the two strings!

We don't know the exact number of spaces we removed from each string, because:

We don't have access to N in the relevant part of the InnoDB code.
Even if we knew the extact N, we don't want to calculate CHAR_LENGTH - it would mean an extra loop other the two strings, which would be too slow.

Given the above, let's iterate over the two trimmed strings:

scanning and comparing weights, and
counting number of characters at the same time.

The following cases are possibles.

Case 1. Different weights found

If at some point we find two different weights - we just return the result - the two strings are different.

Case 2. Weights ended at the same time.

If the loop comparing weights reached the end of the two strings at the same time. We know CHAR_LENGTH(str1) and CHAR_LENGTH(str2) of the two trimmed strings str1 and str2 at this point.

If CHAR_LENGTH(str1) == CHAR_LENGTH(str2), then we removed the same amount of trailing spaces from ustr1 and ustr2. That means they were equal.
If CHAR_LENGTH(str1) < CHAR_LENGTH(str2), then ustr1 was smaller than ustr2.
If CHAR_LENGTH(str1) > CHAR_LENGTH(str2), then ustr1 was greater than ustr2.

Case 3. One of the strings has more weights.

Suppose the loop comparing weights reached the end of the two strings, but str2 still has some characters left. We know OCTET_LENGTH(str1). We don't know CHAR_LENGTH(str2) yet - we only know it still has more characters. That means we trimmed more trailing spaces from ustr1 than from ustr2. So we now need to compare the tail of str2 against REPEAT(' ', CHAR_LENGTH(str2)-CHAR_LENGTH(str1)) and return its result.

Suppose the opposite happened: str1 still has some characters, but str2 has ended. We compare the tail of str1 against REPEAT(' ', CHAR_LENGTH(str1)-CHAR_LENGTH(str2))

New functions proposal

We need two new things in the collation library:

A loop scanning and comparing weights, and counting CHAR_LENGTHs of the two strings at the same time.
A loop comparing a string to REPEAT(' ',NSPACES) to handle Case 3. Or just doing "scan weights and compare against the weight for the space character", so we don't have to know and pass the exact NSPACES.

Something like this should do the job:

typeef struct

  size_t octet_length; /* the number of bytes that we have scanned */

  size_t char_length;  /* the number of characters that we have scanned */

} STRCOLL_STATUS;

int strnncollsp_ex(CHARSET_INFO *cs,

                   const uchar *str1, size_t len1,

                   const uchar *str2, size_t len2,

                   STRCOLL_STATUS *st1,

                   STRCOLL_STATUS *st2);

int compare_to_spaces(CHARSET_INFO *cs,

                      const uchar *str, size_t len);

Then, InnoDB can perform trimmed NO PAD string comparison using this functions:

int innodb_compare_packed_char(CHARSET_INFO *cs,

                               const uchar *str1, size_t len1,

                               const uchar *str2, size_t len2)

  STRCOLL_STATUS st1, st2;

  int rc= cs->coll->strnncollsp_ex(cs, str1, len1, str2, len2, &st1, &st2);

  if (rc !=0)

    return rc; /* Case 1: a weight difference was found *

  if (st1.octet_length == len1 &&

      st2.octet_length == len2)

    /* Case2: Both strings ended at the same time *

    return st1.char_length == st2.char_length ? 0 :

           st1.char_length < st2.char_length  ? -1 : 1;

  /* Case 3: One of the strings appears to be longer (in terms of characters) */

  return st1.octet_length < len1 ?

    +cs->coll->compare_to_spaces(cs, str1 + st1.octet_length, len1 - st1.octet_length) :

    -cs->coll->compare_to_spaces(cs, str2 + st2.octet_length, len2 - st2.octet_length);

Note, the function innodb_compare_packed_char() could actually reside inside the collation library itself.

Attachments

Issue Links

causes

MDEV-30034 UNIQUE USING HASH accepts duplicate entries for tricky collations

Closed

MDEV-30072 Wrong ORDER BY for a partitioned prefix key + NOPAD

Closed

MDEV-30095 Unexpected duplicate entry error for UNIQUE USING HASH + NOPAD

Open

is duplicated by

MDEV-25449 Add MY_COLLATION_HANDLER::strnncollsp_nchars()

Closed

is part of

MDEV-25440 Assertion `cmp_rec_rec(rec, old_rec, offsets, old_offsets, m_index) > 0' failed in PageBulk::insert

Closed

relates to

MDEV-25449 Add MY_COLLATION_HANDLER::strnncollsp_nchars()

Closed

MDEV-26744 MyISAM, Aria, MEMORY: CHAR+nopad does not work well

Open

MDEV-27552 Change the return type of my_uca_context_weight_find() to MY_CONTRACTION*

Closed

MDEV-33812 Index ingnores trailing spaces

Closed

MDEV-26743 InnoDB: CHAR+nopad does not work well

Closed

MDEV-27670 Assertion `(cs->state & 0x20000) == 0' failed in my_strnncollsp_nchars_generic_8bit

Closed

MDEV-27768 MDEV-25440: Assertion `(cs->state & 0x20000) == 0' failed in my_strnncollsp_nchars_generic_8bit

Closed

MDEV-30048 Prefix keys for CHAR work differently for MyISAM vs InnoDB

Closed

MDEV-30050 Inconsistent results of DISTINCT with NOPAD

Closed

(9 relates to)

Activity

People

Assignee:: Alexander Barkov

Reporter:: Alexander Barkov

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2021-06-13 05:52

Updated:: 2024-04-03 05:39

Resolved:: 2022-01-21 09:17

Time Tracking

Estimated:

Remaining:

Logged:

4d 4.5h

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.