MariaDB Server
MDEV-26572

Improve simple multibyte collation performance on the ASCII range

Details

    Description

      Binary collations to be improved

      The following binary multi-byte collations (together with their _nopad_bin counterparts):

      • big5_bin
      • cp932_bin
      • eucjpms_bin
      • euckr_bin
      • gb2312_bin
      • gbk_bin
      • sjis_bin
      • ujis_bin
      • utf8mb3_bin
      • utf8mb4_bin

      can improve their performance if, in the following code in strcoll.ic:

      static int
      MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                    const uchar *a, size_t a_length, 
                                    const uchar *b, size_t b_length)
      {
        const uchar *a_end= a + a_length;
        const uchar *b_end= b + b_length;
        for ( ; ; )
        {
          int a_weight, b_weight, res;
          uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
          ...
      

      we catch pure ASCII data and handle 4 or even 8 bytes per iteration by loading the string data into big-endian uint32 or uint64 values and then comparing those two values directly (see the sketch below).
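      A minimal sketch of such a fast path, written as a standalone helper outside strcoll.ic (the names ascii_bin_fastpath and load_be64 are hypothetical, not existing MariaDB functions):

      #include <stdint.h>

      /* Load 8 bytes as a big-endian number so that comparing the integers
         gives the same order as comparing the bytes with memcmp(). */
      static inline uint64_t load_be64(const unsigned char *p)
      {
        return ((uint64_t) p[0] << 56) | ((uint64_t) p[1] << 48) |
               ((uint64_t) p[2] << 40) | ((uint64_t) p[3] << 32) |
               ((uint64_t) p[4] << 24) | ((uint64_t) p[5] << 16) |
               ((uint64_t) p[6] << 8)  |  (uint64_t) p[7];
      }

      /* Compare the common prefix of a and b eight bytes at a time while both
         sides contain only ASCII bytes (high bit clear). Returns -1 or +1 as
         soon as a differing chunk is found, in the same order a byte-wise
         comparison would give. Returns 0 when the consumed prefixes are equal;
         the caller then continues with the existing per-character loop, which
         also handles non-ASCII data and the trailing 1..7 bytes. */
      static int ascii_bin_fastpath(const unsigned char **a, const unsigned char *a_end,
                                    const unsigned char **b, const unsigned char *b_end)
      {
        while (a_end - *a >= 8 && b_end - *b >= 8)
        {
          uint64_t va= load_be64(*a);
          uint64_t vb= load_be64(*b);
          if ((va | vb) & 0x8080808080808080ULL)
            break;                          /* possible multi-byte data: bail out */
          if (va != vb)
            return va < vb ? -1 : 1;        /* difference within this chunk */
          *a+= 8;
          *b+= 8;
        }
        return 0;                           /* equal so far; continue with old code */
      }

      A uint32 variant works the same way for 4-byte chunks. The pad/no-pad handling at the end of strnncollsp stays unchanged, because the fast path only advances past chunks that compare equal on both sides.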

      Case insensitive collations to be improved

      Additionally, the following case insensitive multi-byte collations (and their _nopad_ci counterparts):

      • utf8mb3_general_ci
      • utf8mb3_general_mysql500_ci
      • utf8mb4_general_ci
      • cp932_japanese_ci
      • eucjpms_japanese_ci
      • euckr_korean_ci
      • sjis_japanese_ci
      • ujis_japanese_ci

      can use the same idea, because for ASCII they perform only a trivial mapping from the lower case Latin letters [a-z] to their upper case counterparts [A-Z], and once this mapping is done the comparison is performed in binary style. These collations can do the following on every iteration step:

      • Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
      • Load the two strings into two uint32 or uint64 numbers
      • Perform bulk conversion of all bytes in the two numbers from [61..7A] to [41..5A] (i.e. from [a-z] to [A-Z])
      • Compare the numbers and return if they are different
      • Advance both pointers by 4 or 8 bytes and continue the loop

      Note: the exact way of performing the bulk upper case conversion inside the numbers is left to the developer; one possible approach is sketched below.
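      One possible approach (an illustrative sketch only, not the required implementation; the helper name ascii_toupper64 is hypothetical) is a SWAR-style conversion that subtracts 0x20 from every byte in the range [61..7A], assuming the chunk has already passed the pure-ASCII check:

      #include <stdint.h>

      /* Convert all bytes in the range [0x61..0x7A] ('a'..'z') to
         [0x41..0x5A] ('A'..'Z') in one go. Assumes every byte already has
         its high bit clear (pure ASCII), which the fast path verifies
         before calling this, so the per-byte additions cannot carry into
         the neighbouring byte. */
      static inline uint64_t ascii_toupper64(uint64_t w)
      {
        /* Byte-wise: bit 7 of ge_a is set iff the byte is >= 0x61,
           bit 7 of gt_z is set iff the byte is >= 0x7B. */
        uint64_t ge_a= w + 0x1f1f1f1f1f1f1f1fULL;
        uint64_t gt_z= w + 0x0505050505050505ULL;
        uint64_t is_lower= (ge_a & ~gt_z) & 0x8080808080808080ULL;
        /* 0x80 >> 2 == 0x20: subtract 0x20 from exactly the lower case bytes. */
        return w - (is_lower >> 2);
      }

      With such a helper the per-iteration step becomes: load both chunks big-endian, convert both to upper case, compare the two numbers, and advance by 8 (a uint32 variant covers 4-byte chunks in the same way).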

      Requirements

      The performance of the low level comparison functions can be measured with the BENCHMARK() SQL function, e.g.:

      SET NAMES utf8mb3 COLLATE utf8mb3_general_ci;
      SELECT BENCHMARK(10000000,'aaaaaaaaaaaaaaaa'='aaaaaaaaaaaaaaaa');
      

      The expected performance improvement on pure ASCII data (for strings with octet length >= 4) is between 2 and 3 times, depending on the exact length and collation.

      Note: the changes must not introduce any serious (more than 10%) slowdown for:

      • strings with multi-byte characters
      • short strings 1..3 bytes long

      Collations that won't be changed in this task

      8bit case insensitive collations

      MariaDB also has a number of 8bit case insensitive collations with a trivial toupper mapping on the ASCII range, so they could be optimized in the same way. However, these collations don't use the code mentioned above and have their own implementations, so they will be improved under the terms of a separate task.

      Three Chinese case insensitive collations

      Also, under the terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):

      • big5_chinese_ci
      • gb2312_chinese_ci
      • gbk_chinese_ci

      because all three of these collations additionally change the order of some ASCII punctuation characters:

      Weight  Character name                Character
      0x5B    U+005D RIGHT SQUARE BRACKET   ]
      0x5C    U+005B LEFT SQUARE BRACKET    [
      0x5D    U+005C REVERSE SOLIDUS        \

      So the bulk conversion step needs more effort for these collations, and the proposed optimization may not be efficient there. These collations will be improved later, under the terms of a separate task.

      Case insensitive _general_ci collations for ucs2, utf16, utf32

      These character sets have separate implementations and don't use the mentioned code. They'll be improved under the terms of a separate task.

      Attachments

        Issue Links

          Activity

            bar Alexander Barkov created issue -
            bar Alexander Barkov made changes -
            Field Original Value New Value
            Description h2. Binary collations to be improved

            The following binary collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            Note, under terms of this task we won't change the following multibyte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h2. Binary collations to be improved

            The following binary multibyte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            Note, under terms of this task we won't change the following multibyte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            bar Alexander Barkov made changes -
            Description h2. Binary collations to be improved

            The following binary multibyte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            Note, under terms of this task we won't change the following multibyte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimizes in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code.

            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            bar Alexander Barkov made changes -
            Description h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimizes in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code.

            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code.

            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            bar Alexander Barkov made changes -
            Description h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code.

            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code and have their own implementations.

            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            bar Alexander Barkov made changes -
            Description h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The changes must be done in a way not to bring any serios slow down for multi-byte data!


            h2. Collations that won't be changed in this task

            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code and have their own implementations.

            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because these collations additionally change the order of these ASCII characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The expected performance improvement on the pure ASCII range for strings 4 or more bytes long is between 2 and 3 times (depending on the exact length and collation).

            Note, the changes must be done in a way not to bring any serious (more than 10%) slow down for:
            - strings with multi-byte characters
            - short strings 1..3 bytes long


            h2. Collations that won't be changed in this task

            h3. 8bit case insensitive collations
            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code and have their own implementations.

            h3. Three Chinese case insensitive collations
            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because all these three collations additionally change the order of some ASCII punctuation characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h3. Case insensitive collations for ucs2, utf16, utf32
            These character sets have separate implementations and don't use the mentioned code. They'll be improved under terms of a separate task.

            bar Alexander Barkov made changes -
            Description h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The expected performance improvement on the pure ASCII range for strings 4 or more bytes long is between 2 and 3 times (depending on the exact length and collation).

            Note, the changes must be done in a way not to bring any serious (more than 10%) slow down for:
            - strings with multi-byte characters
            - short strings 1..3 bytes long


            h2. Collations that won't be changed in this task

            h3. 8bit case insensitive collations
            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code and have their own implementations.

            h3. Three Chinese case insensitive collations
            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because all these three collations additionally change the order of some ASCII punctuation characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h3. Case insensitive collations for ucs2, utf16, utf32
            These character sets have separate implementations and don't use the mentioned code. They'll be improved under terms of a separate task.

            h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The expected performance improvement on the pure ASCII data (for strings with octet length >= 4) is between 2 and 3 times (depending on the exact length and collation).

            Note, the changes must be done in a way not to bring any serious (more than 10%) slow down for:
            - strings with multi-byte characters
            - short strings 1..3 bytes long


            h2. Collations that won't be changed in this task

            h3. 8bit case insensitive collations
            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code and have their own implementations.

            h3. Three Chinese case insensitive collations
            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because all these three collations additionally change the order of some ASCII punctuation characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h3. Case insensitive collations for ucs2, utf16, utf32
            These character sets have separate implementations and don't use the mentioned code. They'll be improved under terms of a separate task.

            bar Alexander Barkov made changes -
            Description h2. Binary collations to be improved

            The following binary multi-byte collations (together with their _nopad_bin counterparts):
            - big5_bin
            - cp932_bin
            - eucjpms_bin
            - euckr_bin
            - gb2312_bin
            - gbk_bin
            - sjis_bin
            - ujis_bin
            - utf8mb3_bin
            - utf8mb4_bin

            can improve their performance if in this code in strcoll.ic:

            {code:cpp}
            static int
            MY_FUNCTION_NAME(strnncollsp)(CHARSET_INFO *cs __attribute__((unused)),
                                          const uchar *a, size_t a_length,
                                          const uchar *b, size_t b_length)
            {
              const uchar *a_end= a + a_length;
              const uchar *b_end= b + b_length;
              for ( ; ; )
              {
                int a_weight, b_weight, res;
                uint a_wlen= MY_FUNCTION_NAME(scan_weight)(&a_weight, a, a_end);
                ...
            {code}
            we catch pure ASCII and try to handle 4 or even 8 bytes in one iteration by loading string data into big-endian uint32 or uint64 numbers, then comparing these two numbers.


            h2. Case insensitive collations to be improved

            Additionally, the following case insensitive multibyte collations (and their _nopad_ci counteparts):
            - utf8mb3_general_ci
            - utf8mb3_general_mysql500_ci
            - utf8mb4_general_ci
            - cp932_japanese_ci
            - eucjpms_japanese_ci
            - euckr_korean_ci
            - sjis_japanese_ci
            - ujis_japanese_ci

            can use the same idea because for ASCII they perform only a trivial mapping from lower case Latin letters {{[a-z]}} to their upper case counterparts {{[A-Z]}}, and after this mapping done the comparison is performed in binary style. These collations can do the following on every iteration step:
            - Test the leading 4 or 8 bytes in the two strings for pure ASCII data and go to the old code on failure (to handle multi-byte characters)
            - Load the two strings into two uint32 or uint64 numbers
            - Perform bulk conversion of all bytes in the two numbers from {{[61..7A]}} to {{[41..5A]}} (i.e. from {{[a-z]}} to {{[A-Z]}})
            - Compare the numbers and return if they are different
            - Increment pointers to 4 or 8 and continue the loop

            Note, the exact way of bulk conversion of numbers to upper case is to be found out by the developer.

            h2. Requirements

            The expected performance improvement on the pure ASCII data (for strings with octet length >= 4) is between 2 and 3 times (depending on the exact length and collation).

            Note, the changes must be done in a way not to bring any serious (more than 10%) slow down for:
            - strings with multi-byte characters
            - short strings 1..3 bytes long


            h2. Collations that won't be changed in this task

            h3. 8bit case insensitive collations
            MariaDB has a number of 8bit case insensitive collations with trivial toupper mapping on the ASCII range. So they can get optimized in the same way. But we'll improve these collations under terms of a separate task because they don't use the mentioned code and have their own implementations.

            h3. Three Chinese case insensitive collations
            Also, under terms of this task we won't change the following multi-byte case insensitive collations (and their _nopad_ci counterparts):
            - big5_chinese_ci
            - gb2312_chinese_ci
            - gbk_chinese_ci

            because all these three collations additionally change the order of some ASCII punctuation characters:

            ||Weight||Character name||Character||
            |0x5B|U+005D RIGHT SQUARE BRACKET|]|
            |0x5C|U+005B LEFT SQUARE BRACKET|[|
            |0x5D|U+005C REVERSE SOLIDUS|\|

            So on the bulk conversion step they need more efforts and the proposed optimization may not be efficient. These collations will be improved later under terms of a separate task.

            h3. Case insensitive collations for ucs2, utf16, utf32
            These character sets have separate implementations and don't use the mentioned code. They'll be improved under terms of a separate task.


