Details
-
Task
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
Description
Recently in MDEV-26572 we significantly improved performance of some simple multi-byte collations on the ASCII range. The idea of the improvement was to handle multiple ASCII characters (4 or 8) at the same time.
Similar style of improvement can be done for UCA collations for utf8mb3 and utf8mb4.
It's hard to handle 4 or 8 bytes at the same time, because UCA is much more complex than simple collations improved in MDEV-26572. However, it's possible to handle at least 2 bytes at the same time. It will improve performance for:
- 2-byte sequences representing two consequent ASCII characters
- 2-byte sequences representing a single 2-byte character (such as accented Latin letters, Greek, Cyrillic, Armenian, Hebrew, Arabic).
Performance improvement, level 1:
For every bytes pair [00..FF][00..FF] which:
- a. consists of two ASCII characters or makes a well-formed two-byte character
- b. whose total weight string fits into 4 weights (concatenated weight string in case of two ASCII characters, or a single weight string in case of a two-byte character)
- c. whose weight is context independent (i.e. does not depend on contractions or previous context pairs)
let's store weights in a new separate array of 64K elements of a new data type MY_UCA_2BYTES_ITEM, defined as follows:
#define MY_UCA_2BYTES_MAX_WEIGHT_SIZE (4+1) /* Including 0 terminator */
|
 |
typedef struct my_uca_2bytes_item_t |
{
|
uint16 weight[MY_UCA_2BYTES_MAX_WEIGHT_SIZE];
|
} MY_UCA_2BYTES_ITEM;
|
so during scanner_next() we can scan two bytes at a time. Byte pairs that do not match the conditions a-c should be marked in this array as not applicable for optimization, so they can be scanned as before.
Performance improvement, level 2:
For every byte pair which is applicable for optimization in #1, and which produces only one or two weights, let's store weights in one more array of 64K elements of a new data type MY_UCA_WEIGHT2, defined as follows:
typedef struct my_uca_weight2_t |
{
|
uint16 weight[2];
|
} MY_UCA_WEIGHT2;
|
So in the beginning of strnncoll*() we can skip equal prefixes using an even more efficient loop. This loop will consume two bytes at a time. The loop will scan while the two bytes on both sides produce weight strings of equal length (i.e. one weight on both sides, or two weights on both sides).
This will allow to compare efficiently:
- Context independent sequences consisting of two ASCII characters
- Context independent 2-byte characters
- Contractions consisting of two ASCII characters, e.g. Czech "ch".
- Some tricky cases: "ss" vs "SHARP S" ("ss" produces two weights, 0xC39F also produces two weights)
Other Unicode character sets
Under terms of this patch we'll improve only utf8mb3 and utf8mb4. Other Unicode character sets (ucs2, utf16le, utf16, utf32) can also reuse the same optimization, however this will need some additional code tweaks. Let's do it later under terms of a separate task later.
Attachments
Issue Links
- blocks
-
MDEV-25829 Change default Unicode collation to uca1400_ai_ci
- Closed
- relates to
-
MDEV-26572 Improve simple multibyte collation performance on the ASCII range
- Closed
-
MDEV-32340 Improve performance of my_uca_level_booster_simple_prefix_cmp
- Open
-
MDEV-27009 Add UCA-14.0.0 collations
- Closed
-
MDEV-27265 Improve contraction performance in UCA collations
- Closed