Details
Type: Task
Status: Closed
Priority: Critical
Resolution: Fixed
Description
The UCA implementation uses an optimization for ASCII-compatible character sets (utf8mb3, utf8mb4), implemented in the function my_uca_level_booster_equal_prefix_length(). The idea is that if two strings share a simple prefix that is equal according to the collation, that prefix can be skipped quickly before the comparison enters the heavier, slower loop.
This optimization uses the member MY_UCA_LEVEL_BOOSTER::weight_strings_2bytes_to_1_or_2_weights.
"Simple" means that the prefix satisfies the following conditions:
- The data can be traversed two bytes at a time, i.e.:
  - Every two bytes are either two ASCII characters or one 2-byte character
  - There are no two-byte characters at an odd octet position
  - There are no ASCII contraction heads at an odd octet position
- Every two bytes produce one or two weights
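The current "skip equal simple prefix" idea can be illustrated with a minimal sketch. It is not the server code: the types, the function names (toy_2bytes_item, toy_table, toy_equal_prefix_length) and the toy collation are assumptions made for illustration, and the toy has no contractions. A table indexed by two bytes yields zero, one, or two weights, and the loop advances two bytes at a time while both strings stay "simple" and produce equal weights.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one entry of
   MY_UCA_LEVEL_BOOSTER::weight_strings_2bytes_to_1_or_2_weights.
   nweights == 0 marks a byte pair that is not "simple". */
typedef struct
{
  uint16_t weight[2];
  uint8_t nweights;
} toy_2bytes_item;

static toy_2bytes_item toy_table[0x10000];

/* Toy case-insensitive weights: letters, digits and space only. */
static uint16_t toy_ascii_weight(int c)
{
  if (c >= 'a' && c <= 'z')
    return (uint16_t) (c - 'a' + 'A');
  if ((c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') || c == ' ')
    return (uint16_t) c;
  return 0;
}

static void toy_table_init(void)
{
  int b0, b1;
  memset(toy_table, 0, sizeof(toy_table));
  for (b0 = 0; b0 < 0x80; b0++)
  {
    for (b1 = 0; b1 < 0x80; b1++)
    {
      toy_2bytes_item *item = &toy_table[(b0 << 8) | b1];
      uint16_t w0 = toy_ascii_weight(b0), w1 = toy_ascii_weight(b1);
      if (w0 && w1) /* two ASCII characters -> two weights */
      {
        item->weight[0] = w0;
        item->weight[1] = w1;
        item->nweights = 2;
      }
    }
  }
  /* One 2-byte character -> one weight.
     Assumption of this toy: U+00E4 (UTF-8 C3 A4) sorts like 'A'. */
  toy_table[0xC3A4].weight[0] = 'A';
  toy_table[0xC3A4].nweights = 1;
}

/* Return the number of leading bytes that form an equal simple prefix. */
static size_t toy_equal_prefix_length(const char *a, size_t alen,
                                      const char *b, size_t blen)
{
  const unsigned char *ua = (const unsigned char *) a;
  const unsigned char *ub = (const unsigned char *) b;
  size_t len = alen < blen ? alen : blen;
  size_t i;
  for (i = 0; i + 2 <= len; i += 2)
  {
    const toy_2bytes_item *ia = &toy_table[(ua[i] << 8) | ua[i + 1]];
    const toy_2bytes_item *ib = &toy_table[(ub[i] << 8) | ub[i + 1]];
    if (!ia->nweights || ia->nweights != ib->nweights ||
        ia->weight[0] != ib->weight[0] || ia->weight[1] != ib->weight[1])
      break; /* not simple, or weights differ: the slow loop takes over */
  }
  return i;
}
```

After toy_table_init(), comparing "abcdXY" with "ABCDxz" returns 4 in this toy: the first two byte pairs have equal weights and the third differs. A 2-byte character at an odd octet position (for example 'a' followed by C3 A4) stops the loop at once, because the pair it straddles has no table entry.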
Skipping the equal prefix works well when the compared strings are equal. However, it is much less effective when many different strings are compared (e.g. when sorting an array of different strings).
Let's change the "skip equal simple prefix" approach to "compare simple prefix". The member MY_UCA_LEVEL_BOOSTER::weight_strings_2bytes_to_1_or_2_weights already has almost everything needed for this.
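One possible reading of "compare simple prefix" is sketched below. This is an illustration under assumptions, not the proposed patch: the names (toy_2bytes_item, toy_table, toy_compare_simple_prefix) and the toy collation are hypothetical. The difference from skipping is that the function returns a final result as soon as two simple byte pairs yield different weights, so different strings may never reach the slow loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an entry of
   weight_strings_2bytes_to_1_or_2_weights;
   nweights == 0 marks a byte pair that is not "simple". */
typedef struct
{
  uint16_t weight[2];
  uint8_t nweights;
} toy_2bytes_item;

static toy_2bytes_item toy_table[0x10000];

/* Toy case-insensitive weight for ASCII letters; 0 for everything else. */
static uint16_t toy_letter_weight(int c)
{
  if (c >= 'a' && c <= 'z')
    return (uint16_t) (c - 'a' + 'A');
  if (c >= 'A' && c <= 'Z')
    return (uint16_t) c;
  return 0;
}

static void toy_table_init(void)
{
  int b0, b1;
  memset(toy_table, 0, sizeof(toy_table));
  for (b0 = 0; b0 < 0x80; b0++)
  {
    for (b1 = 0; b1 < 0x80; b1++)
    {
      toy_2bytes_item *item = &toy_table[(b0 << 8) | b1];
      uint16_t w0 = toy_letter_weight(b0), w1 = toy_letter_weight(b1);
      if (w0 && w1) /* two ASCII letters -> two weights */
      {
        item->weight[0] = w0;
        item->weight[1] = w1;
        item->nweights = 2;
      }
    }
  }
  /* Assumption of this toy: U+00E4 (UTF-8 C3 A4) has the one weight of 'A'. */
  toy_table[0xC3A4].weight[0] = 'A';
  toy_table[0xC3A4].nweights = 1;
}

/* Compare the simple prefixes of a and b.
   Returns -1 or 1 if the simple prefixes already decide the order.
   Returns 0 if they do not; *consumed is then the byte offset from which
   a slower general comparison would have to continue. */
static int toy_compare_simple_prefix(const char *a, size_t alen,
                                     const char *b, size_t blen,
                                     size_t *consumed)
{
  const unsigned char *ua = (const unsigned char *) a;
  const unsigned char *ub = (const unsigned char *) b;
  size_t len = alen < blen ? alen : blen;
  size_t i;
  int k;
  for (i = 0; i + 2 <= len; i += 2)
  {
    const toy_2bytes_item *ia = &toy_table[(ua[i] << 8) | ua[i + 1]];
    const toy_2bytes_item *ib = &toy_table[(ub[i] << 8) | ub[i + 1]];
    if (!ia->nweights || !ib->nweights)
      break; /* not simple: leave it to the slow loop */
    if (ia->weight[0] == ib->weight[0] && ia->nweights != ib->nweights)
      break; /* weight streams get out of step: leave it to the slow loop */
    for (k = 0; k < 2; k++)
    {
      if (ia->weight[k] != ib->weight[k])
      {
        *consumed = i;
        return ia->weight[k] < ib->weight[k] ? -1 : 1;
      }
    }
  }
  *consumed = i;
  return 0;
}
```

In this toy, comparing "abcdXY" with "ABCDxz" returns a negative result directly, because the third byte pair differs in its second weight. Equal strings still return 0 with the whole simple prefix consumed, so the equal-strings case keeps the benefit of the old approach.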
After the changes are done, the implementer should use benchmarks to make sure that the new version is really faster.
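A rough way to measure the sorting scenario outside the server is sketched below. It is only a standalone harness under assumptions: toy_bench_sort and toy_cmp_strcmp are hypothetical names, and strcmp() stands in for the collation comparison, which a real benchmark would call in both its old and new versions on the same data.

```c
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef int (*toy_cmp_fn)(const void *, const void *);

/* Placeholder comparator; a real benchmark would plug in the old and
   the new collation comparison here and run both on the same input. */
static int toy_cmp_strcmp(const void *a, const void *b)
{
  return strcmp(*(const char *const *) a, *(const char *const *) b);
}

/* Sort a fresh copy of the input `rounds` times.
   Returns the CPU time spent in seconds, or -1.0 if allocation fails. */
static double toy_bench_sort(const char **strings, size_t n,
                             toy_cmp_fn cmp, int rounds)
{
  const char **work = malloc(n * sizeof(*work));
  clock_t start, end;
  int r;
  if (!work)
    return -1.0;
  start = clock();
  for (r = 0; r < rounds; r++)
  {
    /* Restore the unsorted order so every round does the same work */
    memcpy(work, strings, n * sizeof(*work));
    qsort(work, n, sizeof(*work), cmp);
  }
  end = clock();
  free(work);
  return (double) (end - start) / CLOCKS_PER_SEC;
}
```

Running the harness once per comparator over a large array of different strings, and once over an array of equal strings, would cover both cases mentioned in the description.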