[MDEV-27266] Improve UCA collation performance for utf8mb3 and utf8mb4 Created: 2021-12-15 Updated: 2023-10-03 Resolved: 2022-08-10 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Character Sets |
| Fix Version/s: | 10.10.1 |
| Type: | Task | Priority: | Critical |
| Reporter: | Alexander Barkov | Assignee: | Alexander Barkov |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | Preview_10.10 | ||
| Issue Links: |
|
| Description |
|
A similar style of improvement to the one made for simple collations in MDEV-26572 can be done for UCA collations for utf8mb3 and utf8mb4. It is hard to handle 4 or 8 bytes at a time, because UCA is much more complex than the simple collations improved in MDEV-26572. However, it is possible to handle at least 2 bytes at a time. This will improve performance for:
Performance improvement, level 1:
For every byte pair [00..FF][00..FF] which:
let's store weights in a new separate array of 64K elements of a new data type MY_UCA_2BYTES_ITEM, defined as follows:
so during scanner_next() we can scan two bytes at a time. Byte pairs that do not match conditions a-c should be marked in this array as not applicable for optimization, so they can be scanned as before.
Performance improvement, level 2:
For every byte pair which is applicable for the level 1 optimization, and which produces only one or two weights, let's store the weights in one more array of 64K elements of a new data type MY_UCA_WEIGHT2, defined as follows:
So at the beginning of strnncoll*() we can skip equal prefixes using an even more efficient loop. This loop consumes two bytes at a time and keeps scanning while the two bytes on both sides produce weight strings of equal length (i.e. one weight on both sides, or two weights on both sides). This will allow efficient comparison of:
Other Unicode character sets
Under the terms of this patch we'll improve only utf8mb3 and utf8mb4. Other Unicode character sets (ucs2, utf16le, utf16, utf32) can also reuse the same optimization; however, this will need some additional code tweaks. Let's do it later under a separate task. |
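As an illustration of the two-byte lookup and prefix-skipping ideas in this description, here is a minimal C sketch. It is not the patch itself: `pair_weights_t`, `pair_table`, `pair_table_init_toy()`, `pair_lookup()` and `skip_equal_prefix()` are hypothetical names, the struct layout is an assumption rather than the real MY_UCA_2BYTES_ITEM or MY_UCA_WEIGHT2 definition, and the weights are toy values for ASCII letters only, not real UCA weights.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-byte-pair entry (an assumption, NOT the real
   MY_UCA_WEIGHT2 layout): up to two weights plus a count.
   nweights == 0 marks the pair as "not applicable for optimization". */
typedef struct
{
  uint16_t weight[2];
  uint8_t nweights;
} pair_weights_t;

/* 64K entries, indexed by (first_byte << 8) | second_byte */
static pair_weights_t pair_table[0x10000];

static int toy_is_letter(unsigned c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* Toy case-insensitive weight for an ASCII letter (not a UCA weight) */
static uint16_t toy_weight(unsigned c)
{
  return (uint16_t) (0x1000 + ((c | 0x20) - 'a'));
}

/* Fill the table: only pairs of two ASCII letters take the fast path,
   every other pair stays marked as not applicable. */
void pair_table_init_toy(void)
{
  memset(pair_table, 0, sizeof(pair_table));
  for (unsigned b0= 0; b0 < 256; b0++)
  {
    for (unsigned b1= 0; b1 < 256; b1++)
    {
      if (toy_is_letter(b0) && toy_is_letter(b1))
      {
        pair_weights_t *item= &pair_table[(b0 << 8) | b1];
        item->weight[0]= toy_weight(b0);
        item->weight[1]= toy_weight(b1);
        item->nweights= 2;
      }
    }
  }
}

/* Level 1 idea: one lookup per two bytes.
   Returns NULL when the caller must fall back to the generic scanner. */
const pair_weights_t *pair_lookup(unsigned char b0, unsigned char b1)
{
  const pair_weights_t *item= &pair_table[((unsigned) b0 << 8) | b1];
  return item->nweights ? item : NULL;
}

/* Level 2 idea: skip the equal prefix two bytes at a time.
   Returns the number of bytes whose weights are known to be equal;
   the caller continues from that offset with the generic code. */
size_t skip_equal_prefix(const unsigned char *a, size_t alen,
                         const unsigned char *b, size_t blen)
{
  size_t minlen= alen < blen ? alen : blen;
  size_t pos= 0;
  while (pos + 2 <= minlen)
  {
    const pair_weights_t *wa= &pair_table[(a[pos] << 8) | a[pos + 1]];
    const pair_weights_t *wb= &pair_table[(b[pos] << 8) | b[pos + 1]];
    /* Stop on a non-optimizable pair or on weight strings of
       different length */
    if (wa->nweights == 0 || wa->nweights != wb->nweights)
      break;
    /* Stop on the first weight difference (unused slots are zero) */
    if (wa->weight[0] != wb->weight[0] || wa->weight[1] != wb->weight[1])
      break;
    pos+= 2;
  }
  return pos;
}
```

With the toy table, calling skip_equal_prefix() on "HELLO world" and "hello WORLD" returns 4: the pairs HE/he and LL/ll carry equal toy weights, while the pair made of "O" and a space is marked as not applicable, so a caller would continue from byte 4 with the generic scanner. A real table would additionally have to exclude byte pairs that take part in contractions or otherwise fail the applicability conditions listed in the description.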
| Comments |
| Comment by Alexander Barkov [ 2021-12-15 ] |
Benchmarking
JIRA does not allow characters outside of the BMP, so the non-BMP character was replaced with Z.
utf8mb4_general_ci (not changed - just for reference)
Summary
Old utf8mb4_unicode_ci (before the patch)
Summary
New utf8mb4_unicode_ci (after the patch)
Summary
Full summary
utf8mb4_general_ci - old utf8mb4_unicode_ci - new utf8mb4_unicode_ci
Observations:
Performance significantly improved:
Performance slightly degraded:
The slight slow-down on 3-byte and 4-byte characters is expected: the code now first tries the optimized path, fails, and then falls back to the old non-optimized path. |
| Comment by Sergei Golubchik [ 2022-06-18 ] |
|
It's in this branch: preview-10.10-uca14. |
| Comment by Lena Startseva [ 2022-07-13 ] |
|
Environment:
Summary
On my laptop the result is a little less optimistic, but in line with expectations. |
| Comment by Lena Startseva [ 2022-08-08 ] |
|
Ok to push |
| Comment by Rick James [ 2023-05-17 ] |
|
As for "other" charsets (ucs2, etc.), I suggest that they can be very low on the priority list. I have not heard of anyone creating a table with them. Importing is an unrelated topic; I would encourage converting to utf8mb4 during import, thereby avoiding the need for collation speedups. |