Details
-
Task
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
Description
Currently, when searching for a contraction, the code in the function my_uca_contraction_weight() simply iterates through all items in a MY_CONTRACTIONS array.
So the performance depends on where the searched contraction resides in the list:
- it is fast for contractions which are in the very beginning of the list
- it is slow for contractions which are in the end of the list
Some collations can have many contractions. For example, utf8mb4_thai_520_w2 has 231 contractions. Searching for contractions that are in the end of the list can be really slow.
In the future versions of UCA collations we're going to enable built-in DUCET contractions. It has near 900 built-in contractions. It's important to have this problem fixed.
Benchmarking utf8mb4_thai_520_w2
Let's check contractions from the array thai_contractions in ctype-uca.c.
This is a relevant snapshop from ctype-uca.c:
/* thai_contraction[0] */ |
{ /* <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> */ |
{ 0x0E40, 0x0E01, 0 },
|
{ 0x1F70, 0x1FAA, 0 },
|
FALSE
|
},
|
|
...
|
|
/* thai_contraction[7] */ |
{ /* <THAI CHARACTER SARA O, THAI CHARACTER KHO KHAI> */ |
{ 0x0E42, 0x0E02, 0 },
|
{ 0x1F71, 0x1FAC, 0 },
|
FALSE
|
},
|
|
...
|
|
/* thai_contraction[229] */ |
{ /* <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER HO NOKHUK> */ |
{ 0x0E44, 0x0E2E, 0 },
|
{ 0x1F9D, 0x1FAE, 0 },
|
FALSE
|
},
|
Now let's check the pefromance of the above three contractions:
- The very first one
- The contraction on the 7th position in the array
- The contraction on the 229th position (the previous before the very last one)
The SQL code below intentionally adds digits 1 and 2 in the end of the left and right strings, to make the two strings primarily different, to avoid the comparison function going to the secondary level.
SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2; |
DO BENCHMARK(10000000,strcmp('เก1','เก2')); |
Query OK, 0 rows affected (0.439 sec) |
SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2; |
DO BENCHMARK(10000000,strcmp('โข1','โข2')); |
Query OK, 0 rows affected (0.940 sec) |
SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2; |
DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2')); |
Query OK, 0 rows affected (15.805 sec) |
Summary:
Ord Contraction Time
|
--- ----------- ----
|
0 เก 0.439
|
7 โข 0.940
|
229 ไฮ 15.805
|
Notice:
- The contraction on the 7th position is twice slower than the very first one.
- The contraction on the 229th position is 36 times slower than the very fitst one!
Benchmarking utf8mb4_croatian_ci
This collation has 9th contractions. Let's check their performance in the order of their appearance in the tailoring:
SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci; |
DO BENCHMARK(10000000,strcmp('dž1','dž2')); |
DO BENCHMARK(10000000,strcmp('Dž1','Dž2')); |
DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2')); |
DO BENCHMARK(10000000,strcmp('lj1','lj2')); |
DO BENCHMARK(10000000,strcmp('Lj1','Lj2')); |
DO BENCHMARK(10000000,strcmp('LJ1','LJ2')); |
DO BENCHMARK(10000000,strcmp('nj1','nj2')); |
DO BENCHMARK(10000000,strcmp('Nj1','Nj2')); |
DO BENCHMARK(10000000,strcmp('NJ1','NJ2')); |
This is the output:
MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž1','dž2'));
|
Query OK, 0 rows affected (0.397 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));
|
Query OK, 0 rows affected (0.452 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));
|
Query OK, 0 rows affected (0.500 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj1','lj2'));
|
Query OK, 0 rows affected (0.535 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));
|
Query OK, 0 rows affected (0.593 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));
|
Query OK, 0 rows affected (0.647 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj1','nj2'));
|
Query OK, 0 rows affected (0.700 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));
|
Query OK, 0 rows affected (0.753 sec)
|
|
MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));
|
Query OK, 0 rows affected (0.811 sec)
|
The summary:
Ord Contraction Time
|
--- ----------- ----
|
0 dž 0.397
|
1 Dž 0.452
|
2 DŽ 0.500
|
3 lj 0.535
|
4 Lj 0.593
|
5 LJ 0.647
|
6 nj 0.700
|
7 Nj 0.753
|
8 NJ 0.811
|
Proposal
The code should be fixed to use a hash table for contractions instead of the simple list iteration.
Attachments
Issue Links
- blocks
-
MDEV-25829 Change default Unicode collation to uca1400_ai_ci
- Closed
- relates to
-
MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4
- Closed
-
MDEV-32340 Improve performance of my_uca_level_booster_simple_prefix_cmp
- Open
-
MDEV-27009 Add UCA-14.0.0 collations
- Closed