[MDEV-27265] Improve contraction performance in UCA collations Created: 2021-12-15  Updated: 2023-10-03  Resolved: 2022-08-10

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Fix Version/s: 10.10.1

Type: Task Priority: Critical
Reporter: Alexander Barkov Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: Preview_10.10, Preview_10.8

Issue Links:
Blocks
blocks MDEV-25829 Change default collation to utf8mb4_u... In Review
Relates
relates to MDEV-27266 Improve UCA collation performance for... Closed
relates to MDEV-32340 Improve performance of my_uca_level_b... Open
relates to MDEV-27009 Add UCA-14.0.0 collations Closed

 Description   

Currently, when searching for a contraction, the code in the function my_uca_contraction_weight() simply iterates through all items in a MY_CONTRACTIONS array.
So the performance depends on where the searched contraction resides in the list:

  • it is fast for contractions which are in the very beginning of the list
  • it is slow for contractions which are in the end of the list

Some collations can have many contractions. For example, utf8mb4_thai_520_w2 has 231 contractions. Searching for contractions that are in the end of the list can be really slow.

In the future versions of UCA collations we're going to enable built-in DUCET contractions. It has near 900 built-in contractions. It's important to have this problem fixed.

Benchmarking utf8mb4_thai_520_w2

Let's check contractions from the array thai_contractions in ctype-uca.c.

This is a relevant snapshop from ctype-uca.c:

  /* thai_contraction[0] */
  { /* <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> */
    { 0x0E40, 0x0E01, 0 },
    { 0x1F70, 0x1FAA, 0 },
    FALSE
  },
 
   ...
 
  /* thai_contraction[7] */
  { /* <THAI CHARACTER SARA O, THAI CHARACTER KHO KHAI> */
    { 0x0E42, 0x0E02, 0 },
    { 0x1F71, 0x1FAC, 0 },
    FALSE
  },
 
  ...
 
  /* thai_contraction[229] */
  { /* <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER HO NOKHUK> */
    { 0x0E44, 0x0E2E, 0 },
    { 0x1F9D, 0x1FAE, 0 },
    FALSE
  },

Now let's check the pefromance of the above three contractions:

  • The very first one
  • The contraction on the 7th position in the array
  • The contraction on the 229th position (the previous before the very last one)

The SQL code below intentionally adds digits 1 and 2 in the end of the left and right strings, to make the two strings primarily different, to avoid the comparison function going to the secondary level.

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
DO BENCHMARK(10000000,strcmp('เก1','เก2'));
Query OK, 0 rows affected (0.439 sec)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
DO BENCHMARK(10000000,strcmp('โข1','โข2'));
Query OK, 0 rows affected (0.940 sec)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
Query OK, 0 rows affected (15.805 sec)

Summary:

Ord Contraction Time
--- ----------- ----
0   เก           0.439
7   โข           0.940
229 ไฮ           15.805

Notice:

  • The contraction on the 7th position is twice slower than the very first one.
  • The contraction on the 229th position is 36 times slower than the very fitst one!

Benchmarking utf8mb4_croatian_ci

This collation has 9th contractions. Let's check their performance in the order of their appearance in the tailoring:

SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
DO BENCHMARK(10000000,strcmp('dž1','dž2'));
DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));
DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));
DO BENCHMARK(10000000,strcmp('lj1','lj2'));
DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));
DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));
DO BENCHMARK(10000000,strcmp('nj1','nj2'));
DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));
DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));

This is the output:

MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž1','dž2'));
Query OK, 0 rows affected (0.397 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));
Query OK, 0 rows affected (0.452 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));
Query OK, 0 rows affected (0.500 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj1','lj2'));
Query OK, 0 rows affected (0.535 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));
Query OK, 0 rows affected (0.593 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));
Query OK, 0 rows affected (0.647 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj1','nj2'));
Query OK, 0 rows affected (0.700 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));
Query OK, 0 rows affected (0.753 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));
Query OK, 0 rows affected (0.811 sec)

The summary:

Ord  Contraction Time
---  ----------- ----
0    dž          0.397
1    Dž          0.452
2    DŽ          0.500
3    lj          0.535
4    Lj          0.593
5    LJ          0.647
6    nj          0.700
7    Nj          0.753
8    NJ          0.811

Proposal

The code should be fixed to use a hash table for contractions instead of the simple list iteration.



 Comments   
Comment by Alexander Barkov [ 2021-12-15 ]

Benchmarking

Scripts

-- Warning up
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
DO BENCHMARK(10000000,strcmp('xxxx','xxxx'));
 
-- Benchmarking
SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
DO BENCHMARK(10000000,strcmp('เก1','เก2'));
DO BENCHMARK(10000000,strcmp('โข1','โข2'));
DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));

-- Warning up
SET NAMES utf8mb4 COLLATE utf8mb4_general_ci;
DO BENCHMARK(10000000,strcmp('xxxx','xxxx'));
 
-- Benchmarking
SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
DO BENCHMARK(10000000,strcmp('dž','dž'));
DO BENCHMARK(10000000,strcmp('Dž','Dž'));
DO BENCHMARK(10000000,strcmp('DŽ','DŽ'));
DO BENCHMARK(10000000,strcmp('lj','lj'));
DO BENCHMARK(10000000,strcmp('Lj','Lj'));
DO BENCHMARK(10000000,strcmp('LJ','LJ'));
DO BENCHMARK(10000000,strcmp('nj','nj'));
DO BENCHMARK(10000000,strcmp('Nj','Nj'));
DO BENCHMARK(10000000,strcmp('NJ','NJ'));

Old (before the patch)

MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
Query OK, 0 rows affected (0.001 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('เก1','เก2'));
Query OK, 0 rows affected (0.432 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('โข1','โข2'));
Query OK, 0 rows affected (0.884 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
Query OK, 0 rows affected (15.566 sec)

MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
Query OK, 0 rows affected (0.001 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž','dž'));
Query OK, 0 rows affected (0.366 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž','Dž'));
Query OK, 0 rows affected (0.425 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ','DŽ'));
Query OK, 0 rows affected (0.472 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj','lj'));
Query OK, 0 rows affected (0.509 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj','Lj'));
Query OK, 0 rows affected (0.561 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ','LJ'));
Query OK, 0 rows affected (0.622 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj','nj'));
Query OK, 0 rows affected (0.673 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj','Nj'));
Query OK, 0 rows affected (0.727 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ','NJ'));
Query OK, 0 rows affected (0.786 sec)

New (after the patch)

MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;
Query OK, 0 rows affected (0.001 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('เก1','เก2'));
Query OK, 0 rows affected (0.531 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('โข1','โข2'));
Query OK, 0 rows affected (0.530 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));
Query OK, 0 rows affected (0.531 sec)

MariaDB [test]> SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;
Query OK, 0 rows affected (0.001 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž','dž'));
Query OK, 0 rows affected (0.503 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž','Dž'));
Query OK, 0 rows affected (0.507 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ','DŽ'));
Query OK, 0 rows affected (0.501 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj','lj'));
Query OK, 0 rows affected (0.154 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj','Lj'));
Query OK, 0 rows affected (0.155 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ','LJ'));
Query OK, 0 rows affected (0.152 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj','nj'));
Query OK, 0 rows affected (0.152 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj','Nj'));
Query OK, 0 rows affected (0.153 sec)
 
MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ','NJ'));
Query OK, 0 rows affected (0.151 sec)

Summary

utf8mb4_thai_520_w2

Ord  A    B    Old time  New Time  % New/Old
---  ---  ---  --------  --------  ---------
0    เก1   เก2     0.432     0.531       122
7    โข1   โข2     0.884     0.530        59
229  ไฮ1   ไฮ2    15.566     0.531         3

utf8mb4_croatian_ci

Ord A   B     Old time New time  % New/Old
--- --- ---   -------- --------  ---------
0   dž  dž       0.366    0.503        137
1   Dž  Dž       0.425    0.507        119
2   DŽ  DŽ       0.472    0.501        106
3   lj  lj       0.509    0.154         30
4   Lj  Lj       0.561    0.155         27
5   LJ  LJ       0.622    0.152         24
6   nj  nj       0.673    0.152         22
7   Nj  Nj       0.727    0.153         21
8   NJ  NJ       0.786    0.151         19

Observation:

  • After the change, the contraction search time does not depend on the contraction ordinal position in the tailoring (it only depends on "ASCII or not ASCII")
  • Searching for the very last (but one) contraction in utf8mb4_thai_520_w2 now takes 3% of the old search time (30 times faster!)
  • Searching for the very last contraction in utf8mb4_croatian_ci now takes 19% of the old search time (5 times faster)
Comment by Sergei Golubchik [ 2022-06-18 ]

It's in this branch: preview-10.10-uca14.

Comment by Lena Startseva [ 2022-07-11 ]

Environment:

Linux  5.13.0-52-generic #59-Ubuntu SMP  x86_64 x86_64 x86_64 GNU/Linux
memory         64GiB System Memory
processor      11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz

Summary
utf8mb4_thai_520_w2

Ord A B Old time New Time % New/Old
0 เก1 เก2 1.437 1.537 107
7 โข1 โข2 2.138 1.517 71
229 ไฮ1 ไฮ2 28.365 1.506 5

utf8mb4_croatian_ci

Ord A B Old time New Time % New/Old
0 1.131 1.363 120
1 1.220 1.318 108
2 1.337 1.321 99
3 lj lj 1.407 0.558 40
4 Lj Lj 1.499 0.558 37
5 LJ LJ 1.598 0.562 35
6 nj nj 1.719 0.561 33
7 Nj Nj 1.808 0.559 31
8 NJ NJ 1.906 0.561 29

On my laptop searching for a contraction has showed a slightly worse result, but the improvements are still observable.

I also checked new collation uca1400_a(i\s)_c(i\s). For thai is everything is good:

Ord A B ai_ci as_ci ai_cs as_cs
0 เก1 เก2 1.452 1.450 1.450 1.454
7 โข1 โข2 1.442 1.450 1.447 1.442
229 ไฮ1 ไฮ2 1.428 1.433 1.432 1.435

but for uca1400_croatian_a(i\s)_c(i\s) the time increses when accent sensitive or case sensitive is used:

Ord A B ai_ci as_ci ai_cs as_cs
0 1.216 2.144 2.144 3.193
8 NJ NJ 0.511 2.090 2.070 3.107

This doesn't look like a defect of the current task, but it might be improved in the future.

Comment by Lena Startseva [ 2022-08-08 ]

Ok to push

Generated at Thu Feb 08 09:51:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.