[MDEV-27265] Improve contraction performance in UCA collations - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Fix Version/s: 10.10.1
Component/s: Character Sets
Labels:
- Preview_10.10
- Preview_10.8

Description

Currently, when searching for a contraction, the code in the function my_uca_contraction_weight() simply iterates through all items in a MY_CONTRACTIONS array.
So the performance depends on where the searched contraction resides in the list:

it is fast for contractions which are in the very beginning of the list
it is slow for contractions which are in the end of the list

Some collations can have many contractions. For example, utf8mb4_thai_520_w2 has 231 contractions. Searching for contractions that are in the end of the list can be really slow.

In the future versions of UCA collations we're going to enable built-in DUCET contractions. It has near 900 built-in contractions. It's important to have this problem fixed.

Benchmarking utf8mb4_thai_520_w2

Let's check contractions from the array thai_contractions in ctype-uca.c.

This is a relevant snapshop from ctype-uca.c:

  /* thai_contraction[0] */

  { /* <THAI CHARACTER SARA E, THAI CHARACTER KO KAI> */

    { 0x0E40, 0x0E01, 0 },

    { 0x1F70, 0x1FAA, 0 },

    FALSE

},

...

  /* thai_contraction[7] */

  { /* <THAI CHARACTER SARA O, THAI CHARACTER KHO KHAI> */

    { 0x0E42, 0x0E02, 0 },

    { 0x1F71, 0x1FAC, 0 },

    FALSE

},

...

  /* thai_contraction[229] */

  { /* <THAI CHARACTER SARA AI MAIMALAI, THAI CHARACTER HO NOKHUK> */

    { 0x0E44, 0x0E2E, 0 },

    { 0x1F9D, 0x1FAE, 0 },

    FALSE

},

Now let's check the pefromance of the above three contractions:

The very first one
The contraction on the 7th position in the array
The contraction on the 229th position (the previous before the very last one)

The SQL code below intentionally adds digits 1 and 2 in the end of the left and right strings, to make the two strings primarily different, to avoid the comparison function going to the secondary level.

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;

DO BENCHMARK(10000000,strcmp('เก1','เก2'));

Query OK, 0 rows affected (0.439 sec)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;

DO BENCHMARK(10000000,strcmp('โข1','โข2'));

Query OK, 0 rows affected (0.940 sec)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;

DO BENCHMARK(10000000,strcmp('ไฮ1','ไฮ2'));

Query OK, 0 rows affected (15.805 sec)

Summary:

Ord Contraction Time

--- ----------- ----

0   เก           0.439

7   โข           0.940

229 ไฮ           15.805

Notice:

The contraction on the 7th position is twice slower than the very first one.
The contraction on the 229th position is 36 times slower than the very fitst one!

Benchmarking utf8mb4_croatian_ci

This collation has 9th contractions. Let's check their performance in the order of their appearance in the tailoring:

SET NAMES utf8mb4 COLLATE utf8mb4_croatian_ci;

DO BENCHMARK(10000000,strcmp('dž1','dž2'));

DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));

DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));

DO BENCHMARK(10000000,strcmp('lj1','lj2'));

DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));

DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));

DO BENCHMARK(10000000,strcmp('nj1','nj2'));

DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));

DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));

This is the output:

MariaDB [test]> DO BENCHMARK(10000000,strcmp('dž1','dž2'));

Query OK, 0 rows affected (0.397 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('Dž1','Dž2'));

Query OK, 0 rows affected (0.452 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('DŽ1','DŽ2'));

Query OK, 0 rows affected (0.500 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('lj1','lj2'));

Query OK, 0 rows affected (0.535 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('Lj1','Lj2'));

Query OK, 0 rows affected (0.593 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('LJ1','LJ2'));

Query OK, 0 rows affected (0.647 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('nj1','nj2'));

Query OK, 0 rows affected (0.700 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('Nj1','Nj2'));

Query OK, 0 rows affected (0.753 sec)

MariaDB [test]> DO BENCHMARK(10000000,strcmp('NJ1','NJ2'));

Query OK, 0 rows affected (0.811 sec)

The summary:

Ord  Contraction Time

---  ----------- ----

0    dž          0.397

1    Dž          0.452

2    DŽ          0.500

3    lj          0.535

4    Lj          0.593

5    LJ          0.647

6    nj          0.700

7    Nj          0.753

8    NJ          0.811

Proposal

The code should be fixed to use a hash table for contractions instead of the simple list iteration.

Attachments

Issue Links

blocks

MDEV-25829 Change default Unicode collation to uca1400_ai_ci

Closed

relates to

MDEV-27266 Improve UCA collation performance for utf8mb3 and utf8mb4

Closed

MDEV-32340 Improve performance of my_uca_level_booster_simple_prefix_cmp

Open

MDEV-27009 Add UCA-14.0.0 collations

Closed

Activity

People

Assignee:: Alexander Barkov

Reporter:: Alexander Barkov

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2021-12-15 09:23

Updated:: 2023-10-03 07:26

Resolved:: 2022-08-10 19:48

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server