[MDEV-17474] Change Unicode collation implementation from "handler" to "inline" style

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Fix Version/s: 10.4.0
Component/s: Character Sets
Labels: None

Description

axel found a performance bottleneck in the Unicode collation implementation (~~MDEV-16413~~).

Under this task, we'll reorganize the code implementing Unicode collations (such as utf8_unicode_ci) into a new style that replaces virtual function calls with inlined code:

The old style

There is one copy of every collation function (strnncoll(), strnncollsp(), hash_sort(), strnxfrm()), e.g.:

static int my_strnncollsp_uca(CHARSET_INFO *cs,
                              my_uca_scanner_handler *scanner_handler,
                              const uchar *s, size_t slen,
                              const uchar *t, size_t tlen)

Character set-specific routines are passed in scanner_handler
scanner_handler->next() is called virtually (i.e. via a pointer to a function)
scanner_handler->next() itself calls cs->cset->mb_wc() virtually
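
To make the two levels of indirection concrete, here is a minimal sketch of the handler style. All demo_* names, the single-byte decoder and the toy weights are hypothetical stand-ins, not the server's actual types or weight tables.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;

/* Hypothetical stand-in for the character set: decoding goes through
   a function pointer, like cs->cset->mb_wc(). */
typedef struct demo_charset
{
  int (*mb_wc)(unsigned long *wc, const uchar *s, const uchar *e);
} demo_charset;

typedef struct demo_scanner
{
  const demo_charset *cs;
  const uchar *pos, *end;
} demo_scanner;

/* Hypothetical stand-in for my_uca_scanner_handler. */
typedef struct demo_scanner_handler
{
  int (*next)(demo_scanner *scanner);   /* next weight, or -1 at end */
} demo_scanner_handler;

/* Toy single-byte decoder: returns bytes consumed, 0 at end of input. */
static int demo_mb_wc_8bit(unsigned long *wc, const uchar *s, const uchar *e)
{
  if (s >= e)
    return 0;
  *wc = s[0];
  return 1;
}

/* Toy weight: ASCII letters compare case-insensitively. */
static int demo_weight(unsigned long wc)
{
  return (wc >= 'a' && wc <= 'z') ? (int) wc - 32 : (int) wc;
}

static int demo_scanner_next_any(demo_scanner *scanner)
{
  unsigned long wc;
  /* Indirect call #2: the decoder is reached through a pointer. */
  int len= scanner->cs->mb_wc(&wc, scanner->pos, scanner->end);
  if (len <= 0)
    return -1;
  scanner->pos+= len;
  return demo_weight(wc);
}

/* One shared comparison routine for all character sets. */
static int demo_strnncoll_handler(const demo_charset *cs,
                                  const demo_scanner_handler *handler,
                                  const uchar *s, size_t slen,
                                  const uchar *t, size_t tlen)
{
  demo_scanner sscanner= { cs, s, s + slen };
  demo_scanner tscanner= { cs, t, t + tlen };
  assert(cs && handler);
  for (;;)
  {
    /* Indirect call #1: once per character, per string. */
    int sw= handler->next(&sscanner);
    int tw= handler->next(&tscanner);
    if (sw != tw)
      return sw < tw ? -1 : 1;
    if (sw < 0)
      return 0;                         /* both strings ended */
  }
}

static const demo_charset demo_cs_8bit= { demo_mb_wc_8bit };
static const demo_scanner_handler demo_handler_any= { demo_scanner_next_any };
```

Calling demo_strnncoll_handler(&demo_cs_8bit, &demo_handler_any, ...) costs two pointer-based calls per character, and the compiler cannot inline either of them into the comparison loop.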

The new style

There are multiple implementations of the functions, one function per character set.
There is a shared file ctype-uca.ic, which is included multiple times, once per character set.

Character-set-specific information is passed in via macros:

#include "ctype-utf8.h"
#define MY_FUNCTION_NAME(x)   my_uca_ ## x ## _utf8mb3
#define MY_MB_WC(scanner, wc, beg, end) (my_mb_wc_utf8mb3_quick(wc, beg, end))
#define MY_LIKE_RANGE my_like_range_mb
#include "ctype-uca.ic"

There are inline my_mb_wc_CSNAME_quick() implementations in new header files: ctype-utf8.h, ctype-ucs2.h, ctype-utf16.h, ctype-utf32.h
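
The idea can be approximated in a single file. In the sketch below a function-defining macro plays the role of the repeatedly included ctype-uca.ic, so this is an illustration of the technique rather than the actual file layout. All demo_* names, the two toy decoders and the toy weights are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;

/* Hypothetical inline decoders, standing in for my_mb_wc_CSNAME_quick().
   Each returns the number of bytes consumed, or 0 at end of input. */
static inline int demo_mb_wc_8bit_quick(unsigned long *wc,
                                        const uchar *s, const uchar *e)
{
  if (s >= e)
    return 0;
  *wc= s[0];
  return 1;
}

static inline int demo_mb_wc_ucs2_quick(unsigned long *wc,
                                        const uchar *s, const uchar *e)
{
  if (e - s < 2)
    return 0;
  *wc= ((unsigned long) s[0] << 8) | s[1];
  return 2;
}

/* Toy weight: ASCII letters compare case-insensitively. */
static inline int demo_weight(unsigned long wc)
{
  return (wc >= 'a' && wc <= 'z') ? (int) wc - 32 : (int) wc;
}

/* Plays the role of the shared .ic file: each expansion produces a
   separate function in which MB_WC is a direct, inlinable call. */
#define DEMO_DEFINE_STRNNCOLL(FUNCTION_NAME, MB_WC)                   \
static int FUNCTION_NAME(const uchar *s, size_t slen,                 \
                         const uchar *t, size_t tlen)                 \
{                                                                     \
  const uchar *se= s + slen, *te= t + tlen;                           \
  for (;;)                                                            \
  {                                                                   \
    unsigned long swc= 0, twc= 0;                                     \
    int sl= MB_WC(&swc, s, se);                                       \
    int tl= MB_WC(&twc, t, te);                                       \
    int sw= sl > 0 ? demo_weight(swc) : -1;                           \
    int tw= tl > 0 ? demo_weight(twc) : -1;                           \
    if (sw != tw)                                                     \
      return sw < tw ? -1 : 1;                                        \
    if (sw < 0)                                                       \
      return 0;                                                       \
    s+= sl;                                                           \
    t+= tl;                                                           \
  }                                                                   \
}

/* One instantiation per character set. */
DEMO_DEFINE_STRNNCOLL(demo_strnncoll_8bit, demo_mb_wc_8bit_quick)
DEMO_DEFINE_STRNNCOLL(demo_strnncoll_ucs2, demo_mb_wc_ucs2_quick)
```

Each instantiation duplicates the loop body, which is the code-size cost of this style, but the decoder call inside the loop is now visible to the compiler and can be inlined.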

The old version generated less executable code, but was slower.
The new version generates more code, but is much faster: there are no virtual function calls at all. Every call inside the new functions is either inlined or at least a direct (static) call.

Part#2: additional changes:

Add fast paths to handle ASCII characters
Add dedicated MY_COLLATION_HANDLERs for collations with no contractions (for the utf8 and utf8mb4 character sets). The choice between the full-featured handler and the "no contraction" handler should be made at collation initialization time.
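
Both points of part #2 can be sketched together. The code below is a rough illustration under stated assumptions: the demo_* names are hypothetical, the weights are toy values, the only contraction is "ch", and the slow path decodes just two-byte UTF-8 sequences.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char uchar;

/* Slow path (demo only): decodes 2-byte UTF-8 sequences; anything else
   becomes U+FFFD and consumes one byte. Requires s < e. */
static inline int demo_mb_wc_utf8_slow(unsigned long *wc,
                                       const uchar *s, const uchar *e)
{
  if (e - s >= 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
  {
    *wc= ((unsigned long) (s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    return 2;
  }
  *wc= 0xFFFD;
  return 1;
}

/* Returns the next toy weight, or -1 at end of input. */
static inline int demo_next_weight(const uchar **pos, const uchar *end,
                                   int with_contractions)
{
  const uchar *s= *pos;
  unsigned long wc;
  if (s >= end)
    return -1;
  if (s[0] < 0x80)                      /* ASCII fast path: no decoding */
  {
    int up= (s[0] >= 'a' && s[0] <= 'z') ? s[0] - 32 : s[0];
    *pos= ++s;
    if (with_contractions && up == 'C' && s < end &&
        (s[0] == 'h' || s[0] == 'H'))
    {
      *pos= s + 1;
      return 'C' * 2 + 1;               /* toy "ch" sorts after "c?" */
    }
    return up * 2;
  }
  *pos+= demo_mb_wc_utf8_slow(&wc, s, end);
  return (int) wc * 2;
}

static inline int demo_compare(const uchar *s, size_t slen,
                               const uchar *t, size_t tlen,
                               int with_contractions)
{
  const uchar *se= s + slen, *te= t + tlen;
  for (;;)
  {
    int sw= demo_next_weight(&s, se, with_contractions);
    int tw= demo_next_weight(&t, te, with_contractions);
    if (sw != tw)
      return sw < tw ? -1 : 1;
    if (sw < 0)
      return 0;
  }
}

/* Two specializations: the constant flag lets the compiler drop the
   contraction check entirely from the second one. */
static int demo_strnncoll_full(const uchar *s, size_t slen,
                               const uchar *t, size_t tlen)
{ return demo_compare(s, slen, t, tlen, 1); }

static int demo_strnncoll_nocontr(const uchar *s, size_t slen,
                                  const uchar *t, size_t tlen)
{ return demo_compare(s, slen, t, tlen, 0); }

/* Hypothetical stand-in for MY_COLLATION_HANDLER. */
typedef struct demo_collation_handler
{
  int (*strnncoll)(const uchar *, size_t, const uchar *, size_t);
} demo_collation_handler;

typedef struct demo_collation
{
  const demo_collation_handler *handler;
} demo_collation;

static const demo_collation_handler demo_handler_full= { demo_strnncoll_full };
static const demo_collation_handler demo_handler_nocontr= { demo_strnncoll_nocontr };

/* The handler is chosen once, at initialization, not per comparison. */
static void demo_collation_init(demo_collation *coll, int contraction_count)
{
  coll->handler= contraction_count > 0 ? &demo_handler_full
                                       : &demo_handler_nocontr;
}
```

With one contraction configured, "ch" compares greater than "cz" in this toy model. With none, the lighter handler is selected and "ch" compares less than "cz", without the per-character contraction check.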

Issue Links

blocks: MDEV-16413 test performance of distinct range queries (Closed)
relates to: MDEV-17502 Change Unicode xxx_general_ci and xxx_bin collation implementation to "inline" style (Closed)
relates to: MDEV-33621 Unify duplicate code in my_wildcmp_uca_impl() and my_wildcmp_unicode_impl() (Closed)

Activity

Alexander Barkov added a comment - 2018-10-17 11:31 - edited

Performance comparison (after the main change and part#2):

utf8_unicode_ci

SET NAMES utf8 COLLATE utf8_unicode_ci;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.33 sec

MySQL-8.0     2.31 sec

MariaDB-10.4  2.79 sec  -- before the change

MariaDB-10.4  1.27 sec  -- after the change

utf8_german2_ci

SET NAMES utf8 COLLATE utf8_german2_ci;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.21 sec

MySQL-8.0     2.25 sec

MariaDB-10.4  2.79 sec  -- before the change

MariaDB-10.4  1.27 sec  -- after the change

utf8_spanish2_ci

SET NAMES utf8 COLLATE utf8_spanish2_ci;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     4.62 sec

MySQL-8.0     2.55 sec

MariaDB-10.4  3.45 sec  -- before the change

MariaDB-10.4  1.78 sec  -- after the change

utf8_thai_520_w2 (difference on the primary level)

SET NAMES utf8 COLLATE utf8_thai_520_w2;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab') AS a;

MariaDB-10.4  3.78 sec  -- before the change

MariaDB-10.4  2.33 sec  -- after the change

utf8_thai_520_w2 (equality on the primary level)

SET NAMES utf8 COLLATE utf8_thai_520_w2;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa') AS a;

MariaDB-10.4  7.28 sec  -- before the change

MariaDB-10.4  4.71 sec  -- after the change

utf8mb4_unicode_ci

SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.20 sec

MySQL-8.0     1.59 sec

MariaDB-10.4  2.79 sec  -- before the change

MariaDB-10.4  1.25 sec  -- after the change

utf8mb4_german2_ci

SET NAMES utf8mb4 COLLATE utf8mb4_german2_ci;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     3.19 sec

MySQL-8.0     1.58 sec

MariaDB-10.4  2.80 sec  -- before the change

MariaDB-10.4  1.29 sec  -- after the change

utf8mb4_spanish2_ci

SET NAMES utf8mb4 COLLATE utf8mb4_spanish2_ci;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' < 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaabbb') AS a;

MySQL-5.7     4.41 sec

MySQL-8.0     1.68 sec

MariaDB-10.4  3.45 sec  -- before the change

MariaDB-10.4  1.76 sec  -- after the change

utf8mb4_thai_520_w2 (difference on the primary level)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaab') AS a;

MariaDB-10.4  3.49 sec  -- before the change

MariaDB-10.4  2.31 sec  -- after the change

utf8mb4_thai_520_w2 (equality on the primary level)

SET NAMES utf8mb4 COLLATE utf8mb4_thai_520_w2;

SELECT BENCHMARK(5000000,'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' <
                         'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa') AS a;

MariaDB-10.4  7.00 sec  -- before the change

MariaDB-10.4  4.61 sec  -- after the change


People

Assignee: Alexander Barkov

Reporter: Alexander Barkov

Votes: 0

Watchers: 2

Dates

Created: 2018-10-16 15:04

Updated: 2024-03-07 13:42

Resolved: 2018-10-18 04:02

MariaDB Server