Details
-
Bug
-
Status: Stalled
-
Major
-
Resolution: Unresolved
-
5.5.1
-
None
-
2021-2, 2021-3, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12
Description
2021-03-02 Update
Before starting work on this issue, we probably need to implement MCOL-4568 first. The values passed to the new function in the MariaDB collation library must be padded with spaces rather than zero bytes.
2021-02-15 Update
Note: the technique described below is only valid for xxx_nopad_bin collations (those with the NO PAD attribute).
It would not be correct to apply the same improvement to xxx_bin (PAD SPACE) collations, because the handling of trailing spaces would change!
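To make the NO PAD versus PAD SPACE distinction concrete, here is a minimal sketch of the two comparison semantics for binary collations. The function names cmp_nopad_bin and cmp_padspace_bin are hypothetical and exist only for illustration; this is a simplified model, not the code of the MariaDB collation library.

```c
#include <stddef.h>
#include <string.h>

/* NO PAD binary comparison (simplified model): bytes are compared as-is,
   and a string that is a strict prefix of the other sorts first. */
static int cmp_nopad_bin(const unsigned char *a, size_t alen,
                         const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  int r = memcmp(a, b, n);
  if (r != 0)
    return r < 0 ? -1 : 1;
  return alen < blen ? -1 : (alen > blen ? 1 : 0);
}

/* PAD SPACE binary comparison (simplified model): the shorter string is
   treated as if it were extended with spaces (0x20) to the longer length. */
static int cmp_padspace_bin(const unsigned char *a, size_t alen,
                            const unsigned char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  size_t i;
  int r = memcmp(a, b, n);
  if (r != 0)
    return r < 0 ? -1 : 1;
  /* Tail of a (if longer) is compared against virtual spaces in b. */
  for (i = n; i < alen; i++)
    if (a[i] != ' ')
      return a[i] < ' ' ? -1 : 1;
  /* Tail of b (if longer) is compared against virtual spaces in a. */
  for (i = n; i < blen; i++)
    if (b[i] != ' ')
      return b[i] > ' ' ? -1 : 1;
  return 0;
}
```

Under this model 'a' and 'a ' are equal with PAD SPACE but different with NO PAD, which is why a shortcut that is valid for one attribute cannot be applied blindly to the other.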
Actual description
After adding collation support into ColumnStore, performance of the comparison operator degraded for short CHAR columns, even for _bin collations.
This happened because:
- Before ColumnStore became collation-aware, it compared short CHAR values as uint32 or uint64 numbers.
- Since adding collations, ColumnStore delegates comparison to a call of CHARSET_INFO::strnncollsp(). The latter compares the data as strings (even for _bin collations), which is slower than comparing numbers.
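The difference between the two approaches can be sketched as follows for a 4-byte value. The helper names (load_be32, cmp_4bytes_as_int, cmp_4bytes_as_string) are hypothetical, and the big-endian load is an assumption made here so that numeric order agrees with byte-wise order; the ticket does not say which byte order ColumnStore uses internally.

```c
#include <stdint.h>
#include <string.h>

/* Load 4 bytes as a big-endian integer, so that the numeric order of the
   result matches the byte-wise (memcmp) order of the input.
   Assumption: this byte order is chosen for illustration only. */
static uint32_t load_be32(const unsigned char *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
         ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}

/* Numeric comparison: one integer compare after loading both values. */
static int cmp_4bytes_as_int(const unsigned char *a, const unsigned char *b)
{
  uint32_t x = load_be32(a), y = load_be32(b);
  return (x > y) - (x < y);
}

/* String comparison: byte-by-byte, normalized to -1, 0 or 1. */
static int cmp_4bytes_as_string(const unsigned char *a, const unsigned char *b)
{
  int r = memcmp(a, b, 4);
  return (r > 0) - (r < 0);
}
```

Both functions return the same result for any pair of 4-byte inputs; the numeric version simply avoids the generic string comparison path.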
The attached patch reconstructs the old ColumnStore behavior inside the MariaDB collation library. It makes comparison measurably faster for CHAR(4).
Benchmarking a 10.5 RelWithDebInfo build before and after applying the patch:
Comparison performance for INT and latin1_swedish_ci (for reference)
select benchmark(100000000,1111=1111);
1 row in set (0.689 sec)

SET NAMES latin1; select benchmark(100000000,'1111'='1111');
1 row in set (0.975 sec)
Comparison performance for BINARY, CHAR(4) latin1_bin, CHAR(4) latin1_nopad_bin
For `clean 10.5` versus `10.5 with patch applied`
SET NAMES binary; select benchmark(100000000,'1111'='1111');
1 row in set (0.958 sec) -- clean
1 row in set (0.839 sec) -- after the patch

SET NAMES latin1 COLLATE latin1_bin; select benchmark(100000000,'1111'='1111');
1 row in set (1.086 sec) -- before the patch
1 row in set (0.812 sec) -- after the patch

SET NAMES latin1 COLLATE latin1_nopad_bin; select benchmark(100000000,'1111'='1111');
1 row in set (1.066 sec) -- before the patch
1 row in set (0.852 sec) -- after the patch
Note that comparison of CHAR(4) latin1_bin (versus comparison of INT data) is:
- 1.086÷0.689 = 1.58 times slower in the clean version
- 0.812÷0.689 = 1.18 times slower in the patched version
The patch gives a good performance improvement: around 25% for latin1_bin (1.086 sec down to 0.812 sec).
Let's add new methods to my_collation_handler_st, with the following tentative API:
int (*strnncollsp_4bytes)(CHARSET_INFO *,
                          const uchar *a,
                          const uchar *b);
int (*strnncollsp_8bytes)(CHARSET_INFO *,
                          const uchar *a,
                          const uchar *b);

(and corresponding wrapper methods in CHARSET_INFO).
So ColumnStore will be able to use these optimized comparison functions for short CHAR and VARCHAR data.
ColumnStore stores short CHAR values in memory in numeric format, either in 4 bytes or in 8 bytes, depending on the width. So it will use:
- strnncollsp_4bytes() for CHAR(1), CHAR(2), CHAR(3), CHAR(4)
- strnncollsp_8bytes() for CHAR(5), CHAR(6), CHAR(7), CHAR(8)
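The width-based choice above could look roughly like the sketch below. Everything here is hypothetical: short_char_cmp is a stand-in struct, not the real my_collation_handler_st; the comparators omit the CHARSET_INFO argument of the tentative API; and the width is taken to be the declared CHAR(N) length with one byte per character.

```c
#include <stddef.h>
#include <string.h>

typedef unsigned char uchar;
typedef int (*short_cmp_fn)(const uchar *a, const uchar *b);

/* Hypothetical stand-in for the two proposed handler entries.
   This is NOT the real my_collation_handler_st. */
typedef struct short_char_cmp
{
  short_cmp_fn cmp_4bytes;
  short_cmp_fn cmp_8bytes;
} short_char_cmp;

/* Placeholder binary comparators over fixed-size buffers. */
static int bin_cmp_4bytes(const uchar *a, const uchar *b)
{
  int r = memcmp(a, b, 4);
  return (r > 0) - (r < 0);
}

static int bin_cmp_8bytes(const uchar *a, const uchar *b)
{
  int r = memcmp(a, b, 8);
  return (r > 0) - (r < 0);
}

/* In-memory size for CHAR(width): 4 for widths 1..4, 8 for widths 5..8,
   and 0 when the column is not a "short" CHAR. */
static size_t short_char_storage(size_t width)
{
  if (width >= 1 && width <= 4)
    return 4;
  if (width >= 5 && width <= 8)
    return 8;
  return 0;
}

/* Pick the fixed-size comparator for a column width, or NULL when the
   caller has to fall back to the generic string comparison. */
static short_cmp_fn pick_short_cmp(const short_char_cmp *h, size_t width)
{
  switch (short_char_storage(width))
  {
  case 4:
    return h->cmp_4bytes;
  case 8:
    return h->cmp_8bytes;
  default:
    return NULL;
  }
}
```

A caller would resolve the comparator once per column rather than once per row, so the width check stays out of the hot comparison loop.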