[MCOL-4534] MariaDB collation library: improve comparison performance in 8bit nopad_bin collations Created: 2021-02-05  Updated: 2023-02-08

Status: Stalled
Project: MariaDB ColumnStore
Component/s: MariaDB Server
Affects Version/s: 5.5.1
Fix Version/s: 23.10

Type: Bug Priority: Major
Reporter: Alexander Barkov Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File cmp-str-as-num.diff    
Issue Links:
Blocks
is blocked by MCOL-4568 Change short CHAR to pad values with ... Open
Relates
relates to MCOL-495 Make string comparison not case sensi... Closed
relates to MCOL-4064 Make JOIN collation aware Closed
relates to MCOL-4539 WHERE short_char_column='literal' ign... Closed
Sprint: 2021-2, 2021-3, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12

 Description   

2021-03-02 Update

Before starting work on this issue, we probably need to implement MCOL-4568 first. The values passed to the new functions in the MariaDB collation library must be padded with spaces rather than zero bytes.

2021-02-15 Update

Note: the technique described below is only valid for xxx_nopad_bin collations (those with the NO PAD attribute).
It would not be correct to apply the same improvement to xxx_bin (PAD SPACE) collations, because the handling of trailing spaces would change.

Actual description

After collation support was added to ColumnStore, performance of the comparison operator degraded for short CHAR columns, even for _bin collations.
This happened because:

  • Before ColumnStore became collation-aware, it compared short CHAR values as uint32 or uint64 numbers.
  • Since collations were added, ColumnStore delegates comparison to CHARSET_INFO::strnncollsp(), which compares the data as strings (even for _bin collations). This is slower than comparing numbers.
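The difference between the two approaches can be illustrated with a minimal sketch. The function names are hypothetical, and the big-endian load is an assumption made here so that numeric order agrees with byte order on any host. It is not taken from ColumnStore or from the attached patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: load 4 bytes as a big-endian uint32, so that
   numeric order matches byte-by-byte order on any host. */
static uint32_t load_be32(const unsigned char *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16) |
         ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}

/* Numeric comparison: a single integer compare, result is -1, 0 or 1. */
static int cmp4_as_number(const unsigned char *a, const unsigned char *b)
{
  uint32_t x = load_be32(a), y = load_be32(b);
  return (x > y) - (x < y);
}

/* String comparison: generic byte-wise compare, normalized to -1, 0 or 1. */
static int cmp4_as_string(const unsigned char *a, const unsigned char *b)
{
  int r = memcmp(a, b, 4);
  return (r > 0) - (r < 0);
}
```

For any pair of 4-byte values, cmp4_as_number() and cmp4_as_string() return the same sign. The numeric version avoids the per-call overhead of a general string comparison.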

The attached patch restores the old ColumnStore behavior inside the MariaDB collation library. It makes comparison measurably faster for CHAR(4).

Benchmarks of a 10.5 RelWithDebInfo build before and after applying the patch:

Comparison performance for INT and latin1_swedish_ci (for reference)

select benchmark(100000000,1111=1111);
1 row in set (0.689 sec)
 
SET NAMES latin1; select benchmark(100000000,'1111'='1111');
1 row in set (0.975 sec)

Comparison performance for BINARY, CHAR(4) latin1_bin, CHAR(4) latin1_nopad_bin

For `clean 10.5` versus `10.5 with patch applied`

SET NAMES binary; select benchmark(100000000,'1111'='1111');
1 row in set (0.958 sec) -- before the patch
1 row in set (0.839 sec) -- after the patch
 
SET NAMES latin1 COLLATE latin1_bin; select benchmark(100000000,'1111'='1111');
1 row in set (1.086 sec) -- before the patch
1 row in set (0.812 sec) -- after the patch
 
SET NAMES latin1 COLLATE latin1_nopad_bin; select benchmark(100000000,'1111'='1111');
1 row in set (1.066 sec) -- before the patch
1 row in set (0.852 sec) -- after the patch

Note that comparison of CHAR(4) latin1_bin data, relative to comparison of INT data, is:

  • 1.086÷0.689 = 1.58 times slower in the clean version
  • 0.812÷0.689 = 1.18 times slower in the patched version

The patch gives a good performance improvement: comparison time for latin1_bin drops by about 25% (from 1.086 sec to 0.812 sec).
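The arithmetic behind these figures can be reproduced with two small helpers. This is only a sketch: the function names are hypothetical, and the inputs are the timings quoted above.

```c
/* Hypothetical helper: how many times slower a timing is than a reference. */
static double slowdown(double seconds, double reference_seconds)
{
  return seconds / reference_seconds;
}

/* Hypothetical helper: percentage of run time saved by the patch. */
static double reduction_pct(double before_seconds, double after_seconds)
{
  return 100.0 * (before_seconds - after_seconds) / before_seconds;
}
```

With the timings above, slowdown(1.086, 0.689) is about 1.58 and slowdown(0.812, 0.689) is about 1.18. reduction_pct() gives about 25% for latin1_bin (1.086 to 0.812), about 20% for latin1_nopad_bin (1.066 to 0.852) and about 12% for binary (0.958 to 0.839).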

Let's add new methods to my_collation_handler_st, with the following tentative API:

  int     (*strnncollsp_4bytes)(CHARSET_INFO *,
                                const uchar *a,
                                const uchar *b);
  int     (*strnncollsp_8bytes)(CHARSET_INFO *,
                                const uchar *a,
                                const uchar *b);

(and corresponding wrapper methods in CHARSET_INFO).
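One possible shape for such methods in a nopad_bin collation is sketched below. It is self-contained and does not use the real MariaDB headers: DEMO_CHARSET_INFO, DEMO_COLLATION_HANDLER and every demo_ function are hypothetical stand-ins, and the big-endian load is an assumption chosen so that the result has the same sign as a byte-wise comparison. The attached patch may do this differently.

```c
#include <stddef.h>
#include <stdint.h>

typedef unsigned char uchar;

/* Hypothetical stand-in for CHARSET_INFO (left opaque in this sketch). */
typedef struct demo_charset_info DEMO_CHARSET_INFO;

/* Hypothetical stand-in for the two new members of my_collation_handler_st. */
typedef struct demo_collation_handler
{
  int (*strnncollsp_4bytes)(DEMO_CHARSET_INFO *, const uchar *a, const uchar *b);
  int (*strnncollsp_8bytes)(DEMO_CHARSET_INFO *, const uchar *a, const uchar *b);
} DEMO_COLLATION_HANDLER;

/* Load n bytes (n <= 8) as a big-endian unsigned integer. */
static uint64_t demo_load_be(const uchar *p, int n)
{
  uint64_t v = 0;
  for (int i = 0; i < n; i++)
    v = (v << 8) | p[i];
  return v;
}

/* Compare two 4-byte values as numbers; returns -1, 0 or 1. */
static int demo_strnncollsp_4bytes(DEMO_CHARSET_INFO *cs,
                                   const uchar *a, const uchar *b)
{
  (void) cs; /* a binary collation needs no per-charset data here */
  uint64_t x = demo_load_be(a, 4), y = demo_load_be(b, 4);
  return (x > y) - (x < y);
}

/* Compare two 8-byte values as numbers; returns -1, 0 or 1. */
static int demo_strnncollsp_8bytes(DEMO_CHARSET_INFO *cs,
                                   const uchar *a, const uchar *b)
{
  (void) cs;
  uint64_t x = demo_load_be(a, 8), y = demo_load_be(b, 8);
  return (x > y) - (x < y);
}

/* Handler instance wiring both methods together. */
static const DEMO_COLLATION_HANDLER demo_nopad_bin_handler =
{
  demo_strnncollsp_4bytes,
  demo_strnncollsp_8bytes
};
```

A caller would go through the handler, for example demo_nopad_bin_handler.strnncollsp_4bytes(NULL, a, b), in the same way existing code goes through the collation handler for strnncollsp().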

ColumnStore will then be able to use these optimized comparison functions for short CHAR and VARCHAR data.

ColumnStore stores short CHAR values in memory in numeric format, either in 4 bytes or in 8 bytes, depending on the width. So it will use:

  • strnncollsp_4bytes() for CHAR(1), CHAR(2), CHAR(3), CHAR(4)
  • strnncollsp_8bytes() for CHAR(5), CHAR(6), CHAR(7), CHAR(8)
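The width-based choice can be sketched as follows. Both helpers are hypothetical and are not ColumnStore code. The sketch assumes an 8-bit character set (one byte per character) and the space padding that MCOL-4568 proposes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: slot size in bytes for a CHAR(width) column.
   Returns 4 (use strnncollsp_4bytes), 8 (use strnncollsp_8bytes),
   or 0 when the column is not a short CHAR. */
static int short_char_slot_bytes(unsigned width)
{
  if (width >= 1 && width <= 4)
    return 4;
  if (width >= 5 && width <= 8)
    return 8;
  return 0;
}

/* Hypothetical helper: copy a value into its fixed-size slot and pad
   the remainder with spaces (assumed padding, per MCOL-4568). */
static void pack_short_char(unsigned char *slot, int slot_bytes,
                            const char *value, size_t len)
{
  assert(len <= (size_t) slot_bytes);
  memset(slot, ' ', (size_t) slot_bytes);
  memcpy(slot, value, len);
}
```

For example, a CHAR(3) value "ab" would occupy a 4-byte slot holding "ab" followed by two spaces, and a CHAR(6) value would occupy an 8-byte slot.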
