Before starting work on this issue, we probably need to implement MCOL-4568 first. The values passed to the new function in the MariaDB collation library must be padded with spaces rather than zero bytes.
Note that the technique described below is only valid for xxx_nopad_bin collations (those with the NO PAD attribute).
It would not be correct to apply the same improvement to xxx_bin (PAD SPACE) collations, because the handling of trailing spaces would change!
After collation support was added to ColumnStore, performance of the comparison operator degraded for short CHAR columns, even for _bin collations.
This happened because:
- Before ColumnStore was made collation-aware, it compared short CHAR values as uint32 or uint64 numbers.
- Since collations were added, ColumnStore delegates comparison to a call to CHARSET_INFO::strnncollsp(). The latter compares the data as strings (even for _bin collations), which is slower than comparing numbers.
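The difference between the two approaches can be illustrated with a minimal sketch. It is not ColumnStore or MariaDB code: `pack_char4()`, `cmp_as_number()` and `cmp_as_string()` are hypothetical helpers, and the sketch assumes a binary collation, space padding (as required by MCOL-4568), and that the first character lands in the most significant byte so that integer order matches byte order. The actual in-memory layout used by ColumnStore may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string_view>

// Hypothetical helper: pack up to 4 bytes into a uint32_t.
// The first character goes into the most significant byte and the
// tail is padded with spaces (0x20), not with zero bytes.
inline uint32_t pack_char4(std::string_view s)
{
  assert(s.size() <= 4);
  uint32_t v = 0;
  for (size_t i = 0; i < 4; i++)
  {
    unsigned char c = i < s.size() ? static_cast<unsigned char>(s[i]) : ' ';
    v = (v << 8) | c;
  }
  return v;
}

// "Old" style: one integer comparison, result normalized to -1/0/+1.
inline int cmp_as_number(uint32_t a, uint32_t b)
{
  return (a > b) - (a < b);
}

// String style: byte-wise comparison of two 4-byte buffers,
// result normalized to -1/0/+1.
inline int cmp_as_string(const char* a, const char* b)
{
  int r = std::memcmp(a, b, 4);
  return (r > 0) - (r < 0);
}
```

Under these assumptions both functions return the same sign for any pair of space-padded 4-byte values, but the numeric version needs only a single integer comparison.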
The attached patch restores the old ColumnStore behavior inside the MariaDB collation library. It makes comparison measurably faster for CHAR(4).
Benchmarking a 10.5 RelWithDebInfo build before and after applying the patch:
For `clean 10.5` versus `10.5 with patch applied`
Note that comparison of CHAR(4) latin1_bin data (versus comparison of INT data) is:
- 1.086÷0.689 = 1.58 times slower in the clean version
- 0.812÷0.689 = 1.18 times slower in the patched version
The patch gives a good performance improvement: the CHAR(4) latin1_bin timing drops from 1.086 to 0.812, around 25%.
Let's add new methods to my_collation_handler_st, with the following tentative API:
(and corresponding wrapper methods in CHARSET_INFO).
ColumnStore will then be able to use these optimized comparison functions for short CHAR and VARCHAR data.
ColumnStore stores short CHAR values in memory in numeric format, either in 4 bytes or in 8 bytes, depending on the width. So it will use:
- strnncollsp_4bytes() for CHAR(1), CHAR(2), CHAR(3), CHAR(4)
- strnncollsp_8bytes() for CHAR(5), CHAR(6), CHAR(7), CHAR(8)
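The width-based choice above could be sketched as follows. This is only an illustration: `ShortCharCmp`, `pick_short_char_cmp()`, `compare_short_char()` and the two `*_stub` comparators are hypothetical names, the stubs merely stand in for the proposed strnncollsp_4bytes() and strnncollsp_8bytes() methods (whose real signatures are not assumed here), and the sketch assumes a single-byte character set such as latin1, where the column width in characters equals its width in bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tag for the comparison path chosen for a column.
enum class ShortCharCmp { Bytes4, Bytes8, Generic };

// Map a CHAR(N) width to the comparison path listed above.
inline ShortCharCmp pick_short_char_cmp(unsigned width)
{
  if (width >= 1 && width <= 4) return ShortCharCmp::Bytes4;
  if (width >= 5 && width <= 8) return ShortCharCmp::Bytes8;
  return ShortCharCmp::Generic;  // longer columns keep using strnncollsp()
}

// Stand-ins for the proposed collation methods (assumption: they
// reduce to a plain integer comparison for nopad binary collations).
inline int cmp_4bytes_stub(uint32_t a, uint32_t b) { return (a > b) - (a < b); }
inline int cmp_8bytes_stub(uint64_t a, uint64_t b) { return (a > b) - (a < b); }

// Compare two short CHAR values already held in numeric format.
inline int compare_short_char(unsigned width, uint64_t a, uint64_t b)
{
  switch (pick_short_char_cmp(width))
  {
    case ShortCharCmp::Bytes4:
      return cmp_4bytes_stub(static_cast<uint32_t>(a), static_cast<uint32_t>(b));
    case ShortCharCmp::Bytes8:
      return cmp_8bytes_stub(a, b);
    default:
      assert(false && "widths above 8 must use the generic string path");
      return 0;
  }
}
```

Resolving the path once per column, rather than once per row, would keep the per-value cost down to the integer comparison itself.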