[MDEV-21543] hp_rec_key_cmp suboptimal comparison Created: 2020-01-21  Updated: 2020-03-30

Status: Open
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major
Reporter: Georgy Kirichenko Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: beginner-friendly
Environment:

Linux



 Description   

hp_rec_key_cmp performs string comparison in two phases: in the first phase it computes
the octet length of the compared strings, and in the second it does the actual comparison using strnncollsp.
The main issue is that hp_rec_key_cmp decodes every compared string completely, even though the strings may differ as early as their first characters. While this has no performance impact for fixed-length encodings, UTF-8 performance suffers considerably.
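The two-phase structure can be sketched as follows. This is a minimal illustration with my own simplified UTF-8 helpers, not MariaDB code: phase 1 scans every character of both values just to learn their byte lengths, and phase 2 (a binary stand-in for strnncollsp, without real collation rules) then compares the bytes.

```c
#include <string.h>
#include <stddef.h>

/* Byte length of one UTF-8 character from its lead byte
   (simplified: valid input assumed). */
static size_t utf8_char_bytes(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* ASCII */
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

/* Phase 1: scan the whole value to find its octet length.
   This is the per-character decoding pass the report says is wasted. */
static size_t utf8_octet_len(const char *s, size_t char_count)
{
    size_t bytes = 0;
    for (size_t i = 0; i < char_count; i++)
        bytes += utf8_char_bytes((unsigned char) s[bytes]);
    return bytes;
}

/* Phase 2: compare using the precomputed octet lengths.
   Stand-in for strnncollsp -- plain binary comparison here. */
static int two_phase_cmp(const char *a, const char *b, size_t char_count)
{
    size_t la = utf8_octet_len(a, char_count); /* full scan of a */
    size_t lb = utf8_octet_len(b, char_count); /* full scan of b */
    size_t min = la < lb ? la : lb;
    int r = memcmp(a, b, min);
    if (r != 0)
        return r;
    return (la > lb) - (la < lb);
}
```

Note that both `utf8_octet_len` calls run to completion before a single byte is compared, even when the very first bytes already decide the result.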



 Comments   
Comment by Arhant Jain [ 2020-03-28 ]

Hi Georgy,

I'm trying to resolve this bug, which you reported.
I looked at the code in the server/storage/heap/hp_rkey.c file. Unfortunately, I'm unable to pinpoint the case in which hp_rec_key_cmp decodes the compared strings completely.

Please tell me whether I'm looking at the correct code. If not, please guide me towards resolving this issue.

I will be happy if anyone can help me resolve this issue.

Thanks
Arhant

Comment by Georgy Kirichenko [ 2020-03-30 ]

Hi Arhant,

So let me give you an example. Suppose we have two values, "abc" and "def", in a 5-char utf8-encoded column (which means MySQL reserves 15 bytes, 3 octets per character, for each value). Since the column is 5 characters wide, both values are padded with spaces, so we actually store "abc  " and "def  ". Now take a look at the strnncollsp interface: it requires an octet length for each compared string. So, before this function can be called, MySQL has to find the last octet of the last character in each string. As utf8 is a variable-length encoding, each string has to be processed from its first character to its last, character by character. This is suboptimal, because the comparison itself will stop at the very first characters, 'a' and 'd'.

Now imagine that both strings are 256 characters long and differ somewhere within their first 10 characters: we did about 25 times more utf8 decoding work than required.
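The ~25x figure can be checked with a small instrumented sketch (again my own simplified UTF-8 helpers, not MariaDB code). Counting lead-byte decodes during the octet-length scan of two 256-character values gives 512 decodes up front, while a comparison that stops at a difference within the first 10 characters would need at most 20.

```c
#include <string.h>
#include <stddef.h>

static size_t chars_decoded; /* instrumentation: lead bytes examined */

/* Byte length of one UTF-8 character (simplified, valid input assumed). */
static size_t utf8_char_bytes(unsigned char lead)
{
    chars_decoded++;
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

/* The phase-1 octet-length scan: decodes every character of the value,
   regardless of where the eventual comparison would stop. */
static size_t utf8_octet_len(const char *s, size_t char_count)
{
    size_t bytes = 0;
    for (size_t i = 0; i < char_count; i++)
        bytes += utf8_char_bytes((unsigned char) s[bytes]);
    return bytes;
}
```

Running the scan over two 256-character values decodes 512 characters; an early-exit comparison of values differing at the 10th character would touch roughly 20, hence the factor of about 25.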

The best way to fix this issue, as I see it, is to extend the collation function API with a char_count parameter and get rid of the precalculation of the strings' octet lengths. This could also be worthwhile for other collation functions.
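A char_count-based interface along those lines might look like the sketch below. The name and signature are hypothetical, not MariaDB's actual API; the point is that the comparator receives the character count of the (space-padded) values and decodes lazily, stopping at the first difference instead of precomputing octet lengths. It uses binary ordering only; a real strnncollsp replacement would apply collation weights per character.

```c
#include <string.h>
#include <stddef.h>

/* Byte length of one UTF-8 character (simplified, valid input assumed). */
static size_t utf8_char_bytes(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    return 4;
}

/* Hypothetical char_count-aware comparator: walks both values in
   lockstep, one character at a time, and returns as soon as a
   difference is found -- no octet-length precalculation needed. */
static int cmp_by_char_count(const char *a, const char *b,
                             size_t char_count)
{
    size_t ia = 0, ib = 0;
    for (size_t i = 0; i < char_count; i++) {
        size_t na = utf8_char_bytes((unsigned char) a[ia]);
        size_t nb = utf8_char_bytes((unsigned char) b[ib]);
        size_t min = na < nb ? na : nb;
        int r = memcmp(a + ia, b + ib, min);
        if (r != 0)
            return r;
        if (na != nb)            /* defensive: differing char widths */
            return (na > nb) - (na < nb);
        ia += na;
        ib += nb;
    }
    return 0;
}
```

For the "abc  " vs "def  " example above, this returns after decoding a single character of each value.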

I also have a patch and will be happy to share it, but I am waiting for approval from my employer.

Generated at Thu Feb 08 09:07:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.