[MDEV-32016] change the hash used for hash unique Created: 2023-08-25  Updated: 2023-11-28

Status: Open
Project: MariaDB Server
Component/s: Data Definition - Create Table
Fix Version/s: 10.6, 10.11, 11.0, 11.1

Type: Task Priority: Critical
Reporter: Sergei Golubchik Assignee: Sergei Golubchik
Resolution: Unresolved Votes: 1
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-27653 long uniques don't work with unicode ... Closed

 Description   

After MDEV-27653 the hash used for hash uniques was changed to be 32-bit. This is a portable value (unlike the old one) but 64-bit hash would've produced less collisions. Let's change the hash to be

  • not using ulong and other non-portable data types
  • 64-bit always, on all architectures
  • generally providing good and uniform distribution
  • fast

Old tables must keep working, of course. This is the third but supposedly not the last time we'll change the hash function, so let's store the used hash function in the extra2 area. It'll simplify future changes.

May be we can change all non-persistent hash tables in the server to use the new function.

This bug will not cause wrong answers, only affects performance (as there are more collisions).
MariaDB will remember if a hash key was created as 32 or 64 bit and continue to use the original length even if we change back the default to 64 bit at some point.



 Comments   
Comment by Sergei Golubchik [ 2023-08-30 ]

loosely translated from slack discussion:

A stream hash from a continuous buffer is best done by xxHash - they use simd and claim to be cross-platform. It is clear that xxHash won't support all our platforms thought.

Surprisingly low hash collision rates were shown by our hash function:
https://github.com/mariadb-corporation/mariadb-columnstore-engine/blob/develop/utils/common/hasher.h#L142

A function from RobinHood isn't bad, even though it's not supported:
https://github.com/mariadb-corporation/mariadb-columnstore-engine/blob/develop/utils/common/robin_hood.h#L700

There's a tooling in the xxHash repo for comparing function parameters - I highly recommend using it in research: https://github.com/Cyan4973/xxHash and https://github.com/Cyan4973/xxHash/tree/dev/tests/collisions

By the way, https://github.com/Cyan4973/xxHash/tree/dev/tests/collisions works amazingly fast with a stream that is not stored in contiguous memory. I, as you understand, tested x86_64, we did not support ARM back then. Supposedly on other platforms you can see a different picture.

Generated at Thu Feb 08 10:28:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.