MariaDB Server
MDEV-35247

ut_fold_ull() and ut_hash_ulint() are a waste

Details

    Description

      On 64-bit CPU architectures, the default InnoDB data type ulint, which is an alias of size_t, is 64 bits wide.

      The function ut_fold_ull() converts a 64-bit integer to ulint. It can be simplified on 32-bit systems, and on 64-bit systems we could avoid the extra instructions altogether and just use the 64-bit index or table identifiers as is.

      The function ut_hash_ulint() seems to be an unnecessary obfuscation. Computing an exclusive OR with a constant before performing a modulus operation does not seem to affect the entropy at all.

          Activity

            In terms of code, the current state is as follows:

            # if SIZEOF_SIZE_T < 8
            inline size_t ut_fold_ull(uint64_t d) noexcept
            {
              size_t n1= uint32_t(d), n2= uint32_t(d >> 32);
              return ((((n1 ^ n2 ^ 1653893711) << 8) + n1) ^ 1463735687) + n2;
            }
            # else
            #  define ut_fold_ull(d) d
            # endif
            

            The 32-bit version is equivalent to the existing definition. The numeric constants seem to be arbitrary. I remember that Heikki Tuuri had a habit of creating "random numbers" by typing them on the upper row of the keyboard. Especially the 1463735687 does not look very random: it contains a long sequence of near-consecutive digits.

            In fact, this function looks like the "obfuscated hashing" mentioned by https://sortingsearching.com/2020/05/21/hashing.html. That article also mentions universal hashing and a hash function devised by Carter and Wegman back in 1977:

            #define h(x) ((a*x+b)%p)%m
            

            Here, p would be a prime number, and a and b would be random integers between 0 and p-1. The m applied in this case could be 1<<32, and a suitable p could be 0x1000000F. What if we "randomly" choose a=1 and b=3, so that the expensive multiplication disappears?

            #include <cstdint>
            uint32_t ut_fold_ull(uint64_t num) noexcept
            {
              return uint32_t((3 + num) % 0x1000000F);
            }
            

            On IA-32, the above would emit add and adc for the addition (no matter which constant we choose) and a call to the runtime library function __umoddi3, which looks very expensive. (It is unavoidable; it has to perform 64-bit division arithmetic on a 32-bit ISA.)
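For reference, the full Carter–Wegman form with a nonzero multiplier might be sketched as follows; the constants a and b here are arbitrary illustrative picks (not from this ticket), and only p = 0x1000000F comes from the discussion above:

```cpp
#include <cstdint>

// Carter-Wegman universal hashing, h(x) = ((a*x + b) % p) % m.
// a and b are arbitrary illustrative constants in [0, p-1];
// p = 0x1000000F is the prime suggested in the comment above.
// m = 1<<32 is applied implicitly by truncating to uint32_t.
static uint32_t cw_hash(uint64_t x) noexcept
{
  const uint64_t p = 0x1000000F; // prime slightly above 1<<28
  const uint64_t a = 12345;      // illustrative, 0 < a < p
  const uint64_t b = 3;          // illustrative, 0 <= b < p
  // Note: a*x may wrap modulo 2^64 for large x; a faithful
  // implementation would use 128-bit or modular multiplication.
  return uint32_t((a * x + b) % p);
}
```

This makes the cost concrete: even with small constants, the 64-bit modulus is the expensive part on a 32-bit ISA.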

            We know that GCC knows how to optimize mach_read_from_4() and friends. What if we used a single carry-less addition, a.k.a. exclusive OR?

            static unsigned ut_fold_ull(unsigned long long num)
            {
              return ((unsigned) num) ^ (unsigned) (num >> 32);
            }
            int main (int argc, char **argv)
            {
              return ut_fold_ull((unsigned)argc * 123456);
            }
            

            The main function would translate to the following IA-32 instructions:

            imul   123456,0x4(%esp),%eax
            ret
            

            I think that the above solution should be good.

            We should keep in mind that the only use of ut_fold_ull is to shrink 64-bit table or index identifiers to 32 bits. On 32-bit systems, I would assume that the most significant half of these IDs would typically be 0; nobody should be creating billions of tables on such a small machine. So, I would opt for the following simple function for narrower-than-64-bit ISAs:

            #include <cstdint>
            uint32_t ut_fold_ull(uint64_t num) noexcept
            {
              return uint32_t(num) ^ uint32_t(num >> 32);
            }
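As a quick property check of the fold above (a standalone sketch that restates the function so it compiles on its own): identifiers whose most significant half is zero fold to themselves, while a set bit in the upper half is mixed in via exclusive OR.

```cpp
#include <cstdint>

// The proposed 32-bit fold: exclusive OR of the two 32-bit halves.
static uint32_t ut_fold_ull(uint64_t num) noexcept
{
  return uint32_t(num) ^ uint32_t(num >> 32);
}
```

For small IDs (the common case on 32-bit systems) the fold is the identity, so no information is lost.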
            

            marko Marko Mäkelä added a comment

            I was thinking a bit more about the 32-bit case. The Wikipedia article on universal hashing mentions Rabin–Karp rolling hash, also known as a multiplicative hash function. The following should be a little better at avoiding collisions than a straight exclusive-or:

            # if SIZEOF_SIZE_T < 8
            inline size_t ut_fold_ull(uint64_t d) noexcept
            {
              return size_t(d) * 31 + size_t(d >> (SIZEOF_SIZE_T * CHAR_BIT));
            }
            # else
            #  define ut_fold_ull(d) d
            # endif
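A tiny illustration of why the multiplier helps (a standalone sketch, not InnoDB code): two IDs whose 32-bit halves cancel under exclusive OR no longer collide under the 31-multiplier fold.

```cpp
#include <cstdint>

// XOR fold from the earlier comment, for comparison.
static uint32_t fold_xor(uint64_t d) noexcept
{
  return uint32_t(d) ^ uint32_t(d >> 32);
}

// Rabin-Karp style multiplicative fold proposed above (32-bit size_t case).
static uint32_t fold_mul(uint64_t d) noexcept
{
  return uint32_t(d) * 31 + uint32_t(d >> 32);
}
```

For example, the ID 0x100000001 (both halves equal to 1) XOR-folds to 0, colliding with the ID 0, whereas the multiplicative fold keeps them distinct.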
            

            marko Marko Mäkelä added a comment

            I dug a bit deeper while investigating the optimized code that the above was translated into as part of a larger function:

               0x00fd8ecb <+3131>:	mov    0x64(%eax),%ecx
               0x00fd8ece <+3134>:	mov    0x60(%eax),%ebx
               0x00fd8ed1 <+3137>:	mov    %edx,%eax
               0x00fd8ed3 <+3139>:	shl    $0x5,%eax
               0x00fd8ed6 <+3142>:	sub    %edx,%eax
               0x00fd8ed8 <+3144>:	mov    0x4(%edi),%edx
               0x00fd8edb <+3147>:	add    %edx,%eax
               0x00fd8edd <+3149>:	xor    %edx,%edx
               0x00fd8edf <+3151>:	div    %ebx
            

            The two mov instructions load the 64-bit quantity into 32-bit registers, and the following 3 instructions multiply the least significant half by 31 (multiply by 32, then subtract the original value). Finally, the add instruction at +3147 adds the most significant 32 bits. This looks rather efficient to me.
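The strength reduction the compiler performed can be spelled out in C++ (a minimal standalone check, not InnoDB code):

```cpp
#include <cstdint>

// Multiplication by 31, strength-reduced the way the compiler did it:
// shift left by 5 (multiply by 32), then subtract the original value.
static uint32_t mul31(uint32_t x) noexcept
{
  return (x << 5) - x;
}
```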

            While investigating this, I was wondering about the use of the edx register. I can’t explain the dereferencing of edi, but the xor %edx,%edx appears to be necessary because the IA-32 div instruction takes a 64-bit dividend in edx:eax. So, we might as well compute table->id % hash_size directly and skip the ut_fold_ull() step altogether. Alas, both GCC and clang seem to emit a call to __umoddi3 instead of generating something like the following:

            mov    0x64(%eax),%edx
            mov    0x60(%eax),%eax
            div    %ebx
            

            So, I think that we should stick to the above.

            marko Marko Mäkelä added a comment

            I forgot to mention that the above experiment was without the ut_hash_ulint() function. I’d remove it as part of implementing this change. Quoting my comment in MDEV-35049:

            I had been wondering about the function ut_hash_ulint(), mainly about the usefulness of the exclusive-or operation. A simplified definition is as follows:

            typedef size_t ulint;
            inline ulint ut_hash_ulint(ulint key, ulint table_size) noexcept
            {
              return (key ^ 1653893711) % table_size;
            }
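A quick way to see that the constant adds nothing (a standalone sketch restating the function): XOR with a fixed constant is its own inverse, hence a bijection on keys; it merely permutes which key lands in which bucket and cannot improve the distribution of uniformly drawn keys.

```cpp
#include <cstddef>

typedef size_t ulint;

// Simplified ut_hash_ulint() as quoted above.
inline ulint ut_hash_ulint(ulint key, ulint table_size) noexcept
{
  return (key ^ 1653893711) % table_size;
}

// XOR with a constant is an involution: applying it twice restores the
// key, so it is a bijection and adds no entropy before the modulus.
inline ulint xor_const(ulint key) noexcept
{
  return key ^ 1653893711;
}
```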
            

            I see that Oracle simplified this to key % table_size, which is much more meaningful.

            marko Marko Mäkelä added a comment

            One thing that seems worth improving is the common hash table abstraction/interface: something like Hash_Table<T> that would let us pick the best hash algorithm without modifying other modules, and without exposing external operations like fold() and hash(), which can end up being applied repeatedly when not needed. Oracle's simplification has the nice property that it computes the hash only once and then uses the HASH_ interface to distribute entries into buckets with the division method. A better approach, however, is to make the hashing part of the Hash_Table abstraction, so that the hash algorithm can be steered more easily in the future.
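One possible shape of such an abstraction (a hypothetical sketch: Hash_Table, the Hash policy parameter, and the bucket layout are all illustrative, not existing InnoDB code) makes the hash algorithm a template parameter, so callers never see fold() or hash():

```cpp
#include <cstddef>
#include <cstdint>
#include <forward_list>
#include <vector>

// Hypothetical Hash_Table<T, Hash>: the hash policy is a template
// parameter, so changing the algorithm touches no other module.
template<typename T, typename Hash>
class Hash_Table
{
  std::vector<std::forward_list<T>> buckets;
  Hash hash;
public:
  explicit Hash_Table(size_t n) : buckets(n) {}
  void insert(const T &value)
  {
    // Division method: the table, not the caller, maps hash to bucket.
    buckets[hash(value) % buckets.size()].push_front(value);
  }
  bool contains(const T &value) const
  {
    for (const T &v : buckets[hash(value) % buckets.size()])
      if (v == value)
        return true;
    return false;
  }
};

// Example policy: the identity fold discussed in this ticket.
struct Id_hash
{
  size_t operator()(uint64_t id) const noexcept { return size_t(id); }
};
```

With this shape, swapping in a multiplicative or universal hash is a one-line change to the policy type.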

            As for the hashing algorithm itself, it is indeed based on ad hoc constants today and can be improved. I agree that we could improve the distribution, possibly using the multiplication method, or even switch to a different hash table design such as open addressing, universal hashing, or perfect hashing. Since hash tables are used in many low-level constructs within InnoDB, including pages and locks, a good amount of experimentation with the distribution and performance testing is advisable.

            debarun Debarun Banerjee added a comment

            I agree that it could make sense to refactor an InnoDB hash table class template. The downside of that is that template metaprogramming tends to make code harder to read. In the following we have a prime example of that:

            typedef Pool<trx_t, TrxFactory, TrxPoolLock> trx_pool_t;
            typedef PoolManager<trx_pool_t, TrxPoolManagerLock> trx_pools_t;
            

            These are the only uses of those templates.

            Two currently used hash table variants are similar and could use some common code, instead of duplicating the logic. Both buf_pool.page_hash and lock_sys.rec_hash use slim rw-locks that are pushed down to the hash array cache lines. They diverge a little in the SUX_LOCK_GENERIC case, that is, when no system call like Linux 2.6 futex(2) is available. While we can afford to busy-wait in buf_pool.page_hash, for lock_sys.rec_hash it makes more sense to use a blocking mechanism. Changing the implementation of these hash tables is far from trivial. In fact, we ended up adding some extra padding (or waste) on ARMv8, to only use 64 bytes of each 128-byte cache line. This was based on some performance testing results. Replacing the current array of singly linked chains of buckets with something else (such as open addressing) could also require a complete redesign of the locking.

            Other uses of the hash table use external locking, and there we would be freer to try different things. I think that some refactoring might make sense after the adaptive hash index has been refactored in MDEV-35049. One thing that we should keep in mind while refactoring is how to avoid machine code duplication. Even if the underlying hash formula (think java.lang.Object.hashCode()) differs between instances of hash_table_t, it could make sense to retain the generic hash_table_t and hash_cell_t classes. I see that currently all member functions of both classes are defined inline, so this may be a moot point.

            In any case, I think that while a larger refactoring of the hash tables may be useful, it is outside the scope of this ticket.

            marko Marko Mäkelä added a comment

            The hash table abstraction looks good to me. Thanks.

            debarun Debarun Banerjee added a comment

            People

              marko Marko Mäkelä
              marko Marko Mäkelä
              Votes: 0
              Watchers: 2