US patent US7370054 (filed in 2004 by Sun Microsystems, i.e. it belongs to Oracle now, https://patentimages.storage.googleapis.com/14/68/80/d8d4ce50b3430b/US7370054.pdf) addresses the problem of removing the dummy nodes on hash table shrinking, as well as a hash table design with the dummy nodes eliminated. The patent text is somewhat obscure, so there is a risk of legal claims if the dummy nodes are removed from the current implementation.
The problem with the dummy nodes is also discussed in a number of newer research papers. There is also newer research in the area of concurrent hash tables and tries which outperforms the original implementation. The papers of interest are:
- "Dynamic-sized nonblocking hash tables", 2014, https://dl.acm.org/doi/pdf/10.1145/2611462.2611495
- "An Efficient Wait-free Resizable Hash Table", 2018, https://dl.acm.org/doi/pdf/10.1145/3210377.3210408
- "Cache-Aware Lock-Free Concurrent Hash Tries", 2011, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.6644&rep=rep1&type=pdf
- "Towards a Lock-Free, Fixed Size and PersistentHash Map Design", 2017, https://www.dcc.fc.up.pt/~ricroc/homepage/publications/2017-SBAC-PAD.pdf
- "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings", 2007, http://www.stochasticgeometry.ie/~mdennehy/CRPITV62Askitis.pdf
- "Burst Tries: A Fast, Efficient Data Structure forString Keys", 2002, http://www.lindstaedt.com.br/estruturas/bursttries.pdf
[1] and [2] propose faster hash table implementations and compare them against split-ordered lists.
The existing dynamic array coupled with the split-ordered lists is very close in its structure to a burst hash trie [3, 5, 6]. There are also CPU cache-conscious data structures [3, 5], which touch as few cache lines as possible. E.g. in TempestaDB we use an aggregation of these ideas (https://github.com/tempesta-tech/tempesta/blob/master/tempesta_db/core/htrie.c, note that the HTrie implementation is incomplete at the moment), which provides lock-free HTrie data access with constant worst-case access time (rw-locks are acquired only for hash collisions in the 64-bit key space). A rough illustration of the burst-trie idea is sketched below.
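For illustration only (this is not the TempestaDB or HAT-trie layout, just a minimal sketch of the burst-trie idea assumed above): a leaf bucket is sized to a couple of cache lines, and once it overflows it is "burst" into an index node whose children are selected by the next bits of the key, so a lookup touches very few cache lines.

```c
#include <stddef.h>
#include <stdint.h>

#define BUCKET_SLOTS 7          /* keeps one bucket within ~2 cache lines */
#define FANOUT_BITS  4          /* 16 children per index node */

/* Leaf bucket: a small, densely packed array of key/value pairs. */
struct bucket {
    uint8_t   used;                     /* number of occupied slots */
    uint64_t  keys[BUCKET_SLOTS];
    void     *vals[BUCKET_SLOTS];
};

/* Index node created when a bucket bursts; children are either buckets or
 * further index nodes (a real design would tag the pointers to tell them
 * apart). */
struct index_node {
    void *children[1 << FANOUT_BITS];
};

/* Linear scan inside one bucket: the whole scan stays within the bucket's
 * cache lines, which is the cache-consciousness the papers above aim for. */
static void *bucket_find(const struct bucket *b, uint64_t key)
{
    for (uint8_t i = 0; i < b->used; i++)
        if (b->keys[i] == key)
            return b->vals[i];
    return NULL;
}
```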
The split-ordered list is bound to use the '%' operation as its hash function, which is collision prone, while hash tries can use much better hash functions.
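As a small illustration of the difference (the mixing function below is the splitmix64 finalizer, picked purely as an example of a stronger hash; it is not what lf_hash or any of the cited papers actually use):

```c
#include <stdint.h>

/* Plain modulo bucket selection, as in the split-ordered list: keys that
 * share low-order bits, or that step by a multiple of n_buckets, collide. */
static inline uint32_t bucket_mod(uint64_t key, uint32_t n_buckets)
{
    return (uint32_t)(key % n_buckets);
}

/* splitmix64 finalizer as an example of a cheap mixing hash: every input
 * bit affects every output bit, so even skewed keys spread out. */
static inline uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

static inline uint32_t bucket_mixed(uint64_t key, uint32_t n_buckets)
{
    return (uint32_t)(mix64(key) % n_buckets);
}
```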
Typically lock-free/wait-free resizing is more efficient for hash tries than for classic hash tables, because a trie can grow concurrently in different nodes/branches.
rw_trx_hash_t uses trx_id, which is more or less a sequential counter, as its key, so the key set is significantly skewed. This key distribution forces a trade-off for a hash trie between memory consumption and concurrency. It might be beneficial to use the keys directly, without hashing; one possible way to do that is sketched below.
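A hedged sketch of what "using the keys directly" could mean (just one possible interpretation, not a concrete proposal): index each radix-trie level with a slice of trx_id itself, starting from the low-order bits, so that a sequential counter fans out across children instead of piling up in a single branch.

```c
#include <stdint.h>

#define RADIX_BITS 6                          /* 64 children per level */
#define RADIX_MASK ((1u << RADIX_BITS) - 1)

/* Child index at a given trie level, taken from the low-order bits first:
 * sequential trx_ids 1, 2, 3, ... already differ in the level-0 index, so
 * inserts spread across branches instead of concentrating in one. */
static inline unsigned radix_index(uint64_t trx_id, unsigned level)
{
    return (unsigned)((trx_id >> (level * RADIX_BITS)) & RADIX_MASK);
}
```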
I also investigated the microbenchmark provided by svoj a little bit:
- the original problem of the linear scan can be mitigated with a larger MAX_LOAD value: 4.0 instead of 1.0 decreases the dirty/clean ratio from about x2000 to x600. I didn't test other workloads, but they will surely get worse. Adjusting the load factor could still make sense.
- perf reports l_find as the hot spot with 99% of the microbenchmark samples: ~19% for pointer dereferencing and ~45% on mfence (full memory barrier). I noticed that all atomic loads and stores in include/atomic/gcc_builtins.h enforce total ordering with __ATOMIC_SEQ_CST, which doesn't seem to be the most efficient choice (see the memory-ordering comparison after this list).
- The second hot spot is cursor->curr dereferencing with takes about 15% of samples. L1-dcache-load-misses event profiling shows that the point takes 94%, so the L1d cache works in non optimal way. The misses come from dereferencing the linked list nodes in the lf_hash_iterate(). The quick study shows that most of the nodes are actually allocated from different system pages, which means that the memory allocator for the list nodes isn't efficient and a SLAB-like allocator should improve the performance.
Surely a more accurate profiling of the whole server workload is required, but I believe there is still room for further micro-optimizations.