MariaDB Server / MDEV-20630

lf_hash get performance regression since the bucket size won't decrease

Details

    Description

      The rw_trx_hash is implemented on top of lf_hash. After running sysbench oltp_write_only with 1024 threads, many trx_ids have been inserted into the lf_hash, so lf_hash grows its bucket count. However, after sysbench finishes and the cleanup operation runs, the items are deleted from the lf_hash, but the bucket count never shrinks: many dummy nodes remain in the table.

      After that, running sysbench oltp_read_write with 256 threads shows a performance regression compared to running oltp_read_write directly, without first running oltp_write_only with 1024 threads, even though the data has been cleaned up.

      In the test case we can see that the lf_hash size is 512 while it holds only 2 items, so an iteration has to traverse 512 dummy nodes to reach those 2 items. That is what causes the performance regression.

      So can we add a new operation to reduce the bucket count and delete the dummy nodes?

      Attachments

        Issue Links

          Activity

            krizhanovsky Alexander Krizhanovsky added a comment - - edited

            US patent US7370054 (filed in 2004 by Sun Microsystems, so it now belongs to Oracle; https://patentimages.storage.googleapis.com/14/68/80/d8d4ce50b3430b/US7370054.pdf) addresses the problem of removing the dummy nodes when the hash table shrinks, as well as a hash table design with the dummy nodes eliminated. The patent text is somewhat obscure, so there is a risk of legal claims if we remove the dummy nodes from the current implementation.

            The problem with the dummy nodes is also discussed in a number of newer research papers, and there is newer research on concurrent hash tables and tries that outperforms the original implementation. The papers of interest are:

            1. "Dynamic-sized nonblocking hash tables", 2014, https://dl.acm.org/doi/pdf/10.1145/2611462.2611495
            2. "An Efficient Wait-free Resizable Hash Table", 2018, https://dl.acm.org/doi/pdf/10.1145/3210377.3210408
            3. "Cache-Aware Lock-Free Concurrent Hash Tries", 2011, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.6644&rep=rep1&type=pdf
            4. "Towards a Lock-Free, Fixed Size and Persistent Hash Map Design", 2017, https://www.dcc.fc.up.pt/~ricroc/homepage/publications/2017-SBAC-PAD.pdf
            5. "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings", 2007, http://www.stochasticgeometry.ie/~mdennehy/CRPITV62Askitis.pdf
            6. "Burst Tries: A Fast, Efficient Data Structure for String Keys", 2002, http://www.lindstaedt.com.br/estruturas/bursttries.pdf

            [1] and [2] propose faster hash table implementations, with comparisons against split-ordered lists.

            The existing dynamic array coupled with the split-ordered lists is very close in structure to a burst hash trie [3, 5, 6]. There are also CPU cache-conscious data structures [3, 5], which touch as few cache lines as possible. E.g. in TempestaDB we use a combination of these ideas (https://github.com/tempesta-tech/tempesta/blob/master/tempesta_db/core/htrie.c, note that the HTrie implementation is incomplete at the moment), which provides lock-free HTrie data access with constant worst-case access time (rw-locks are acquired only for hash collisions in the 64-bit key space).

            The split-ordered list is bound to using the '%' operation as its hash function, which is collision-prone, while hash tries can use much better hash functions.

            Typically, lock-/wait-free resizing is more efficient for hash tries than for classic hash tables, because a trie can grow concurrently in different nodes/branches.

            rw_trx_hash_t uses trx_id, which is more or less a sequential counter, as the key, so the key set is significantly skewed. This key distribution forces a hash trie to trade off memory consumption against concurrency. It might be beneficial to use the keys directly, without hashing.

            I also investigated the microbenchmark provided by svoj a little bit:

            1. The original problem of the linear scan can be mitigated with a larger MAX_LOAD value: 4.0 instead of 1.0 decreases the dirty/clean ratio from about x2000 to about x600. I didn't test other workloads, but they will surely get worse. Adjusting the load factor could still make sense.
            2. perf reports l_find as the hot spot at 99% of the microbenchmark: ~19% for pointer dereferencing and ~45% on mfence (a full memory barrier). I noticed that all atomic loads and stores in include/atomic/gcc_builtins.h enforce total ordering with __ATOMIC_SEQ_CST, which doesn't seem optimal.
            3. The second hot spot is the cursor->curr dereference, which takes about 15% of samples. Profiling the L1-dcache-load-misses event shows that this point accounts for 94% of the misses, so the L1d cache is used suboptimally. The misses come from dereferencing the linked-list nodes in lf_hash_iterate(). A quick study shows that most of the nodes are actually allocated from different system pages, which means the memory allocator for the list nodes isn't efficient; a SLAB-like allocator should improve performance.

            Surely a more accurate profiling of the whole server workload is required, but I believe there is still room for further micro-optimizations.


            When it comes to the InnoDB use of the lock-free hash table, MDEV-21423 would replace it with a fixed-size hash table where each cache line embeds a std::atomic-based lock (or Microsoft Windows SRWLOCK), using a spinlock when a Linux futex(2)-like interface is not available.

            Because the lock-free hash table is also being used for other purposes, I think that this ticket should remain open independent of MDEV-21423 until the remaining impact of the design limitation has been assessed.


            I think that one of the most expensive operations is a hash table traversal, that is, lf_hash_iterate(). In InnoDB, it is wrapped in trx_sys.rw_trx_hash.iterate().

            I was not able to get better performance with a lock-based hash table in MDEV-21423 so far. What I managed was to remove some calls to lf_hash_iterate() in MDEV-28445 and its follow-up fix MDEV-30357, related to determining if a PAGE_MAX_TRX_ID in a secondary index leaf page can correspond to any active transaction.

            All calls to trx_sys.rw_trx_hash.iterate_no_dups() are only executed during startup, related to recovered transactions. I do not think there is much that could be improved about them.

            Apart from trx_sys.find_same_or_older() which is needed for lock checks of secondary indexes, a prominent source of lf_hash_iterate() in InnoDB is trx_sys.snapshot_ids(), invoked by trx_sys_t::clone_oldest_view() (maybe once per second in the purge coordinator) and ReadView::open() (whenever a transaction needs to create a new read view). These operations would greatly benefit from these optimizations (or the successful implementation of MDEV-21423).


            For an Oracle-internal 8.0.0 release of MySQL, the implementation was rewritten with C++11 std::atomic (using the implicit std::memory_order_seq_cst) in 2015. There have been some changes to the code since then, including replacing more volatile qualifiers with std::atomic, and adding a cmp_func parameter.

            Performance could be improved by relaxing the sequentially-consistent ordering, by adding explicit std::memory_order to the atomic operations. This will have to be reviewed and tested very carefully, of course. I expect that we would need release-acquire ordering in many places.

            nikitamalyavin Nikita Malyavin added a comment - - edited

            I have looked through the existing hash table implementations and settled on growt for an adoption attempt:
            https://github.com/MariaDB/server/commit/ea21046874fd6b84e79247d4fb9d2d046a1c9acd

            A few notes about it follow
            + As the report promises, the size reduction problem is solved.
            + It supports arbitrary key and value types
            + Has several mode switches for better tuning
            + Has good performance compared to our lf_hash
            But:

            • The new API (the one which supports arbitrary types) is not well maintained: I had to fix a couple of bugs to get it going.
            • It has had no updates from the author in a year (I hope he's fine, since his GitHub page has no updates either).
            • The mode I was using ("unsynchronized") requires the pointer to store a 15-bit tag, which I guess can theoretically overflow (maybe it's used for its own epoch-based reclamation implementation; I didn't investigate).
            • I had to enable C++17 to build it.

            According to my tests, inserting 65K elements and then removing them doesn't affect oltp_read_only performance, so it solves the problem.

            I have also compared the insert/erase performance to lf_hash:
            I ran oltp_update_index and, after 20+ seconds, did the probing in parallel from a separate connection. The results follow.
            perf patch

            lf_hash
            @@lf_erase_time	@@lf_erase_count	erase_avg	@@lf_insert_time	@@lf_insert_count	insert_avg
            0	0	NULL	0	0	NULL
            722630	120284	6.0077	710752	120285	5.9089
            1341177	222192	6.0361	1319470	222193	5.9384
            1912175	313221	6.1049	1888206	313222	6.0283
            2412724	396209	6.0895	2399635	396207	6.0565
             
             
             
            growt
            @@lf_erase_time	@@lf_erase_count	erase_avg	@@lf_insert_time	@@lf_insert_count	insert_avg
            0	0	NULL	0	0	NULL
            507741	88144	5.7603	552448	88142	6.2677
            953192	167068	5.7054	1008164	167068	6.0345
            1420842	242471	5.8598	1609515	242471	6.6380
            1794036	313405	5.7243	2032143	313405	6.4841
            

            growt beats lf_hash by about 0.25 clocks on erases, but loses around 0.3 clocks on inserts at 16 threads.

            It's also not clear whether the element is persistent: our trx_t implementation requires unprotected access to rw_trx_hash_element_t. Table growth/shrinkage may re-allocate rw_trx_hash_element_t, which is not very convenient for us: either we'll need to guard it with a smart pointer, or we'll need a thorough safety check against this problem. In my tests nothing failed, but I don't know whether reclamation worked correctly.

            Other hashes to try:

            • folly::ConcurrentHashMap: has lock-protected writes and lock-free reads.
              + Considered to be fast.
              • Table growth/shrinkage is done by re-allocation, so moves will also happen. Reclamation is HP-based and applied table-wise (i.e. the whole table pointer is HP-protected).
              • Also has a lot of dependencies: boost, glog.
              • Requires C++17.
            • libcuckoo may be just what we need.
            • libcds: contains a number of concurrent map implementations (ordered and unordered) and a library of reclamation schemes (HP and epoch): skip lists, AVL trees and a number of hashes.
            • ASCYLIB is implemented in C and can be another set of alternatives.

            People

              nikitamalyavin Nikita Malyavin
              baotiao zongzhi chen