[MDEV-20630] lf_hash get performance regression since the bucket size won't decrease Created: 2019-09-19 Updated: 2024-01-02 |
|
| Status: | Confirmed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3, 10.4, 10.5 |
| Fix Version/s: | 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | zongzhi chen | Assignee: | Oleksandr Byelkin |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | performance | ||
| Issue Links: |
|
| Epic Link: | InnoDB trx_sys improvements | ||||||||||||||||||||
| Description |
|
The rw_trx_hash is implemented on top of LF_HASH. After running sysbench oltp_write_only with 1024 threads, many trx_ids are inserted into the lf_hash and it grows its bucket array. After the sysbench run finishes and the cleanup operation runs, the items are deleted from the lf_hash, but the bucket count never decreases: the table still contains a large number of dummy nodes. If we then run sysbench oltp_read_write with 256 threads, performance regresses compared to running oltp_read_write directly, without the preceding 1024-thread oltp_write_only run, even though the data has been cleaned up. In the test case we can see that the lf_hash size is 512 while it holds only 2 items, so an iterate operation has to visit 512 dummy nodes to reach those 2 items; that causes the performance regression. Can we add a new operation that reduces the bucket count and deletes the dummy nodes? |
| Comments |
| Comment by Sergey Vojtovich [ 2019-09-19 ] | |||||||||||||||||||||||||||||||||||||||||
|
Hi baotiao! Thanks for the bug report, looks interesting. | |||||||||||||||||||||||||||||||||||||||||
| Comment by zongzhi chen [ 2019-09-21 ] | |||||||||||||||||||||||||||||||||||||||||
|
The version is 10.5. This is the sysbench script I used to run the case: ``` ``` To amplify the effect, I used 2048 threads in the oltp_write_only case instead of 1024, and 64 threads in the oltp_read_write case. The result before running the 2048-thread oltp_write_only case: ``` General statistics: Latency (ms): ``` The result after running the 2048-thread oltp_write_only case: ``` General statistics: Latency (ms): Threads fairness: ``` In fact, oltp_write_only can be replaced by any other test case, such as oltp_update_index: any case that grows the rw_trx_hash will cause the performance regression. Right now we take an rw_lock at the trx level to re-init the lf_hash and delete the unused dummy nodes; this avoids the regression. However, I think the better way is to implement the re-init, or deletion of dummy nodes, inside lf_hash itself. |
| Comment by Sergey Vojtovich [ 2019-09-23 ] | |||||||||||||||||||||||||||||||||||||||||
|
Thanks! Makes perfect sense now. I implemented a microbenchmark that shows performance difference:
Output:
That is, clean is about 2000x faster than dirty. |
| Comment by zongzhi chen [ 2019-09-23 ] | |||||||||||||||||||||||||||||||||||||||||
|
Yes, in the trx layer we saw the iterate operation go from 2 us to 70 us. |
| Comment by Sergey Vojtovich [ 2019-09-23 ] | |||||||||||||||||||||||||||||||||||||||||
|
I've discussed this with serg, author of LF_HASH. He suggested the following alternatives:
We have not yet decided which direction to take. |
| Comment by Sergey Vojtovich [ 2019-09-23 ] | |||||||||||||||||||||||||||||||||||||||||
|
We just found out that we could micro-optimise it a bit. A quick attempt made it 3x faster. |
| Comment by zongzhi chen [ 2019-09-23 ] | |||||||||||||||||||||||||||||||||||||||||
|
Can you show us how you did it? |
| Comment by Sergey Vojtovich [ 2019-09-24 ] | |||||||||||||||||||||||||||||||||||||||||
|
A quick attempt was:
But it is not semantically correct. I'm working on a proper fix now. | |||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2019-09-24 ] | |||||||||||||||||||||||||||||||||||||||||
|
baotiao, could you try your sysbench benchmark on top of my 3 patches in bb-10.5-svoj-MDEV-20630? It is not a complete solution, but it may be enough at this scale. | |||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2019-09-25 ] | |||||||||||||||||||||||||||||||||||||||||
|
Apparently my patches didn't make a breakthrough. According to perf, lf_hash_iterate() is down from 10% to 7%, but that's not enough to fix this regression. |
| Comment by zongzhi chen [ 2019-09-25 ] | |||||||||||||||||||||||||||||||||||||||||
|
OK. We have optimized the trx layer to re-init the lf_hash when there are too many dummy nodes in it. |
| Comment by Alexander Krizhanovsky (Inactive) [ 2020-05-25 ] | |||||||||||||||||||||||||||||||||||||||||
|
US patent US7370054 (filed in 2004 by Sun Microsystems, i.e. it belongs to Oracle now, https://patentimages.storage.googleapis.com/14/68/80/d8d4ce50b3430b/US7370054.pdf) addresses both the removal of dummy nodes when a hash table shrinks and a hash table design that eliminates dummy nodes. The patent text is somewhat obscure, so there is a risk of legal claims if we remove the dummy nodes from the current implementation. The problem with dummy nodes is also discussed in a number of newer research papers, and there is newer research on concurrent hash tables and tries that outperform the original implementation. The papers of interest are:
[1] and [2] propose faster hash table implementations with comparisons against split-ordered lists. The existing dynamic array coupled with the split-ordered lists is very close in structure to a burst hash trie [3, 5, 6]. There are also CPU cache-conscious data structures [3, 5], which touch as few cache lines as possible. E.g. in TempestaDB we use an aggregation of these ideas (https://github.com/tempesta-tech/tempesta/blob/master/tempesta_db/core/htrie.c; note that the HTrie implementation is incomplete at the moment), which provides lock-free HTrie data access with constant worst-case access time (rw-locks are acquired only for hash collisions in the 64-bit key space). The split-ordered list is bound to use the '%' operation as its hash function, which is collision-prone, while hash tries can use much better hash functions. Lock/wait-free resizing is typically more efficient for hash tries than for classic hash tables, because a trie can grow concurrently in different nodes/branches. rw_trx_hash_t uses trx_id, which is more or less a sequential counter, as the key, so the key set is significantly skewed. This key-distribution property forces a hash trie to trade off memory consumption against concurrency. It might be beneficial to use the keys directly, without hashing. I also investigated the microbenchmark provided by svoj a little bit:
Surely a more accurate profiling of the whole server workload is required, but I believe there is still room for further micro-optimizations. | |||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2022-04-13 ] | |||||||||||||||||||||||||||||||||||||||||
|
When it comes to the InnoDB use of the lock-free hash table, MDEV-21423 would replace it with a fixed-size hash table in which each cache line embeds a std::atomic-based lock (or SRWLOCK on Microsoft Windows), using a spinlock when a Linux futex(2)-like interface is not available. Because the lock-free hash table is also used for other purposes, I think this ticket should remain open independently of MDEV-21423 until the remaining impact of the design limitation has been assessed. |
| Comment by Marko Mäkelä [ 2023-03-15 ] | |||||||||||||||||||||||||||||||||||||||||
|
I think that one of the most expensive operations is the hash table traversal, that is, lf_hash_iterate(); in InnoDB it is wrapped in trx_sys.rw_trx_hash.iterate(). I was not able to get better performance with a lock-based hash table in MDEV-21423 so far. What I managed was to remove some calls to lf_hash_iterate(). All calls to trx_sys.rw_trx_hash.iterate_no_dups() are executed only during startup, related to recovered transactions; I do not think there is much that could be improved about them. Apart from trx_sys.find_same_or_older(), which is needed for lock checks of secondary indexes, a prominent source of lf_hash_iterate() calls in InnoDB is trx_sys.snapshot_ids(), invoked by trx_sys_t::clone_oldest_view() (maybe once per second, in the purge coordinator) and by ReadView::open() (whenever a transaction needs to create a new read view). These operations would greatly benefit from such optimizations (or from a successful implementation of MDEV-21423). |
| Comment by Marko Mäkelä [ 2023-05-09 ] | |||||||||||||||||||||||||||||||||||||||||
|
For an Oracle-internal 8.0.0 release of MySQL, the implementation was rewritten in 2015 using C++11 std::atomic (with the implicit std::memory_order_seq_cst). There have been some changes to the code since then, including replacing some more volatile variables with std::atomic, and adding a cmp_func parameter. Performance could be improved by relaxing the sequentially consistent ordering, that is, by passing an explicit std::memory_order to the atomic operations. This will have to be reviewed and tested very carefully, of course. I expect that we would need release-acquire ordering in many places. |