[MDEV-20630] lf_hash get performance regression since the bucket size won't decrease Created: 2019-09-19  Updated: 2024-01-02

Status: Confirmed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.3, 10.4, 10.5
Fix Version/s: 10.6

Type: Bug Priority: Major
Reporter: zongzhi chen Assignee: Oleksandr Byelkin
Resolution: Unresolved Votes: 1
Labels: performance

Issue Links:
Relates
relates to MDEV-21423 lock-free trx_sys get performance reg... Stalled
relates to MDEV-28445 Secondary index locking invokes costl... Closed
relates to MDEV-30357 Performance regression in locking rea... Closed
relates to MDEV-33067 SCN(Sequence Commit Number) based MVCC Open
Epic Link: InnoDB trx_sys improvements

 Description   

The rw_trx_hash is implemented on top of lf_hash. After running sysbench oltp_write_only with 1024 threads, many trx_ids are inserted into the lf_hash, and the lf_hash grows its bucket count. After the sysbench run finishes and the cleanup operation runs, the items are deleted from the lf_hash, but the bucket count never decreases, so many dummy nodes remain in it.

After that, running sysbench oltp_read_write with 256 threads shows a performance regression compared to running oltp_read_write directly, without first running oltp_write_only with 1024 threads, even though the data has been cleaned up.

In the test case, the lf_hash size is 512 but it holds only 2 items, so an iterate operation has to walk 512 dummy nodes to reach those 2 items. That causes the performance regression.

So, can we add a new operation to reduce the bucket count and delete the dummy nodes?



 Comments   
Comment by Sergey Vojtovich [ 2019-09-19 ]

Hi baotiao! Thanks for the bug report, looks interesting.
1. What was the MariaDB version?
2. What kind of hardware was that (specifically number of sockets/cores/threads and was that Intel/ARM/IBM?)
3. Could you also share the numbers that you actually observe? At 1024 threads, 256 after 1024 and 256 clean.

Comment by zongzhi chen [ 2019-09-21 ]

The version is 10.5.
The hardware is 64 cores: Intel(R) Xeon(R) CPU E5-2682 v4 @ 2.50GHz.
We are running a CPU-bound case.

This is the sysbench script I used to run the case:

```
build/bin/sysbench oltp_write_only --mysql-host=$mysql_host --mysql-port=$port --mysql-password=22192219 --mysql-user=$user --tables=$tables --table_size=$table_size --threads=1024 --rand-type=uniform --report-interval=10 prepare
build/bin/sysbench oltp_read_write --mysql-host=$mysql_host --mysql-port=$port --mysql-password=22192219 --mysql-user=$user --tables=$tables --table_size=$table_size --threads=64 --max-time=300 --report-interval=10 --rand-type=uniform run
build/bin/sysbench oltp_write_only --mysql-host=$mysql_host --mysql-port=$port --mysql-password=22192219 --mysql-user=$user --tables=$tables --table_size=$table_size --threads=2048 --max-time=300 --report-interval=10 --rand-type=uniform run
build/bin/sysbench $bench_type --mysql-host=$mysql_host --mysql-port=$port --mysql-password=22192219 --mysql-user=$user --tables=$tables --table_size=$table_size --threads=$threads --rand-type=uniform --report-interval=10 cleanup
#
#
build/bin/sysbench oltp_write_only --mysql-host=$mysql_host --mysql-port=$port --mysql-password=22192219 --mysql-user=$user --tables=$tables --table_size=$table_size --threads=1024 --rand-type=uniform --report-interval=10 prepare
build/bin/sysbench oltp_read_write --mysql-host=$mysql_host --mysql-port=$port --mysql-password=22192219 --mysql-user=$user --tables=$tables --table_size=$table_size --threads=64 --max-time=300 --report-interval=10 --rand-type=uniform run

```

In this case, in order to amplify the effect, I used 2048 threads in the oltp_write_only case instead of 1024, and 64 threads in the oltp_read_write case.

This is the result before running the 2048-thread oltp_write_only case:

```
SQL statistics:
queries performed:
read: 60657506
write: 7951689
other: 18044001
total: 86653196
transactions: 4332487 (14440.51 per sec.)
queries: 86653196 (288821.75 per sec.)
ignored errors: 192 (0.64 per sec.)
reconnects: 0 (0.00 per sec.)

General statistics:
total time: 300.0211s
total number of events: 4332487

Latency (ms):
min: 1.86
avg: 4.43
max: 30.78
95th percentile: 6.09
sum: 19194908.12
```

The result after running the 2048-thread oltp_write_only case:

```
SQL statistics:
queries performed:
read: 50040942
write: 9934534
other: 11511582
total: 71487058
transactions: 3574352 (11913.58 per sec.)
queries: 71487058 (238271.60 per sec.)
ignored errors: 1 (0.00 per sec.)
reconnects: 0 (0.00 per sec.)

General statistics:
total time: 300.0214s
total number of events: 3574352

Latency (ms):
min: 2.37
avg: 5.37
max: 26.78
95th percentile: 7.30
sum: 19196160.86

Threads fairness:
events (avg/stddev): 55849.2500/8280.29
execution time (avg/stddev): 299.9400/0.01

```

In fact, oltp_write_only can be replaced by any other test case, such as oltp_update_index: any case that grows the rw_trx_hash size will cause the performance regression.

Right now, we use an rw_lock at the trx level to reinitialize the lf_hash and delete the unused dummy nodes. With that we can avoid the performance regression. However, I think the better way is to implement re-initialization, or deletion of dummy nodes, inside lf_hash itself.

Comment by Sergey Vojtovich [ 2019-09-23 ]

Thanks! Makes perfect sense now. I implemented a microbenchmark that shows performance difference:

```
#include "my_global.h"
#include "lf.h"

static my_bool noop(void *arg1, void *arg2) { return FALSE; }

int main(void)
{
  LF_HASH lf_hash;
  LF_PINS *pins;
  ulonglong dirty, clean;

  lf_hash_init(&lf_hash, sizeof(int), LF_HASH_UNIQUE, 0, sizeof(int), 0,
               &my_charset_bin);
  pins= lf_hash_get_pins(&lf_hash);

  clean= microsecond_interval_timer();
  for (int i= 0; i < 10000; i++)
    lf_hash_iterate(&lf_hash, pins, noop, 0);
  clean= microsecond_interval_timer() - clean;
  printf("Clean: %f\n", (double) clean / 1000000);

  for (int i= 0; i < 10000; i++)
    lf_hash_insert(&lf_hash, pins, &i);
  for (int i= 0; i < 10000; i++)
    lf_hash_delete(&lf_hash, pins, &i, sizeof(i));

  dirty= microsecond_interval_timer();
  for (int i= 0; i < 10000; i++)
    lf_hash_iterate(&lf_hash, pins, noop, 0);
  dirty= microsecond_interval_timer() - dirty;
  printf("Dirty: %f\n", (double) dirty / 1000000);

  printf("Dirty vs clean ratio: x%llu\n", dirty / clean);

  lf_hash_put_pins(pins);
  lf_hash_destroy(&lf_hash);
  return 0;
}
```

Output:

```
Clean: 0.003685
Dirty: 7.710357
Dirty vs clean ratio: x2092
```

That is, the clean hash is about 2000x faster to iterate than the dirty one.

Comment by zongzhi chen [ 2019-09-23 ]

Yes, at the trx layer we saw the iterate operation go from 2us to 70us.
So what's the next step?
Can we support a resize operation in lf_hash?
However, I find it is not easy to implement.

Comment by Sergey Vojtovich [ 2019-09-23 ]

I've discussed this with serg, the author of LF_HASH. He suggested the following alternatives:

  • check if there were any updates (that would potentially fix this issue) to the "Split-Ordered Lists: Lock-Free Extensible Hash Tables" white paper
  • try implementing another algorithm
  • try another library

We have not decided which direction to take yet.

Comment by Sergey Vojtovich [ 2019-09-23 ]

We just found out that we could micro-optimise it a bit. A quick attempt made it 3x faster.

Comment by zongzhi chen [ 2019-09-23 ]

Can you show us how you did it?
I ran your test on my own machine; it shows that clean is about 5000x faster than dirty.

Comment by Sergey Vojtovich [ 2019-09-24 ]

A quick attempt was:

```
diff --git a/include/lf.h b/include/lf.h
index e5701a7..d1a36d1 100644
--- a/include/lf.h
+++ b/include/lf.h
@@ -77,7 +77,7 @@ typedef struct {
 #define lf_pin(PINS, PIN, ADDR)                                \
   do {                                                          \
     compile_time_assert(PIN < LF_PINBOX_PINS);                  \
-    my_atomic_storeptr(&(PINS)->pin[PIN], (ADDR));              \
+    my_atomic_storeptr_explicit(&(PINS)->pin[PIN], (ADDR), MY_MEMORY_ORDER_RELAXED);              \
   } while(0)
```

But it is not semantically correct. I'm working on a proper fix now.

Comment by Sergey Vojtovich [ 2019-09-24 ]

baotiao, could you try your sysbench benchmark on top of my 3 patches in bb-10.5-svoj-MDEV-20630? It is not a complete solution, but it may be enough at this scale.
https://github.com/MariaDB/server/commits/bb-10.5-svoj-MDEV-20630

Comment by Sergey Vojtovich [ 2019-09-25 ]

Apparently my patches didn't make a breakthrough. According to perf, lf_hash_iterate() is down from 10% to 7%, but that's not enough to fix this regression.

Comment by zongzhi chen [ 2019-09-25 ]

OK. We have optimized at the trx layer to reinitialize the lf_hash when there are too many dummy nodes in it.
That solves the regression issue for us.

Comment by Alexander Krizhanovsky (Inactive) [ 2020-05-25 ]

US patent US7370054 (filed in 2004 by Sun Microsystems, so it now belongs to Oracle; https://patentimages.storage.googleapis.com/14/68/80/d8d4ce50b3430b/US7370054.pdf) addresses the problem of removing the dummy nodes on hash table shrinking, as well as a hash table design with the dummy nodes eliminated. The patent text is somewhat obscure, so there is a risk of legal claims if the dummy nodes are removed from the current implementation.

The problem with the dummy nodes is also discussed in a number of newer research papers. There is also newer research on concurrent hash tables and tries that outperform the original implementation. The papers of interest are:

  1. "Dynamic-sized nonblocking hash tables", 2014, https://dl.acm.org/doi/pdf/10.1145/2611462.2611495
  2. "An Efficient Wait-free Resizable Hash Table", 2018, https://dl.acm.org/doi/pdf/10.1145/3210377.3210408
  3. "Cache-Aware Lock-Free Concurrent Hash Tries", 2011, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.465.6644&rep=rep1&type=pdf
  4. "Towards a Lock-Free, Fixed Size and Persistent Hash Map Design", 2017, https://www.dcc.fc.up.pt/~ricroc/homepage/publications/2017-SBAC-PAD.pdf
  5. "HAT-trie: A Cache-conscious Trie-based Data Structure for Strings", 2007, http://www.stochasticgeometry.ie/~mdennehy/CRPITV62Askitis.pdf
  6. "Burst Tries: A Fast, Efficient Data Structure for String Keys", 2002, http://www.lindstaedt.com.br/estruturas/bursttries.pdf

[1] and [2] propose faster hash table implementations, with comparisons against split-ordered lists.

The existing dynamic array coupled with the split-ordered lists is very close in structure to a burst hash trie [3, 5, 6]. There are also CPU cache-conscious data structures [3, 5], which touch as few cache lines as possible. E.g., in TempestaDB we use an aggregation of these ideas (https://github.com/tempesta-tech/tempesta/blob/master/tempesta_db/core/htrie.c; note that the HTrie implementation is incomplete at the moment), which provides lock-free HTrie data access with constant worst-case access time (rw-locks are acquired only for hash collisions in the 64-bit key space).

The split-ordered list is bound to using the '%' operation as the hash function, which is collision-prone, while hash tries can use much better hash functions.

Typically lock/wait-free resizing is more efficient for hash tries than classic hash tables because the trie can grow concurrently in different nodes/branches.

rw_trx_hash_t uses trx_id, which is more or less a sequential counter, as the key, so the key set is significantly skewed. This key-distribution property forces a hash trie to trade off memory consumption against concurrency. It might be beneficial to use the keys directly, without hashing.

I also investigated the microbenchmark provided by svoj a little bit:

  1. the original problem of the linear scan can be mitigated with a larger MAX_LOAD value: 4.0 instead of 1.0 decreases the dirty/clean ratio from about x2000 to x600. I didn't test other workloads, but they will surely get worse. Still, adjusting the load factor could make sense.
  2. perf reports l_find as the hot spot with 99% of the microbenchmark samples: ~19% on pointer dereferencing and ~45% on mfence (a full memory barrier). I noticed that all atomic loads and stores in include/atomic/gcc_builtins.h enforce total ordering with __ATOMIC_SEQ_CST, which does not seem to be the most efficient choice.
  3. The second hot spot is the cursor->curr dereference, which takes about 15% of samples. L1-dcache-load-misses event profiling shows that this point accounts for 94%, so the L1d cache is used suboptimally. The misses come from dereferencing the linked-list nodes in lf_hash_iterate(). A quick study shows that most of the nodes are actually allocated from different system pages, which means the memory allocator for the list nodes is inefficient; a SLAB-like allocator should improve performance.

Surely a more accurate profiling of the whole server workload is required, but I believe there is still room for further micro-optimizations.

Comment by Marko Mäkelä [ 2022-04-13 ]

When it comes to the InnoDB use of the lock-free hash table, MDEV-21423 would replace it with a fixed-size hash table where each cache line embeds a std::atomic based lock (or Microsoft Windows SRWLOCK), using a spinlock when a Linux futex(2) like interface is not available.

Because the lock-free hash table is also being used for other purposes, I think that this ticket should remain open independent of MDEV-21423 until the remaining impact of the design limitation has been assessed.

Comment by Marko Mäkelä [ 2023-03-15 ]

I think that one of the most expensive operations is a hash table traversal, that is, lf_hash_iterate(). In InnoDB, it is wrapped in trx_sys.rw_trx_hash.iterate().

I was not able to get better performance with a lock-based hash table in MDEV-21423 so far. What I managed was to remove some calls to lf_hash_iterate() in MDEV-28445 and its follow-up fix MDEV-30357, related to determining if a PAGE_MAX_TRX_ID in a secondary index leaf page can correspond to any active transaction.

All calls to trx_sys.rw_trx_hash.iterate_no_dups() are only executed during startup, related to recovered transactions. I do not think there is much that could be improved about them.

Apart from trx_sys.find_same_or_older() which is needed for lock checks of secondary indexes, a prominent source of lf_hash_iterate() in InnoDB is trx_sys.snapshot_ids(), invoked by trx_sys_t::clone_oldest_view() (maybe once per second in the purge coordinator) and ReadView::open() (whenever a transaction needs to create a new read view). These operations would greatly benefit from these optimizations (or the successful implementation of MDEV-21423).

Comment by Marko Mäkelä [ 2023-05-09 ]

For an Oracle-internal 8.0.0 release of MySQL, the implementation was rewritten in C++11 std::atomic (using the implicit std::memory_order_seq_cst) in 2015. There have been some changes to the code since then, including replacing more volatile variables with std::atomic and adding a cmp_func parameter.

Performance could be improved by relaxing the sequentially consistent ordering, by passing explicit std::memory_order arguments to the atomic operations. This will, of course, have to be reviewed and tested very carefully. I expect that we would need release-acquire ordering in many places.

Generated at Thu Feb 08 09:00:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.