[MDEV-21574] MyRocks: Range Locking: RCU-based cache for the root node Created: 2020-01-27  Updated: 2020-04-06

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Fix Version/s: N/A

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: None

Attachments: File range-lock-counters1-b.diff     File range-lock-counters1.diff     File rcu4a.diff     File rcu4b.diff     File results-feb3.tgz     PNG File screenshot-1.png     PNG File screenshot-2.png     PNG File screenshot-3.png     PNG File screenshot-4.png    
Issue Links:
Relates
relates to MDEV-15603 Gap Lock support in MyRocks Stalled
relates to MDEV-22171 Range Locking: explore other data str... Open

 Description   

Problem description

Benchmarks show that Range-Locking mode experiences a lot of contention on the
locktree's root node mutex.

This is because every lock acquisition goes as follows:

  • Acquire the locktree's root node mutex.
  • Compare the root node's range with the range we intend to lock.
  • Typically, the ranges do not overlap, so we need to go to the left or right child.
  • Acquire the mutex of the left (or right) child node.
  • Possibly do a tree rotation.
  • Release the root node mutex.

Then the process repeats for the child node, until we either find the node whose range overlaps with the range we intend to lock, or find the place where the new node should be inserted.

Proposed solution

The idea is that most lock acquisitions do not actually need to modify the
root node. They could use an RCU mechanism to access it.

The tree will have a member variable:

class locktree {
  bool root_rcu_enabled;
  ...
};

The fast-path reader will do this:

tree_node *get_child_with_rcu(lookup_key) {
  tree_node *res = nullptr;
  rcu_read_lock();
  if (locktree->root_rcu_enabled) {
    cmp = compare(lookup_key, locktree->root->key);
    if (cmp == LESS_THAN) {
      locktree->root->left_child->mutex_lock();
      res = locktree->root->left_child;
    }
    // if (cmp == GREATER_THAN) { the same with right_child }
    // if the key overlaps the root itself, res stays nullptr
  }
  rcu_read_unlock();
  return res;
}

If this function returns nullptr, we need to touch the root node itself, so we retry via the locking path:

void modify_root_node() {

  // Stop new readers from using the RCU fast path
  locktree->root_rcu_enabled = false;

  // Wait until all current RCU readers are gone
  synchronize_rcu();

  locktree->root->mutex_lock();

  // Here we can make modifications to the root node

  // When we are done, release the mutex and re-enable the RCU
  locktree->root->mutex_unlock();
  locktree->root_rcu_enabled = true;
}

Other details

It is not clear if/how this approach could be extended to non-root nodes.



 Comments   
Comment by Sergei Petrunia [ 2020-01-27 ]

A question: how expensive is the synchronize_rcu call? AFAIU, it does O(#active_threads) operations. Does it wait or spin if it finds a reader that is not done? (EDIT: most likely, it walks through the list of "active" threads and checks whether any of them is in the read-lock section. If one is, it spins until that reader is done.)

Comment by Sergei Petrunia [ 2020-02-08 ]

Update (posting late): First implementation of RCU-based cache: rcu4a.diff rcu4b.diff

The first benchmark results (results-feb3.tgz) were disappointing:

SYSBENCH_TEST=oltp_write_only.lua
Threads,        QPS
1       17653.08
5       54328.13
10      75474.21
20      80733.12
40      82492.24
60      80959.63
80      78199.64
100     76098.70
ubuntu@ip-172-31-17-220:~/range-locking-benchmark$ ./summarize-result.sh 10-range-locking-rcu-fixed
SERVER_DIR=mysql-5.6-range-locking-rcu
SYSBENCH_TEST=oltp_write_only.lua
Threads,        QPS
1       15321.58
5       24620.80
10      21940.91
20      20527.80
40      19053.72
60      18122.92
80      17221.54
100     16536.54

Comment by Sergei Petrunia [ 2020-02-08 ]

First candidate reason: the locktree often appears to be unbalanced at the root node.

Comment by Sergei Petrunia [ 2020-02-10 ]

Added a counter to check how often the root is observed to have just one child.
It turns out that this is the case most of the time. Tree rotations do not touch the root node.

threads, rangelock_acquire, rangelock_acquire_one_child
 10, 1662607, 1320052
 20, 1614924, 1295149
 40, 1542945, 1211934
 60, 1488727, 1227706
 80, 1446373, 1130178
100, 1395247, 1071909

Comment by Sergei Petrunia [ 2020-02-10 ]

Added rotations. Now, if the root node has only one child and that child's subtree has depth > 3, a rotation is performed so that the root node has two children.

Comment by Sergei Petrunia [ 2020-02-10 ]

... This didn't help. The counters still show a picture like this:

RANGELOCK_ACQUIRE       1,675,656
RANGELOCK_ACQUIRE_RCU   23,402
...
RANGELOCK_EXTRA_COUNTER3        1,651,999

One can see that only a very small fraction of lock acquire operations used the RCU.

For lock releases the situation is similar, although the RCU is used more often (not sure why):

RANGELOCK_REMOVE        1,665,812
RANGELOCK_REMOVE_RCU    192,521

Comment by Sergei Petrunia [ 2020-02-10 ]

RANGELOCK_EXTRA_COUNTER3 here counts the number of times the code examined m_rangetree->rcu_cache_usable, found it to be FALSE, and didn't use the RCU.

The issue is that the current code works as follows: if we can't use the RCU codepath, we take the non-RCU one (where we acquire the mutex). Could the following be happening: one thread needs to use the non-RCU codepath on the root node (e.g. it is releasing a lock that is in the root node). Other threads start waiting; when their wait finishes, they take the non-RCU code path, too. This causes yet more threads to fail to use the RCU code path and wait, and so forth...

Comment by Sergei Petrunia [ 2020-03-15 ]

Finally managed to get rid of crashes/corruption in the "improved" RCU code. Pushed it here:

https://github.com/spetrunia/mysql-5.6/tree/range-locking-fb-mysql-5.6.35-rcu
https://github.com/spetrunia/rocksdb/tree/rocksdb-range-locking-rcu

Comment by Sergei Petrunia [ 2020-03-15 ]

c5.2xlarge:

ubuntu@ip-172-31-17-220:~/range-locking-benchmark$ ./summarize-result.sh 105-orig
SERVER_DIR=mysql-5.6-orig
SYSBENCH_TEST=oltp_write_only.lua
Threads,        QPS
1       17531.27
5       58527.50
10      94065.11
20      101323.16
40      101602.47
60      100858.54
80      100453.83
100     99408.72
ubuntu@ip-172-31-17-220:~/range-locking-benchmark$ ./summarize-result.sh 105-range-locking-base 
SERVER_DIR=mysql-5.6-range-locking
SYSBENCH_TEST=oltp_write_only.lua
Threads,        QPS
1       17777.90
5       54689.06
10      76271.56
20      81926.33
40      84046.30
60      83086.25
80      80931.29
100     78644.12
ubuntu@ip-172-31-17-220:~/range-locking-benchmark$ ./summarize-result.sh 104-rcu-rel
SERVER_DIR=mysql-5.6-range-locking-rcu
SYSBENCH_TEST=oltp_write_only.lua
Threads,        QPS
1       16099.38
5       33605.91
10      29440.77
20      29464.23
40      28729.21
60      28841.09
80      29461.01
100     30389.78

Comment by Sergei Petrunia [ 2020-03-16 ]

https://docs.google.com/spreadsheets/d/1TFxlLx7rALVTuOWX8AqdVuWP9BmdxgBdLmNyZoYtbjI/edit#gid=1271981084

This shows that the RCU is now used much more commonly, but the percentage of operations that use the RCU is still too low.

Questions:

  • Is this a fundamental limitation, or is there a way to increase the number of lock acquisitions/releases that can take the RCU path?
Comment by Sergei Petrunia [ 2020-03-17 ]

Could it be that my idea about how often the root node needs to be modified is wrong?
Added counters to check how often this happens: range-lock-counters1-b.diff range-lock-counters1.diff
and ran the benchmark.

Results: https://docs.google.com/spreadsheets/d/1TFxlLx7rALVTuOWX8AqdVuWP9BmdxgBdLmNyZoYtbjI/edit#gid=1704131097

no surprises so far...

Comment by Sergei Petrunia [ 2020-03-30 ]

Improved the patch; ran it on c5.9xlarge (two runs with the patch: one without perf counters and one with):

threads	orig	range-locking	range-locking-rcu	range-locking-rcu-cntr
1	21916.79	22166.78	21960.01	21347.42
5	98125.35	90756.97	63205.76	61617.9
10	165994.26	143740.83	94505.42	93738.68
20	189590.09	154530.26	114082.09	113270.92
40	209484.58	145180.66	130328.97	129862.27
60	213205.54	140677.04	133218.07	132456.41
80	216881.11	138068.66	132357.3	131860.98
100	210547.05	137074.94	131686.54	130884.23

It is not as bad as before, but it is still worse than regular range locking.

Comment by Sergei Petrunia [ 2020-03-30 ]

Counters for the new patch:

https://docs.google.com/spreadsheets/d/1aApVWcYp9CogAUfho9wvH55-h4a0SkdKFAqJFNXQku0/edit#gid=0

At a high number of threads, about 90% of operations use the RCU read path (and don't acquire the mutex on the root node). With the previous patch, it was 50-70%.

If an operation (lock acquisition/release) cannot use the RCU read path, it acquires the root node's mutex. It then examines the root node and arrives at one of these outcomes:

  1. It doesn't need to modify the root node. It releases the mutex and moves to the child. Note that this path doesn't prevent others from using the RCU read path.
  2. It needs to modify (e.g. delete) the root node itself (or one of its direct children). This is counted in RCU_DISABLE_ROOT_HIT.
  3. It needs to modify the root node due to a re-balance operation on a child. This is counted in RCU_DISABLE_DUE_TO_REBALANCE.

One can see that RCU_DISABLE_DUE_TO_REBALANCE is relatively rare.
RCU_DISABLE_ROOT_HIT is also quite infrequent.

Generated at Thu Feb 08 09:08:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.