[MDEV-15603] Gap Lock support in MyRocks Created: 2018-03-20  Updated: 2023-04-28

Status: Stalled
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Fix Version/s: None

Type: Task Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 1
Labels: None

Attachments: PNG File screenshot-1.png     PNG File screenshot-2.png     PNG File screenshot-3.png    
Issue Links:
PartOf
includes MDEV-17873 MyRocks-Gap-Lock: lock wait doesn't s... Open
includes MDEV-17874 MyRocks-Gap-Lock: Lock memory overhead Closed
includes MDEV-17887 MyRocks-Gap-Lock: information about c... Closed
includes MDEV-18104 MyRocks-Gap-Lock: range locking bound... Closed
includes MDEV-18227 MyRocks-Gap-Lock: Lock escalation and... Closed
includes MDEV-19451 MyRocks: Range Locking: shared point ... Open
includes MDEV-19986 MyRocks: Range Locking: SeekForUpdate... Open
includes MDEV-21314 Range Locking: individual rows are lo... Open
Relates
relates to MDEV-18856 Benchmark range locking Closed
relates to MDEV-21574 MyRocks: Range Locking: RCU-based cac... Open
relates to MDEV-21186 Benchmark range locking - nov-dec 2019 Closed

 Description   

(The upstream task is: https://github.com/facebook/mysql-5.6/issues/800 )

Notes about how to use PerconaFT:

1. Data structures
1.1 A Global Lock Tree Manager object
1.2 A separate Lock Tree for each table
1.3 Each transaction keeps track of the ranges it holds locks on
2. Functions
2.1 Initializing the Lock Manager
2.2 Create Lock Tree for a table
2.3 Getting a lock
2.4 Releasing a lock.
2.5 Releasing all of the transaction's locks

1. Data structures

1.1 A Global Lock Tree Manager object

There needs to be a global locktree_manager.

See PerconaFT/src/ydb-internal.h,

  struct __toku_db_env_internal {
    toku::locktree_manager ltm;

1.2 A separate Lock Tree for each table

TokuDB uses a separate Lock Tree for each table db->i->lt.

1.3 Each transaction keeps track of the ranges it holds locks on

Each transaction has a list of ranges that it is holding locks on. It is referred to like so

  db_txn_struct_i(txn)->lt_map

and is stored in this structure, together with a mutex to protect it:

  struct __toku_db_txn_internal {
      // maps a locktree to a buffer of key ranges that are locked.
      // it is protected by the txn_mutex, so hot indexing and a client
      // thread can concurrently operate on this txn.
      toku::omt<txn_lt_key_ranges> lt_map;
      toku_mutex_t txn_mutex;

The mutex is there, because the list may be modified by the lock escalation process (which may be invoked from a different thread).
(See toku_txn_destroy for how to free this)

2. Functions

Most functions mentioned here are from storage/tokudb/PerconaFT/src/ (ydb_txn.cc, ydb_row_lock.cc) - this is TokuDB's layer above the Lock Tree.

2.1 Initializing the Lock Manager

TODO

2.2 Create Lock Tree for a table

TokuDB does it when it opens a table's table_share. It is done like so:

        db->i->lt = db->dbenv->i->ltm.get_lt(db->i->dict_id,
                                             toku_ft_get_comparator(db->i->ft_handle),
                                             &on_create_extra);

Then, one needs to release it:

db->dbenv->i->ltm.release_lt(db->i->lt);

After the last release_lt call, the Lock Tree will be deleted (it is guaranteed to be empty at that point).

(TODO: this is easy to arrange if Toku locks are invoked from the MyRocks level. But if they are invoked from within RocksDB, this is harder, as RocksDB doesn't have any concept of tables or indexes. As a first step, we can pretend all keys are in one table.)

2.3 Getting a lock

An example is this function:

// Get a range lock.
// Return when the range lock is acquired or the default lock tree timeout has expired.
int toku_db_get_range_lock(DB *db, DB_TXN *txn, const DBT *left_key, const DBT *right_key,
        toku::lock_request::type lock_type) {

It is also possible to start an asynchronous lock request and then wait for it (see toku_db_start_range_lock, toku_db_wait_range_lock). We don't seem to have a use for this.

Point locks are obtained by passing the same key as left_key and right_key.

2.4 Releasing a lock.

TokuDB doesn't seem to release individual locks (all locks are held until transaction either commits or is aborted).

LockTree has a function to release locks from a specified range:

locktree::release_locks(TXNID txnid, const range_buffer *ranges)

Besides calling that, one will need to

  • wake up all waiting lock requests - release_locks doesn't wake them up. There is the toku::lock_request::retry_all_lock_requests call, which retries all pending requests (which doesn't seem efficient... but maybe it is ok?)
  • remove the released range from the transaction's list of held locks (which is in db_txn_struct_i(txn)->lt_map). This is not essential, because that list is only used for releasing the locks when the transaction finishes.

2.5 Releasing all of the transaction's locks

See PerconaFT/src/ydb_txn.cc:

static void toku_txn_release_locks(DB_TXN *txn) {
    // Prevent access to the locktree map while releasing.
    // It is possible for lock escalation to attempt to
    // modify this data structure while the txn commits.
    toku_mutex_lock(&db_txn_struct_i(txn)->txn_mutex);
 
    size_t num_ranges = db_txn_struct_i(txn)->lt_map.size();
    for (size_t i = 0; i < num_ranges; i++) {
        txn_lt_key_ranges ranges;
        int r = db_txn_struct_i(txn)->lt_map.fetch(i, &ranges);
        invariant_zero(r);
        toku_db_release_lt_key_ranges(txn, &ranges);
    }
 
    toku_mutex_unlock(&db_txn_struct_i(txn)->txn_mutex);
}



 Comments   
Comment by Sergei Petrunia [ 2018-03-23 ]

TokuDB's lock tree is here: storage/tokudb/PerconaFT/locktree. They lock
ranges.

(gdb) wher
  #0  toku::locktree::sto_try_acquire (this=0x7fff700342c0, prepared_lkr=0x7fffd4b6c390, txnid=11, left_key=0x7fffd4b6c750, right_key=0x7fffd4b6c770) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/locktree/locktree.cc:291
  #1  0x00007ffff4d6eaa1 in toku::locktree::acquire_lock (this=0x7fff700342c0, is_write_request=true, txnid=11, left_key=0x7fffd4b6c750, right_key=0x7fffd4b6c770, conflicts=0x7fffd4b6c4c0) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/locktree/locktree.cc:380
  #2  0x00007ffff4d6eb73 in toku::locktree::try_acquire_lock (this=0x7fff700342c0, is_write_request=true, txnid=11, left_key=0x7fffd4b6c750, right_key=0x7fffd4b6c770, conflicts=0x7fffd4b6c4c0, big_txn=false) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/locktree/locktree.cc:399
  #3  0x00007ffff4d6ec1a in toku::locktree::acquire_write_lock (this=0x7fff700342c0, txnid=11, left_key=0x7fffd4b6c750, right_key=0x7fffd4b6c770, conflicts=0x7fffd4b6c4c0, big_txn=false) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/locktree/locktree.cc:412
  #4  0x00007ffff4d72dc4 in toku::lock_request::start (this=0x7fffd4b6c5b0) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/locktree/lock_request.cc:165
  #5  0x00007ffff4d603aa in toku_db_start_range_lock (db=0x7fff700271e0, txn=0x7fff70060600, left_key=0x7fffd4b6c750, right_key=0x7fffd4b6c770, lock_type=toku::lock_request::WRITE, request=0x7fffd4b6c5b0) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/src/ydb_row_lock.cc:211
  #6  0x00007ffff4d6022e in toku_db_get_range_lock (db=0x7fff700271e0, txn=0x7fff70060600, left_key=0x7fffd4b6c750, right_key=0x7fffd4b6c770, lock_type=toku::lock_request::WRITE) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/src/ydb_row_lock.cc:182
  #7  0x00007ffff4e31643 in c_set_bounds (dbc=0x7fff7005f000, left_key=0x7fffd4b6c750, right_key=0x7fffd4b6c770, pre_acquire=true, out_of_range_error=-30989) at /home/psergey/dev-git/10.3-r2/storage/tokudb/PerconaFT/src/ydb_cursor.cc:714
  #8  0x00007ffff4d195df in ha_tokudb::prelock_range (this=0x7fff7002cdf8, start_key=0x7fff7002cee0, end_key=0x7fff7002cf00) at /home/psergey/dev-git/10.3-r2/storage/tokudb/ha_tokudb.cc:5978
  #9  0x00007ffff4d19a31 in ha_tokudb::read_range_first (this=0x7fff7002cdf8, start_key=0x7fff7002cee0, end_key=0x7fff7002cf00, eq_range=false, sorted=true) at /home/psergey/dev-git/10.3-r2/storage/tokudb/ha_tokudb.cc:6025
  #10 0x0000555555d761dc in handler::multi_range_read_next (this=0x7fff7002cdf8, range_info=0x7fffd4b6c950) at /home/psergey/dev-git/10.3-r2/sql/multi_range_read.cc:291
  #11 0x0000555555d763be in Mrr_simple_index_reader::get_next (this=0x7fff7002d3d8, range_info=0x7fffd4b6c950) at /home/psergey/dev-git/10.3-r2/sql/multi_range_read.cc:323
  #12 0x0000555555d7901a in DsMrr_impl::dsmrr_next (this=0x7fff7002d298, range_info=0x7fffd4b6c950) at /home/psergey/dev-git/10.3-r2/sql/multi_range_read.cc:1399
  #13 0x00007ffff4d30b56 in ha_tokudb::multi_range_read_next (this=0x7fff7002cdf8, range_info=0x7fffd4b6c950) at /home/psergey/dev-git/10.3-r2/storage/tokudb/ha_tokudb_mrr_maria.cc:42
  #14 0x000055555601f3a2 in QUICK_RANGE_SELECT::get_next (this=0x7fff7002f800) at /home/psergey/dev-git/10.3-r2/sql/opt_range.cc:11454
  #15 0x0000555556030e64 in rr_quick (info=0x7fff700162b0) at /home/psergey/dev-git/10.3-r2/sql/records.cc:366
  #16 0x0000555555b3b03b in READ_RECORD::read_record (this=0x7fff700162b0) at /home/psergey/dev-git/10.3-r2/sql/records.h:73
  #17 0x0000555555c3e4a4 in join_init_read_record (tab=0x7fff700161e8) at /home/psergey/dev-git/10.3-r2/sql/sql_select.cc:20227
  #18 0x0000555555c3c256 in sub_select (join=0x7fff700145b0, join_tab=0x7fff700161e8, end_of_records=false) at /home/psergey/dev-git/10.3-r2/sql/sql_select.cc:19301
  #19 0x0000555555c3b821 in do_select (join=0x7fff700145b0, procedure=0x0) at /home/psergey/dev-git/10.3-r2/sql/sql_select.cc:18844

Comment by Sergei Petrunia [ 2018-07-24 ]

Data collected so far:
https://gist.github.com/spetrunia/c75b34d70aaea3b927e478557ff89ab5

Comment by Sergei Petrunia [ 2018-08-21 ]

The MDEV text now has a description of how to use the range locker from TokuDB.

Other input: there is a big concern about regressions with respect to the current way of doing locking. Most likely, we will need to support both the current locking mode (where gap locking is not available for any transaction) and the range locking mode (where some transactions may take range locks in some circumstances, while others take row locks; both kinds of locks conflict with each other).

Comment by Sergei Petrunia [ 2018-09-03 ]

Current locking code does "Snapshot Checking" (See PessimisticTransaction::ValidateSnapshot):

When acquiring a point lock on $ROW_KEY, a transaction will check whether there were any changes made to $ROW_KEY after the transaction's snapshot was taken.

This apparently cannot be efficiently done for range locks.

But it seems to be also unnecessary. Here's why:

Snapshot checking (ValidateSnapshot) is needed to prevent situations like this:

trx1> start; allocate a snapshot 
 
trx2> update value for $ROW_KEY_1; commit;
 
trx1> update value for $ROW_KEY_1;   -- note that we are using a snapshot and
                                     -- dont see trx2's changes
 
trx1> commit; -- this overwrites changes by trx2.

That is, this is an "optimistic-like" method to make sure that transaction's snapshot has not been "made obsolete" by some other transaction.

With Range Locking,

  • We can't have "ValidateSnapshot for ranges"
  • but we place locks on all records we read.

Range locks would not prevent the above scenario between trx1 and trx2, as trx2 updates $ROW_KEY_1 before trx1 attempts to read it.

However, when transactions use locking, we can assume that trx1 "happened after" trx2 has committed. (The only thing that would prevent this assumption would be that trx1 has read a value that trx2 is modifying. But in that case, trx1 would have held a read lock that would have prevented trx2 from making the modification).

The only issue here is that trx1 must not use a snapshot that was created before trx2 has committed.

To sum up: RangeLockingForReads

  • Does not need to use ValidateSnapshot
  • But must not use the snapshot from the beginning of the transaction. (That is, if we are reading data using snapshot S, then S must have been acquired
    after we have obtained a lock covering the rowkey we are reading. This is our guarantee that nobody has sneaked in an update).

If we are holding all locks for the duration of the transaction, there is no problem with reading inconsistent data (the data will be the same as if we've used the snapshot made after the most-recently-modified row we've read)

Comment by Maysam Yabandeh [ 2018-09-07 ]

1. If there is no snapshot taken before acquiring the lock, then even the existing code would not call ValidateSnapshot: https://github.com/facebook/rocksdb/blob/ea212e531696cab9cc8c2c3da49119b7888402ef/utilities/transactions/pessimistic_transaction.cc#L535
2. MyRocks does allow transactions to explicitly take a snapshot at the very beginning, before any reads start. What happens to those cases?

Comment by Maysam Yabandeh [ 2018-09-07 ]

If the transaction has already taken a snapshot at the beginning, perhaps we can get the implementation to guarantee that it would never call ::Get before RangeLockingForReads, and then upgrade the snapshot after the last call to RangeLockingForReads. This would be as if we had delayed the transaction's request to take the snapshot.

The problem with this approach would be losing linearizability: if, for two transactions, the client makes connections between their inputs/outputs outside the SQL engine, then it might get inconsistent results, as we did not actually take the snapshot at the wall-clock time at which we confirmed to the client that we did. For example, in this sequence of events, running from the same client session:

K1=V1
txn B starts
txn B take snapshot
 
txn A writes VA to K1
txn A commits
 
txn B reads K1

The client expects Txn B to read V1 but we return VA. I think it should be fine since our supported isolation level is not linearizable anyway (it is not even serializable).

Comment by Sergei Petrunia [ 2018-09-10 ]

I've put up a tree here: https://github.com/spetrunia/mysql-5.6/tree/range-locking-fb-mysql-5.6.35

Current status:

  • MyRocks has a read-only global variable @@rocksdb_use_range_locking which one can set in my.cnf
  • In addition to class TransactionLockMgr, RocksDB (a modified copy of it) includes another class which uses PerconaFT's locktree to provide locks.
  • Currently, it only does point, write-only locks.
  • The state is: it compiles and works on a basic example. Lots of details are still missing and, in particular, the APIs are not final.
Comment by Sergei Petrunia [ 2018-09-10 ]

1. If there is no snapshot taken before acquiring the lock, then even the existing code would not call ValidateSnapshot: https://github.com/facebook/rocksdb/blob/ea212e531696cab9cc8c2c3da49119b7888402ef/utilities/transactions/pessimistic_transaction.cc#L535

I am not sure when that happens (IIRC, in MyRocks a transaction would normally create/use a snapshot before it has written any data). Will check.

Comment by Sergei Petrunia [ 2018-09-12 ]

I took the current patch (it uses locktree to do point locks, all locks are
exclusive write locks under the hood, etc) and ran a benchmark.

The benchmark compares the performance of the current locking system against the new one, with a varying number of client connections.

sysbench ... --time=60 /usr/share/sysbench/oltp_write_only.lua  
--table-size=1000000 --mysql_storage_engine=RocksDB --threads=$n run

The results are:

n_threads	current_locking_tps	new_locking_tps	new_to_current_ratio
1	433.7	417.64	0.963
2	585.28	553.67	0.946
5	1358.33	1340.1	0.987
10	2435.65	2423.49	0.995
20	3968.21	3806.98	0.959
40	5306.06	4975.17	0.938
60	5913.78	5256.03	0.889
80	6122.57	5607.66	0.916
100	6280.9	5736.32	0.913
120	6423.71	5631.45	0.877

Plotting this, and plotting the slowdown ratio (see the attached screenshots).

Comment by Sergei Petrunia [ 2018-09-12 ]

So

  • The difference is clearly visible
  • New locking is slower, the difference is growing as the number of threads grows.
  • Maybe it's because read locks are made write locks under the hood? (This can be checked by forcing the "old" locking to always use write locks.)
Comment by Sergei Petrunia [ 2018-09-20 ]

pt-table-checksum works as follows:

The table is broken into chunks. Then, for each chunk, the master computes the checksum like so:

REPLACE INTO 
  percona.checksums(
    db, tbl, chunk, 
    chunk_index, lower_boundary, upper_boundary, 
    this_cnt, this_crc)
SELECT 
  'test', 't10', '48', 
  'PRIMARY', '950358', '972636', -- boundaries
  COUNT(*) AS cnt,
  ... , --  here is a long expression to compute the row checksum
FROM 
  test.t10 FORCE INDEX(PRIMARY)
WHERE 
  ((pk >= '950358')) AND ((pk <= '972636')) /*checksum chunk*/

This statement is replicated to the slave using SBR. That is, the slave will run it too, and compute the checksum of the data on the slave.

Then, the master reads the checksum data:

SELECT this_crc, this_cnt 
FROM percona.checksums 
WHERE db = 'test' AND tbl = 't10' AND chunk = '48';

And saves it in master_crc column:

UPDATE percona.checksums 
SET 
  chunk_time = '0.455180', 
  master_crc = '691e28bc', 
  master_cnt = '22279' 
WHERE 
  db = 'test' AND tbl = 't10' AND chunk = '48'

This way, on the slave we will get:

  • master_crc - the CRC value from the master
  • this_crc - the CRC value computed locally on the slave.

The need for Gap Locking comes from Statement replication of REPLACE INTO ... SELECT. When executed on the slave, it should read the same data as it did on the master. For that, execution of REPLACE INTO ... SELECT FROM t1 on the master must prevent any concurrent transaction from making modifications to t1 and committing.

pt-table-checksum code also has "LOCK IN SHARE MODE" query inside but it does not seem to be used.

Comment by Sergei Petrunia [ 2018-10-15 ]

InnoDB's equivalent of Snapshot Checking

InnoDB also uses multi-versioning and locking for intended writes. It doesn't do SnapshotChecking, so it faces a similar problem with overwriting the changes that were made after the transaction's snapshot was taken but before the lock was acquired:

1. trx1> start; allocate a snapshot 
 
2. trx2> update value for $ROW_KEY_1; commit;
 
3. trx1> update value for $ROW_KEY_1;
 
4. trx1> commit;

InnoDB solves this by having DML statements read the latest committed data instead of the snapshot.

This does look like a READ-COMMITTED isolation level:

trx1> begin;
trx1> select * from t1 where pk=3;
+----+------+
| pk | a    |
+----+------+
|  3 |    3 |
+----+------+

trx2> update t1 set a=33 where pk=3; -- autocommit=1 here

Transaction trx1 is reading from the snapshot:

trx1> select * from t1 where pk=3;
+----+------+
| pk | a    |
+----+------+
|  3 |    3 |
+----+------+

unless it's a FOR UPDATE (or DML) which will see the latest committed data:

trx1> select * from t1 where pk=3 for update;
+----+------+
| pk | a    |
+----+------+
|  3 |   33 |
+----+------+

Regardless of that, further SELECTs will continue to read from the snapshot:

trx1> select * from t1 where pk=3;
+----+------+
| pk | a    |
+----+------+
|  3 |    3 |
+----+------+

DML will operate on the latest committed data:

trx1> update t1 set a=a+1 where pk=3;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
 
trx1> select * from t1 where pk=3;
+----+------+
| pk | a    |
+----+------+
|  3 |   34 |
+----+------+

This behavior "breaks" the promise of REPEATABLE-READ on the master, but in return, the statement will have the same effect when it is run on the slave.

Use in Range Locking in MyRocks

Range Locking mode in MyRocks can use this approach too:

  • DML statements and SELECT FOR UPDATE/LOCK IN SHARE MODE should read the latest committed data (this includes the unique key checks they do)
  • No Snapshot Checking is necessary.
  • Regular SELECTs should still read from the snapshot (This should happen even if the transaction is already holding a lock on the row. Even in this case, regular SELECT may return an out-of-date version of the row).
Comment by Sergei Petrunia [ 2018-11-30 ]

Currently failing tests:

rocksdb.rqg_transactions
rocksdb.rocksdb_deadlock_stress_rc
rocksdb.rocksdb_deadlock_stress_rr
rocksdb.deadlock_stats
rocksdb.compact_deletes
rocksdb.rocksdb_deadlock_detect_rc
rocksdb.deadlock
rocksdb.deadlock_tracking
rocksdb.gap_lock_raise_error
rocksdb.i_s_deadlock
rocksdb.rocksdb_deadlock_detect_rr

rocksdb.rqg_transactions 'range_locking'

  • Assertion failure in toku::treenode::remove

rocksdb.compact_deletes 'range_locking'

  • Timed out; it was just hanging with no user activity??

rocksdb.rocksdb_deadlock_detect_rc 'range_locking'

  • Lock wait timeout error

rocksdb.rocksdb_deadlock_stress_rc 'range_locking'
rocksdb.rocksdb_deadlock_stress_rr 'range_locking'

  • Lock wait timeout error

rocksdb.deadlock 'range_locking'

  • 900 sec. timeout, several threads waiting for a lock

rocksdb.deadlock_stats 'range_locking' - "mysqltest got signal 6" - a crash on the client??

  • still, the test seems to use deadlock detector.

rocksdb.deadlock_tracking 'range_locking'

  • Lock wait timeout error.

rocksdb.gap_lock_raise_error 'range_locking'

  • Lock wait timeout error.

rocksdb.i_s_deadlock 'range_locking'

  • Lock wait timeout error.

rocksdb.rocksdb_deadlock_detect_rr 'range_locking'

  • Lock wait timeout error.
Comment by Sergei Petrunia [ 2019-01-08 ]

Currently, the tests pass.
The rocksdb test suite now has three "combinations" - write_prepared, write_committed, and range_locking.
Tests that assume point locking are disabled in 'range_locking' mode.
There are also tests that specifically target range locking.

Comment by Sergei Petrunia [ 2019-01-08 ]

Remaining issues:

  • Reduce transaction's list of acquired locks to reflect the actions of lock escalation.
  • Turn off snapshot validation.
Comment by Sergei Petrunia [ 2019-01-28 ]

Now the above is done and there are no known Gap-Lock-related test failures in the rocksdb test suite.

  • Also did some code cleanup in preparation for a pull request to RocksDB, but more cleanups will be needed.
Comment by Sergei Petrunia [ 2019-01-28 ]

Also did a basic benchmark: ran sysbench oltp_read_write.lua for:

  • rocksdb_use_range_locking=1
  • rocksdb_use_range_locking=0
  • the original tree that range locking patch is currently based on.

SYSBENCH_BASE_ARGS=" --db-driver=mysql --mysql-host=127.0.0.1 --mysql-user=root \
  --time=60 \
  /usr/share/sysbench/oltp_read_write.lua --table-size=1000000"
SYSBENCH_CUR_ARGS="$SYSBENCH_BASE_ARGS --mysql_storage_engine=RocksDB"
sysbench $SYSBENCH_CUR_ARGS prepare;
 
  for threads in 1 10 20 40 ; do
    SYSBENCH_ALL_ARGS="$SYSBENCH_CUR_ARGS --threads=$threads"
    sysbench $SYSBENCH_ALL_ARGS run
  done

Results:

rangelocking=ON 
1 307.74
10 1576.26
20 1819.30 
40 1640.48 

rangelocking=OFF
1 307.58
10 1579.74
20 1838.34
40 1620.53

rangelocking-orig
1 306.23
10 1565.10
20 1811.46
40 1611.57

Comment by Sergei Petrunia [ 2019-01-28 ]

In tabular form

	rangelocking=ON	rangelocking=OFF	rangelocking-orig
1	307.74	307.58	306.23
10	1576.26	1579.74	1565.1
20	1819.3	1838.34	1811.46
40	1640.48	1620.53	1611.57

Comment by Sergei Petrunia [ 2019-04-30 ]

The pull request is at https://github.com/facebook/rocksdb/pull/5041

Comment by Sergei Petrunia [ 2019-11-18 ]

Got a question about refreshing the iterator.

Consider a query:

update t1 set col1=col1+1000 where (pk between 3 and 7) or (pk between 10 and 15);

Suppose the range locking is ON, the table has `PRIMARY KEY(pk)`, and the query is using the PK.

It will do this:

  trx->get_range_lock([3; 7]);
  iter = trx->get_iterator(); // (1)
  // Use the iter to read the latest committed rows in the [3..7] range 
  // (2)
 
  trx->get_range_lock([10; 15]);  // (3)

Now, the iterator we created at point (1) is reading the snapshot of data taken at that moment.

We need to read the latest-committed data (to be precise, we need to see everything that was committed into the [10..15] range before the get_range_lock call marked with (3) was run).

We should call this:

  iter->Refresh();

But for me the iterator is `rocksdb::BaseDeltaIterator`, which doesn't override Refresh(), so it uses rocksdb::Iterator::Refresh, which is this:

  virtual Status Refresh() {
    return Status::NotSupported("Refresh() is not supported");
  }

Does this mean

  • The iterator I've got will return the latest data (and NOT the snapshot at the time the iterator was created, at (1)),
    or
  • The iterator I've got doesn't support Refresh(), so I should destroy and re-create it?
Comment by Sergei Petrunia [ 2019-12-02 ]

An MTR testcase for iterator refresh:
https://gist.github.com/spetrunia/7ead10923d40bf2d9baa960740733945

Result of it:
https://gist.github.com/spetrunia/915cdeeb033251a288ec88509bb04582#file-range-locking-iterator-refresh-result-sql-L22

It shows that the iterator sees the row that has been deleted. When it attempts to read the row, we get the "Got error 1 'NotFound:'" error.

Now, let's remove the DELETE statement from the testcase:
https://gist.github.com/spetrunia/ac3392e8279007eb15411872cbc43241
the output: https://gist.github.com/spetrunia/33ce1b208109c8b0331fc54768de58ec

30 5000

The INSERT'ed row was not updated, so it was not visible to the iterator.

For the updated rows, the result looks as if the iterator saw the latest?

40 5100
41 5100
42 5100
43 5100
44 5100
45 5100

(or is this the result of extra GetForUpdate calls?)

Comment by Sergei Petrunia [ 2019-12-05 ]

Ok,

  • the iterator obtained from TransactionDB->NewIterator() has a non-trivial Refresh implementation, ArenaWrappedDBIter::Refresh().
  • the iterator obtained from Transaction->GetIterator() doesn't support refresh. It's a BaseDeltaIterator. It has base_iterator_= ArenaWrappedDBIter, delta_iterator_=WBWIIteratorImpl.
Generated at Thu Feb 08 08:22:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.