[MDEV-15603] Gap Lock support in MyRocks - Jira

Details

Type: Task
Status: Stalled (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Storage Engine - RocksDB
Labels: None

Description

(The upstream task is: https://github.com/facebook/mysql-5.6/issues/800 )

Notes about how to use PerconaFT:

1. Data structures
1.1 A Global Lock Tree Manager object
1.2 A separate Lock Tree for each table
1.3 Each transaction keeps track of the ranges it holds locks on
2. Functions
2.1 Initializing the Lock Manager
2.2 Create Lock Tree for a table
2.3 Getting a lock
2.4 Releasing a lock.
2.5 Releasing all of the transaction's locks

1. Data structures

1.1 A Global Lock Tree Manager object

There needs to be a global locktree_manager.

See PerconaFT/src/ydb-internal.h:

  struct __toku_db_env_internal {
    // ...
    toku::locktree_manager ltm;
    // ...
  };

1.2 A separate Lock Tree for each table

TokuDB uses a separate Lock Tree for each table (db->i->lt).

1.3 Each transaction keeps track of the ranges it holds locks on

Each transaction has a list of the ranges it holds locks on. The list is accessed like so:

  db_txn_struct_i(txn)->lt_map

and is stored in this structure, together with a mutex to protect it:

  struct __toku_db_txn_internal {
      // ...
      // maps a locktree to a buffer of key ranges that are locked.
      // it is protected by the txn_mutex, so hot indexing and a client
      // thread can concurrently operate on this txn.
      toku::omt<txn_lt_key_ranges> lt_map;
      toku_mutex_t txn_mutex;
      // ...
  };

The mutex is there because the list may be modified by the lock escalation process (which may be invoked from a different thread).
(See toku_txn_destroy for how to free this.)
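
The bookkeeping described above can be sketched with standard-library containers. This is only an illustration, not PerconaFT code: std::map and std::mutex stand in for toku::omt and toku_mutex_t, and the names KeyRange, LockTreeId, TxnLockList, add_range and replace_ranges are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; these are not PerconaFT types.
struct KeyRange {
    std::string left;   // inclusive lower bound
    std::string right;  // inclusive upper bound
};

using LockTreeId = int;  // stands in for a pointer to a lock tree

class TxnLockList {
public:
    // Record a range this transaction now holds a lock on.
    void add_range(LockTreeId lt, const KeyRange &r) {
        assert(r.left <= r.right);
        std::lock_guard<std::mutex> guard(mutex_);
        lt_map_[lt].push_back(r);
    }

    // Lock escalation (possibly running on another thread) replaces the
    // ranges recorded for one lock tree with a coarser set.
    void replace_ranges(LockTreeId lt, std::vector<KeyRange> merged) {
        std::lock_guard<std::mutex> guard(mutex_);
        lt_map_[lt] = std::move(merged);
    }

    // Number of ranges currently recorded for the given lock tree.
    std::size_t num_ranges(LockTreeId lt) const {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = lt_map_.find(lt);
        return it == lt_map_.end() ? 0 : it->second.size();
    }

private:
    mutable std::mutex mutex_;  // plays the role of txn_mutex
    std::map<LockTreeId, std::vector<KeyRange>> lt_map_;
};
```

Every access to the map takes the mutex, which is what allows an escalation thread and the client thread to operate on the same transaction.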

2. Functions

Most of the functions mentioned here are in storage/tokudb/PerconaFT/src/ (ydb_txn.cc, ydb_row_lock.cc). This is TokuDB's layer above the Lock Tree.

2.1 Initializing the Lock Manager

TODO

2.2 Create Lock Tree for a table

TokuDB does this when it opens a table's table_share:

        db->i->lt = db->dbenv->i->ltm.get_lt(db->i->dict_id,
                                             toku_ft_get_comparator(db->i->ft_handle),
                                             &on_create_extra);

Then, one needs to release it:

        db->dbenv->i->ltm.release_lt(db->i->lt);

After the last release_lt call, the Lock Tree is deleted (it is guaranteed to be empty).
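
The get_lt/release_lt pairing behaves like reference counting. The sketch below illustrates that idea only; it assumes a manager keyed by dictionary id, and the LockTree and LockTreeManager classes here are hypothetical stand-ins, not the toku::locktree_manager API (the comparator and on_create_extra arguments are left out).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical stand-ins; not the PerconaFT classes.
struct LockTree {
    std::uint64_t dict_id;
    int ref_count;
};

class LockTreeManager {
public:
    // Return the lock tree for dict_id, creating it on first use.
    LockTree *get_lt(std::uint64_t dict_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = trees_.find(dict_id);
        if (it == trees_.end()) {
            it = trees_.emplace(dict_id, LockTree{dict_id, 0}).first;
        }
        it->second.ref_count++;
        return &it->second;  // std::map nodes keep a stable address
    }

    // Drop one reference; the tree is deleted along with the last one.
    void release_lt(LockTree *lt) {
        std::lock_guard<std::mutex> guard(mutex_);
        assert(lt->ref_count > 0);
        if (--lt->ref_count == 0) {
            const std::uint64_t id = lt->dict_id;  // copy before erasing
            trees_.erase(id);
        }
    }

    // Number of lock trees that are currently alive.
    std::size_t num_trees() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return trees_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<std::uint64_t, LockTree> trees_;
};
```

Two opens of the same table share one tree, and the tree disappears only when both have called release_lt.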

(TODO: this is easy to arrange if Toku locks are invoked from the MyRocks level. But if they are invoked from RocksDB, this is harder, as RocksDB doesn't have any concept of tables or indexes. As a start, we can pretend all keys are in one table.)

2.3 Getting a lock

The following function serves as an example:

// Get a range lock.
// Return when the range lock is acquired or the default lock tree timeout has expired.
int toku_db_get_range_lock(DB *db, DB_TXN *txn, const DBT *left_key, const DBT *right_key,
        toku::lock_request::type lock_type) {
    // ...
}

It is also possible to start an asynchronous lock request and then wait for it (see toku_db_start_range_lock, toku_db_wait_range_lock). It seems we don't have a use for this.

Point locks are obtained by passing the same key as left_key and right_key.
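
To illustrate why a point lock is just a range lock with equal endpoints, here is a toy exclusive range lock table. It is a sketch under simplifying assumptions: keys are compared as plain strings rather than through the table's comparator, there are no shared locks, and a conflict returns false instead of waiting for a timeout. ToyLockTree, HeldLock, try_lock_range and try_lock_point are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

using TxnId = unsigned long;

// One granted lock on the inclusive range [left, right].
struct HeldLock {
    TxnId owner;
    std::string left;
    std::string right;
};

class ToyLockTree {
public:
    // Try to lock [left, right] for txn. Returns false if the range
    // overlaps a range held by a different transaction.
    bool try_lock_range(TxnId txn, const std::string &left,
                        const std::string &right) {
        assert(left <= right);
        for (const HeldLock &h : held_) {
            const bool overlaps = !(right < h.left || h.right < left);
            if (overlaps && h.owner != txn) {
                return false;
            }
        }
        held_.push_back(HeldLock{txn, left, right});
        return true;
    }

    // A point lock is the degenerate range [key, key].
    bool try_lock_point(TxnId txn, const std::string &key) {
        return try_lock_range(txn, key, key);
    }

private:
    std::vector<HeldLock> held_;
};
```

For example, once transaction 1 holds ["b", "d"], a point lock on "c" by transaction 2 is refused, while a point lock on "e" is granted.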

2.4 Releasing a lock.

TokuDB doesn't seem to release individual locks (all locks are held until transaction either commits or is aborted).

LockTree has a function to release locks from a specified range:

locktree::release_locks(TXNID txnid, const range_buffer *ranges)

Besides calling that, one will need to:

- Wake up all waiting lock requests; release_locks doesn't wake them up. There is a toku::lock_request::retry_all_lock_requests call which retries all pending requests (which doesn't seem efficient, but maybe it is OK?).
- Remove the released lock from the transaction's list of held locks (db_txn_struct_i(txn)->lt_map). This is not essential, because that list is only used to release the locks when the transaction finishes.
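
The two-step pattern (release, then explicitly retry the waiters) can be sketched as follows. This is a toy model, not the locktree API: ToyLockTable and its methods are hypothetical, release() only removes an exactly matching range, and waiters are simply queued rather than blocked.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using TxnId = unsigned long;

struct Range {
    TxnId owner;
    std::string left;   // inclusive
    std::string right;  // inclusive
};

class ToyLockTable {
public:
    // Grant the lock, or queue the request as pending on conflict.
    bool request(TxnId txn, const std::string &left, const std::string &right) {
        assert(left <= right);
        const Range r{txn, left, right};
        if (conflicts(r)) {
            pending_.push_back(r);
            return false;
        }
        granted_.push_back(r);
        return true;
    }

    // Step 1: drop txn's lock on exactly this range. Waiters are NOT woken.
    void release(TxnId txn, const std::string &left, const std::string &right) {
        granted_.erase(
            std::remove_if(granted_.begin(), granted_.end(),
                           [&](const Range &g) {
                               return g.owner == txn && g.left == left &&
                                      g.right == right;
                           }),
            granted_.end());
    }

    // Step 2: retry every pending request; returns how many were granted.
    std::size_t retry_all() {
        std::vector<Range> still_pending;
        std::size_t granted_now = 0;
        for (const Range &r : pending_) {
            if (conflicts(r)) {
                still_pending.push_back(r);
            } else {
                granted_.push_back(r);
                ++granted_now;
            }
        }
        pending_.swap(still_pending);
        return granted_now;
    }

    std::size_t num_pending() const { return pending_.size(); }

private:
    // True if r overlaps a granted range owned by another transaction.
    bool conflicts(const Range &r) const {
        for (const Range &g : granted_) {
            const bool overlaps = !(r.right < g.left || g.right < r.left);
            if (overlaps && g.owner != r.owner) {
                return true;
            }
        }
        return false;
    }

    std::vector<Range> granted_;
    std::vector<Range> pending_;
};
```

retry_all() walks the whole pending list on every call, which mirrors the efficiency concern raised above: the cost grows with the number of waiters, not with the size of the released range.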

2.5 Releasing all of the transaction's locks

See PerconaFT/src/ydb_txn.cc:

static void toku_txn_release_locks(DB_TXN *txn) {
    // Prevent access to the locktree map while releasing.
    // It is possible for lock escalation to attempt to
    // modify this data structure while the txn commits.
    toku_mutex_lock(&db_txn_struct_i(txn)->txn_mutex);

    size_t num_ranges = db_txn_struct_i(txn)->lt_map.size();
    for (size_t i = 0; i < num_ranges; i++) {
        txn_lt_key_ranges ranges;
        int r = db_txn_struct_i(txn)->lt_map.fetch(i, &ranges);
        invariant_zero(r);
        toku_db_release_lt_key_ranges(txn, &ranges);
    }

    toku_mutex_unlock(&db_txn_struct_i(txn)->txn_mutex);
}
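
The same release-everything loop can be written with standard-library types as a rough, self-contained sketch. ToyLockTree, ToyTxn, record_lock and release_all_locks are hypothetical names, and the toy lock tree only counts held ranges instead of storing them.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

using TxnId = unsigned long;
using KeyRange = std::pair<std::string, std::string>;

// Hypothetical lock tree that only counts the ranges held per transaction.
struct ToyLockTree {
    std::map<TxnId, std::size_t> held;

    void release_locks(TxnId txn, const std::vector<KeyRange> &ranges) {
        assert(held[txn] >= ranges.size());
        held[txn] -= ranges.size();
    }
};

struct ToyTxn {
    explicit ToyTxn(TxnId i) : id(i) {}

    TxnId id;
    std::mutex txn_mutex;  // protects lt_map
    std::map<ToyLockTree *, std::vector<KeyRange>> lt_map;
};

// Remember that txn holds a lock on range r in lock tree lt.
void record_lock(ToyTxn &txn, ToyLockTree &lt, KeyRange r) {
    std::lock_guard<std::mutex> guard(txn.txn_mutex);
    lt.held[txn.id]++;
    txn.lt_map[&lt].push_back(std::move(r));
}

// Release every recorded range, one lock tree at a time, while holding
// the mutex so that a concurrent escalation cannot modify the map.
void release_all_locks(ToyTxn &txn) {
    std::lock_guard<std::mutex> guard(txn.txn_mutex);
    for (auto &entry : txn.lt_map) {
        entry.first->release_locks(txn.id, entry.second);
    }
    txn.lt_map.clear();
}
```

Using std::lock_guard instead of explicit lock/unlock calls is a design choice of the sketch; it releases the mutex on every exit path.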

Attachments

screenshot-1.png  51 kB  2018-09-12 11:47
screenshot-2.png  36 kB  2018-09-12 11:49
screenshot-3.png  22 kB  2019-01-28 21:51

Issue Links

includes

MDEV-17873 MyRocks-Gap-Lock: lock wait doesn't set correct STATE (Open)
MDEV-17874 MyRocks-Gap-Lock: Lock memory overhead (Closed)
MDEV-17887 MyRocks-Gap-Lock: information about current lock waits (Closed)
MDEV-18104 MyRocks-Gap-Lock: range locking bounds are incorrect for multi-part keys (Closed)
MDEV-18227 MyRocks-Gap-Lock: Lock escalation and updates to transaction's list of owned locks (Closed)
MDEV-19451 MyRocks: Range Locking: shared point lock support (Open)
MDEV-19986 MyRocks: Range Locking: SeekForUpdate support (Open)
MDEV-21314 Range Locking: individual rows are locked when scanning PK (Open)

relates to

MDEV-18856 Benchmark range locking (Closed)
MDEV-21574 MyRocks: Range Locking: RCU-based cache for the root node (Open)
MDEV-21186 Benchmark range locking - nov-dec 2019 (Closed)


Activity

Sergei Petrunia added a comment - 2019-01-28 21:51

In tabular form:

      rangelocking=ON  rangelocking=OFF  rangelocking-orig
 1    307.74           307.58            306.23
10    1576.26          1579.74           1565.1
20    1819.3           1838.34           1811.46
40    1640.48          1620.53           1611.57

Sergei Petrunia added a comment - 2019-04-30 18:27

The pull request is at https://github.com/facebook/rocksdb/pull/5041

Sergei Petrunia added a comment - 2019-11-18 10:36

Got a question about refreshing the iterator.

Consider a query:

update t1 set col1=col1+1000 where (pk between 3 and 7) or (pk between 10 and 15);

Suppose the range locking is ON, the table has `PRIMARY KEY(pk)`, and the query is using the PK.

It will do this:

  trx->get_range_lock([3; 7]);
  iter = trx->get_iterator(); // (1)
  // Use the iter to read the latest committed rows in the [3..7] range
  // (2)
  trx->get_range_lock([10; 15]);  // (3)

Now, the iterator we created at point (1) is reading the snapshot of data taken at that moment.

We need to read the latest committed data (to be precise, we need to see everything that was committed into the 10..15 range before the get_range_lock call marked with (3) was run).

We should call this:

  iter->Refresh();

But for me the iterator is `rocksdb::BaseDeltaIterator`, which doesn't override Refresh(), so it uses rocksdb::Iterator::Refresh, which is this:

  virtual Status Refresh() {
    return Status::NotSupported("Refresh() is not supported");
  }

Does this mean:

- the iterator I've got will return the latest data (and NOT the snapshot taken when the iterator was created at (1)),

or

- the iterator I've got doesn't support Refresh(), so I should destroy and re-create it?

Sergei Petrunia added a comment - 2019-12-02 21:08

An MTR testcase for iterator refresh:
https://gist.github.com/spetrunia/7ead10923d40bf2d9baa960740733945

Result of it:
https://gist.github.com/spetrunia/915cdeeb033251a288ec88509bb04582#file-range-locking-iterator-refresh-result-sql-L22

It shows that the iterator sees the row that has been deleted. When it attempts to read the row, we get the `Got error 1 'NotFound:` error.

Now, let's remove the DELETE statement from the testcase:
https://gist.github.com/spetrunia/ac3392e8279007eb15411872cbc43241
the output: https://gist.github.com/spetrunia/33ce1b208109c8b0331fc54768de58ec

30 5000

The INSERT'ed row was not updated, so it was not visible to the iterator.

For the updated rows, the result looks as if the iterator saw the latest?

40 5100
41 5100
42 5100
43 5100
44 5100
45 5100

(or is this the result of extra GetForUpdate calls?)

Sergei Petrunia added a comment - 2019-12-05 17:36

Ok,

- the iterator obtained from TransactionDB->NewIterator() has a non-trivial Refresh implementation, ArenaWrappedDBIter::Refresh();
- the iterator obtained from Transaction->GetIterator() doesn't support Refresh(). It's a BaseDeltaIterator with base_iterator_=ArenaWrappedDBIter and delta_iterator_=WBWIIteratorImpl.
