[MDEV-15603] Gap Lock support in MyRocks Created: 2018-03-20 Updated: 2023-04-28 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - RocksDB |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major |
| Reporter: | Sergei Petrunia | Assignee: | Sergei Petrunia |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | None | ||
| Description |
|
(The upstream task is: https://github.com/facebook/mysql-5.6/issues/800 )

Notes about how to use PerconaFT:

1. Data structures

1.1 A Global Lock Tree Manager object
There needs to be a global locktree_manager. See PerconaFT/src/ydb-internal.h.

1.2 A separate Lock Tree for each table
TokuDB uses a separate Lock Tree for each table: db->i->lt.

1.3 Each transaction keeps track of the ranges it is holding locks on
Each transaction has a list of ranges that it is holding locks on. It is referred to like so
and is stored in this structure, together with a mutex to protect it:
The mutex is there because the list may be modified by the lock escalation process (which may be invoked from a different thread).

2. Functions

Most of the functions mentioned here are from storage/tokudb/PerconaFT/src/ (ydb_txn.cc, ydb_row_lock.cc) - this is TokuDB's layer above the Lock Tree.

2.1 Initializing the Lock Manager
TODO

2.2 Creating a Lock Tree for a table
TokuDB does this when it opens a table's table_share. It is done like so:
Then, one needs to release it:
After the last release_lt call, the Lock Tree will be deleted (it is guaranteed to be empty at that point). (TODO: this is easy to arrange if Toku locks are invoked from the MyRocks level. But if they are invoked from inside RocksDB, this is harder, as RocksDB doesn't have any concept of tables or indexes. For a start, we can pretend all keys are in one table.)

2.3 Getting a lock
This function has an example:
It is also possible to start an asynchronous lock request and then wait for it (see toku_db_start_range_lock, toku_db_wait_range_lock); we don't seem to have a use for this. Point locks are obtained by passing the same key as left_key and right_key.

2.4 Releasing a lock
TokuDB doesn't seem to release individual locks (all locks are held until the transaction either commits or aborts). The Lock Tree has a function to release locks from a specified range:
Besides calling that, one will need to

2.5 Releasing all of the transaction's locks
See PerconaFT/src/ydb_txn.cc:
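The scheme described above (a global manager owning one lock tree per table, each transaction remembering the ranges it has locked, and everything released at commit/abort) can be sketched as a toy model. All names here (RangeLockManager, LockTree, Range) are illustrative; this is not the PerconaFT API, whose get_lt/release_lt calls, fair-queueing and lock escalation are far more involved:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Range { std::string left, right; };   // inclusive [left, right]

// One lock tree per table. A range lock is granted unless it overlaps a
// range held by another transaction; all locks go away at commit/abort.
class LockTree {
 public:
  bool acquire(int txn_id, const Range& r) {
    for (const auto& owned : locks_)
      if (owned.first != txn_id && overlaps(owned.second, r)) return false;
    locks_.push_back({txn_id, r});
    return true;
  }
  void release_all(int txn_id) {           // called at commit or abort
    std::vector<std::pair<int, Range>> kept;
    for (auto& p : locks_) if (p.first != txn_id) kept.push_back(p);
    locks_ = kept;
  }
  bool empty() const { return locks_.empty(); }
 private:
  static bool overlaps(const Range& a, const Range& b) {
    return !(a.right < b.left || b.right < a.left);
  }
  std::vector<std::pair<int, Range>> locks_;
};

// Global manager: one lock tree per table, created on first use
// (TokuDB creates it when the table_share is opened).
class RangeLockManager {
 public:
  LockTree* get_lt(const std::string& table) { return &trees_[table]; }
 private:
  std::map<std::string, LockTree> trees_;
};
```

A point lock is just a range whose left and right keys are equal, matching the left_key == right_key convention described in 2.3.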
|
| Comments |
| Comment by Sergei Petrunia [ 2018-03-23 ] | |||||||||||||||||||||||||||||||||||||||||||
|
TokuDB's lock tree is here: storage/tokudb/PerconaFT/locktree. They lock
| Comment by Sergei Petrunia [ 2018-07-24 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Data collected so far:
| Comment by Sergei Petrunia [ 2018-08-21 ] | |||||||||||||||||||||||||||||||||||||||||||
|
The MDEV text now has a description of how to use the range locker from TokuDB. Other input: there is a big concern about regressions w.r.t. the current way of doing locking. Most likely, we will need to support both the current locking mode (where gap locking is not available to any transaction) and the range locking mode (where some transactions may take range locks in some circumstances, while others take row locks; both kinds of locks conflict with each other).
| Comment by Sergei Petrunia [ 2018-09-03 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Current locking code does "Snapshot Checking" (See PessimisticTransaction::ValidateSnapshot):
This apparently cannot be done efficiently for range locks. But it also seems to be unnecessary. Here's why: snapshot checking (ValidateSnapshot) is needed to prevent situations like this:
That is, this is an "optimistic-like" method to make sure that transaction's snapshot has not been "made obsolete" by some other transaction. With Range Locking,
Range locks would not prevent the above scenario between trx1 and trx2, as trx2 updates $ROW_KEY_1 before trx1 attempts to read it. However, when transactions use locking, we can assume that trx1 "happened after" trx2 committed. (The only thing that could invalidate this assumption would be trx1 having read a value that trx2 is modifying. But in that case, trx1 would have held a read lock that would have prevented trx2 from making the modification.) The only requirement is that trx1 must not use a snapshot that was created before trx2 committed. To sum up: RangeLockingForReads
If we are holding all locks for the duration of the transaction, there is no problem with reading inconsistent data (the data will be the same as if we had used a snapshot taken after the most recently modified row we've read).
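The check that range locking makes unnecessary is, at its core, a comparison of a key's last commit sequence number against the transaction's snapshot sequence number. A minimal model of that comparison (illustrative names; not RocksDB's actual PessimisticTransaction::ValidateSnapshot code):

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model of snapshot checking: a write to `key` is rejected if the key
// was committed after the transaction's snapshot sequence number.
struct Db {
  std::map<std::string, uint64_t> last_commit_seq;  // key -> commit sequence
  uint64_t cur_seq = 0;

  void commit_write(const std::string& key) { last_commit_seq[key] = ++cur_seq; }

  // ValidateSnapshot analogue: true iff `key` was not changed after `snap_seq`.
  bool validate_snapshot(const std::string& key, uint64_t snap_seq) const {
    auto it = last_commit_seq.find(key);
    return it == last_commit_seq.end() || it->second <= snap_seq;
  }
};
```

Under range locking the same guarantee comes for free: trx1 takes (or advances to) its snapshot only after the lock is acquired, at which point no newer committed change to the locked range can exist.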
| Comment by Maysam Yabandeh [ 2018-09-07 ] | |||||||||||||||||||||||||||||||||||||||||||
|
1. If there is no snapshot taken before acquiring the lock, then even the existing code would not call ValidateSnapshot: https://github.com/facebook/rocksdb/blob/ea212e531696cab9cc8c2c3da49119b7888402ef/utilities/transactions/pessimistic_transaction.cc#L535
| Comment by Maysam Yabandeh [ 2018-09-07 ] | |||||||||||||||||||||||||||||||||||||||||||
|
If the transaction has already taken a snapshot at the beginning, perhaps we can get the implementation to guarantee that it would never call ::Get before RangeLockingForReads, and then upgrade the snapshot after the last call to RangeLockingForReads. This would be as if we had delayed the transaction's request to take the snapshot. The problem with this approach would be losing linearizability: if the client makes connections between the two transactions' inputs/outputs outside the SQL engine, then it might get inconsistent results, because we did not actually take the snapshot at the wall-clock time at which we confirmed to the client that we did. For example, in this sequence of events running from the same client session:
The client expects Txn B to read V1 but we return VA. I think it should be fine, since our supported isolation level is not linearizable anyway (it is not even serializable).
| Comment by Sergei Petrunia [ 2018-09-10 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I've put up a tree here: https://github.com/spetrunia/mysql-5.6/tree/range-locking-fb-mysql-5.6.35
Current status:
| Comment by Sergei Petrunia [ 2018-09-10 ] | |||||||||||||||||||||||||||||||||||||||||||
I am not sure when that happens (IIRC in MyRocks, a transaction would normally create/use a snapshot before it has written any data). Will check.
| Comment by Sergei Petrunia [ 2018-09-12 ] | |||||||||||||||||||||||||||||||||||||||||||
|
I took the current patch (it uses the locktree to do point locks; all locks are held until the transaction ends) and benchmarked it. The benchmark compares the performance of the current locking system with that of the new locking system, with a varying number of client connections.
The results are:
| Comment by Sergei Petrunia [ 2018-09-12 ] | |||||||||||||||||||||||||||||||||||||||||||
|
So
| Comment by Sergei Petrunia [ 2018-09-20 ] | |||||||||||||||||||||||||||||||||||||||||||
|
pt-table-checksum works as follows: The table is broken into chunks. Then, for each chunk, the master computes the checksum like so:
This statement is replicated to the slave using SBR. That is, the slave will run it too, and compute the checksum of the data on the slave. Then, the master reads the checksum data:
And saves it in the master_crc column:
This way, on the slave we will get
The need for Gap Locking comes from statement-based replication of REPLACE INTO ... SELECT. When executed on the slave, it should read the same data as it did on the master. For that, execution of REPLACE INTO ... SELECT FROM t1 on the master must prevent any concurrent transaction from modifying t1 and committing. The pt-table-checksum code also has a "LOCK IN SHARE MODE" query inside, but it does not seem to be used.
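The chunk-checksum idea can be sketched as follows. The checksum function here is a hypothetical polynomial hash chosen for brevity (pt-table-checksum itself aggregates CRC32/MD5 values in SQL); the point is only that master and slave agree iff the key range holds identical rows:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using Table = std::map<int, std::string>;  // pk -> row data

// Checksum all rows with lo <= pk <= hi. Master and slave each run this on
// their own copy; mismatching results mean the chunk diverged.
uint64_t chunk_checksum(const Table& t, int lo, int hi) {
  uint64_t sum = 0;
  for (auto it = t.lower_bound(lo); it != t.end() && it->first <= hi; ++it) {
    sum = sum * 131 + static_cast<uint64_t>(it->first);
    for (char c : it->second) sum = sum * 131 + static_cast<unsigned char>(c);
  }
  return sum;
}
```

This also shows why the master's read of the chunk must be protected by a range lock: if a concurrent transaction could commit into [lo, hi] between the master's read and the statement's replication, the slave would checksum different data even though nothing is wrong.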
| Comment by Sergei Petrunia [ 2018-10-15 ] | |||||||||||||||||||||||||||||||||||||||||||
InnoDB's equivalent of Snapshot Checking

InnoDB also uses multi-versioning, plus locking for intended writes. It doesn't do Snapshot Checking, so it faces a similar problem with overwriting changes that were made after the transaction's snapshot was taken but before the lock was acquired:
InnoDB solves this by having DML statements read the latest committed data instead of the transaction's snapshot. This does look like the READ-COMMITTED isolation level:
Transaction trx1 is reading from the snapshot:
unless it is a FOR UPDATE (or DML) statement, which will see the latest committed data:
Regardless of that, further SELECTs will continue to read from the snapshot:
DML will operate on the latest committed data:
This behavior "breaks" the promise of REPEATABLE-READ on the master, but in return the statement will have the same effect when it is run on the slave.

Use in Range Locking in MyRocks

Range Locking mode in MyRocks can use this approach too:
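The split described above (plain SELECTs read the snapshot, locking reads and DML read the latest committed version) can be modeled with a tiny multi-version store. This is an illustrative sketch, not InnoDB's or MyRocks' implementation:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy multi-version store: each key maps to a list of (commit_seq, value).
struct VersionedStore {
  std::map<std::string, std::vector<std::pair<uint64_t, std::string>>> data;
  uint64_t cur_seq = 0;

  void commit(const std::string& k, const std::string& v) {
    data[k].push_back({++cur_seq, v});
  }
  // Plain read: newest version visible in the snapshot.
  std::string snapshot_read(const std::string& k, uint64_t snap) const {
    std::string res;
    for (auto& ver : data.at(k)) if (ver.first <= snap) res = ver.second;
    return res;
  }
  // Locking read (FOR UPDATE / DML): latest committed version.
  std::string locking_read(const std::string& k) const {
    return data.at(k).back().second;
  }
};
```

Note how a transaction holding an old snapshot gets different answers from the two read paths; that is exactly the REPEATABLE-READ "break" the comment describes.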
| Comment by Sergei Petrunia [ 2018-11-30 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Currently failing tests:
rocksdb.rqg_transactions 'range_locking'
rocksdb.compact_deletes 'range_locking'
rocksdb.rocksdb_deadlock_detect_rc 'range_locking'
rocksdb.rocksdb_deadlock_stress_rc 'range_locking'
rocksdb.deadlock 'range_locking'
rocksdb.deadlock_stats 'range_locking' - "mysqltest got signal 6" - crash on the client??
rocksdb.deadlock_tracking 'range_locking'
rocksdb.gap_lock_raise_error 'range_locking'
rocksdb.i_s_deadlock 'range_locking'
rocksdb.rocksdb_deadlock_detect_rr 'range_locking'
| Comment by Sergei Petrunia [ 2019-01-08 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Currently, the tests pass.
| Comment by Sergei Petrunia [ 2019-01-08 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Remaining issues:
| Comment by Sergei Petrunia [ 2019-01-28 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Now the above is done and there are no known Gap-Lock-related test failures in the rocksdb test suite.
| Comment by Sergei Petrunia [ 2019-01-28 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Also did a basic benchmark: ran sysbench oltp_read_write.lua for:
Results:
| Comment by Sergei Petrunia [ 2019-01-28 ] | |||||||||||||||||||||||||||||||||||||||||||
|
In tabular form
| Comment by Sergei Petrunia [ 2019-04-30 ] | |||||||||||||||||||||||||||||||||||||||||||
|
The pull request is at https://github.com/facebook/rocksdb/pull/5041
| Comment by Sergei Petrunia [ 2019-11-18 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Got a question about refreshing the iterator. Consider a query:
Suppose the range locking is ON, the table has `PRIMARY KEY(pk)`, and the query is using the PK. It will do this:
Now, the iterator we created at point (1) is reading the snapshot of data taken at that moment. We need to read the latest committed data (to be precise, we need to see everything that was committed into the 10..15 range before the get_range_lock call marked with (3) was run). We should call this:
But for me the iterator is `rocksdb::BaseDeltaIterator`, which doesn't override Refresh(), so it uses rocksdb::Iterator::Refresh, which is this:
Does this mean
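Assuming Refresh() semantics like rocksdb::Iterator::Refresh() (drop the iterator's old snapshot and see everything committed so far), the lock-then-refresh pattern for the query above can be modeled as follows. All names are illustrative; this is not the BaseDeltaIterator code:

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy store: `committed` is the latest committed data; an Iterator holds a
// snapshot copy taken at creation time.
struct Store {
  std::map<int, std::string> committed;

  struct Iterator {
    std::map<int, std::string> snap;   // data as of iterator creation
    const Store* store;
    // Mimics Iterator::Refresh(): discard the old snapshot, re-read latest.
    void Refresh() { snap = store->committed; }
    bool has(int k) const { return snap.count(k) != 0; }
  };
  Iterator new_iterator() const { return Iterator{committed, this}; }
};
```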
| Comment by Sergei Petrunia [ 2019-12-02 ] | |||||||||||||||||||||||||||||||||||||||||||
|
An MTR testcase for iterator refresh: Result of it: It shows that the iterator sees the row that has been deleted. When it attempts to read the row, we get the "Got error 1 'NotFound:'" error. Now, let's remove the DELETE statement from the testcase:
The INSERT'ed row was not updated, so it was not visible to the iterator. For the updated rows, the result looks as if the iterator saw the latest data?
(or is this the result of extra GetForUpdate calls?)
| Comment by Sergei Petrunia [ 2019-12-05 ] | |||||||||||||||||||||||||||||||||||||||||||
|
Ok,
|