Details

    Description

      (The upstream task is: https://github.com/facebook/mysql-5.6/issues/800 )

      Notes about how to use PerconaFT:

      1. Data structures
      1.1 A Global Lock Tree Manager object
      1.2 A separate Lock Tree for each table
      1.3 Each transaction keeps track of the ranges it holds locks on
      2. Functions
      2.1 Initializing the Lock Manager
      2.2 Create Lock Tree for a table
      2.3 Getting a lock
      2.4 Releasing a lock
      2.5 Releasing all of the transaction's locks

      1. Data structures

      1.1 A Global Lock Tree Manager object

      There needs to be a global locktree_manager.

      See PerconaFT/src/ydb-internal.h,

        struct __toku_db_env_internal {
          toku::locktree_manager ltm;
      

      1.2 A separate Lock Tree for each table

      TokuDB uses a separate Lock Tree for each table, stored in db->i->lt.

      1.3 Each transaction keeps track of the ranges it holds locks on

      Each transaction has a list of ranges that it is holding locks on. It is accessed like so:

        db_txn_struct_i(txn)->lt_map
      

      and is stored in this structure, together with a mutex to protect it:

        struct __toku_db_txn_internal {
            // maps a locktree to a buffer of key ranges that are locked.
            // it is protected by the txn_mutex, so hot indexing and a client
            // thread can concurrently operate on this txn.
            toku::omt<txn_lt_key_ranges> lt_map;
            toku_mutex_t txn_mutex;
      

      The mutex is needed because the list may be modified by the lock escalation process, which may be invoked from a different thread.
      (See toku_txn_destroy for how to free this.)
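
      To make the role of the mutex concrete, here is a minimal, self-contained sketch of a per-transaction map from lock tree to held ranges. It is not PerconaFT code: TxnLockMap, LockTreeId, KeyRange and all method names are hypothetical, and a std::map plus std::mutex stand in for toku::omt and toku_mutex_t.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these names come from PerconaFT.
using LockTreeId = int;

struct KeyRange {
    std::string left;
    std::string right;
};

class TxnLockMap {
public:
    // Called by the client thread after a lock on `range` was granted.
    void add_range(LockTreeId lt, KeyRange range) {
        assert(range.left <= range.right);
        std::lock_guard<std::mutex> guard(mutex_);
        ranges_[lt].push_back(std::move(range));
    }

    // Called by an escalation thread: replace every range recorded for `lt`
    // with one covering range.
    void escalate(LockTreeId lt, KeyRange merged) {
        std::lock_guard<std::mutex> guard(mutex_);
        ranges_[lt].assign(1, std::move(merged));
    }

    // Number of ranges currently recorded for `lt`.
    std::size_t range_count(LockTreeId lt) const {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = ranges_.find(lt);
        return it == ranges_.end() ? 0 : it->second.size();
    }

private:
    mutable std::mutex mutex_;                            // role of txn_mutex
    std::map<LockTreeId, std::vector<KeyRange>> ranges_;  // role of lt_map
};
```

      Both writers take the same mutex, so an escalation that collapses the list cannot interleave with the client thread appending to it.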

      2. Functions

      Most functions mentioned here are from storage/tokudb/PerconaFT/src/ (ydb_txn.cc, ydb_row_lock.cc). This is TokuDB's layer above the Lock Tree.

      2.1 Initializing the Lock Manager

      TODO

      2.2 Create Lock Tree for a table

      TokuDB does it when it opens a table's table_share. It is done like so:

              db->i->lt = db->dbenv->i->ltm.get_lt(db->i->dict_id,
                                                   toku_ft_get_comparator(db->i->ft_handle),
                                                   &on_create_extra);
      

      Afterwards, the Lock Tree must be released:

              db->dbenv->i->ltm.release_lt(db->i->lt);
      

      After the last release_lt call, the Lock Tree is deleted (it is guaranteed to be empty).
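
      The get_lt/release_lt pairing described above behaves like reference counting. The following is a rough sketch of that idea only. The class and member names are hypothetical, the comparator and on_create_extra arguments are omitted, and the real toku::locktree_manager is more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>

// Hypothetical stand-ins; the real toku::locktree_manager differs.
using DictId = unsigned long;

struct LockTree {
    DictId dict_id;
    int refcount;
};

class LockTreeManager {
public:
    // Return the tree for `dict_id`, creating it on first use.
    LockTree *get_lt(DictId dict_id) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = trees_.find(dict_id);
        if (it == trees_.end()) {
            std::unique_ptr<LockTree> fresh(new LockTree{dict_id, 0});
            it = trees_.emplace(dict_id, std::move(fresh)).first;
        }
        it->second->refcount++;
        return it->second.get();
    }

    // Drop one reference; the tree is destroyed when the last one goes away.
    void release_lt(LockTree *lt) {
        assert(lt != nullptr);
        std::lock_guard<std::mutex> guard(mutex_);
        assert(lt->refcount > 0);
        if (--lt->refcount == 0) {
            const DictId id = lt->dict_id;  // copy before the tree is freed
            trees_.erase(id);
        }
    }

    // Number of live trees, for inspection.
    std::size_t tree_count() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return trees_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<DictId, std::unique_ptr<LockTree>> trees_;
};
```

      Two opens of the same table share one tree, and the tree disappears only after both have called release_lt.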

      (TODO: this is easy to arrange if Toku locks are invoked from the MyRocks level. If they are invoked from RocksDB, it is harder, because RocksDB has no concept of tables or indexes. For a start, we can pretend all keys are in one table.)

      2.3 Getting a lock

      This function serves as an example:

      // Get a range lock.
      // Return when the range lock is acquired or the default lock tree timeout has expired.
      int toku_db_get_range_lock(DB *db, DB_TXN *txn, const DBT *left_key, const DBT *right_key,
              toku::lock_request::type lock_type) {
      

      It is also possible to start an asynchronous lock request and then wait for it (see toku_db_start_range_lock, toku_db_wait_range_lock). We don't seem to have a use for this.

      Point locks are obtained by passing the same key as left_key and right_key.
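
      As an illustration of treating a point lock as the degenerate range [key, key], here is a small sketch of an exclusive-only range lock table. Everything in it is hypothetical (RangeLockTable, HeldLock, try_lock, try_lock_point). It uses a linear scan instead of a tree and returns immediately instead of waiting for a timeout.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch: closed key ranges, exclusive locks only.
struct HeldLock {
    int txn_id;
    std::string left;
    std::string right;
};

class RangeLockTable {
public:
    // Try to lock [left, right] for txn_id.
    // Returns false if the range overlaps a lock held by another transaction.
    bool try_lock(int txn_id, const std::string &left, const std::string &right) {
        assert(left <= right);
        for (const HeldLock &h : held_) {
            const bool overlaps = !(right < h.left || h.right < left);
            if (overlaps && h.txn_id != txn_id) {
                return false;
            }
        }
        held_.push_back(HeldLock{txn_id, left, right});
        return true;
    }

    // A point lock is a range lock whose two endpoints are the same key.
    bool try_lock_point(int txn_id, const std::string &key) {
        return try_lock(txn_id, key, key);
    }

private:
    std::vector<HeldLock> held_;
};
```

      With this shape, no separate code path is needed for single-row locks: the overlap test handles both cases.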

      2.4 Releasing a lock

      TokuDB doesn't seem to release individual locks (all locks are held until transaction either commits or is aborted).

      LockTree has a function to release locks from a specified range:

      locktree::release_locks(TXNID txnid, const range_buffer *ranges)
      

      Besides calling that, one will need to:

      • Wake up all waiting lock requests; release_locks doesn't wake them up. The toku::lock_request::retry_all_lock_requests call retries all pending requests (which doesn't seem efficient, but maybe it is OK?).
      • Remove the released lock from the list of locks the transaction is holding (which is in db_txn_struct_i(txn)->lt_map). This is not actually essential, because that list is only used to release the locks when the transaction finishes.
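
      The first step, release followed by a retry of every waiter, can be sketched as follows. This is a toy model with hypothetical names (LockTableWithWaiters, Range, request, release); it only shows why the retry pass costs work proportional to the number of pending requests. It does not reproduce how toku::lock_request is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: closed key ranges, exclusive locks only.
struct Range {
    int txn_id;
    std::string left;
    std::string right;
};

class LockTableWithWaiters {
public:
    // Grant the lock if possible, otherwise queue the request.
    bool request(const Range &r) {
        if (try_grant(r)) return true;
        pending_.push_back(r);
        return false;
    }

    // Release one held range, then retry every pending request.
    void release(const Range &r) {
        auto it = std::find_if(held_.begin(), held_.end(), [&r](const Range &h) {
            return h.txn_id == r.txn_id && h.left == r.left && h.right == r.right;
        });
        if (it != held_.end()) held_.erase(it);
        retry_all_pending();
    }

    bool holds(int txn_id, const std::string &key) const {
        for (const Range &h : held_) {
            if (h.txn_id == txn_id && h.left <= key && key <= h.right) return true;
        }
        return false;
    }

    std::size_t pending_count() const { return pending_.size(); }

private:
    bool try_grant(const Range &r) {
        assert(r.left <= r.right);
        for (const Range &h : held_) {
            const bool overlaps = !(r.right < h.left || h.right < r.left);
            if (overlaps && h.txn_id != r.txn_id) return false;
        }
        held_.push_back(r);
        return true;
    }

    // Every waiter is re-checked, including those unrelated to the released range.
    void retry_all_pending() {
        std::vector<Range> still_waiting;
        for (const Range &r : pending_) {
            if (!try_grant(r)) still_waiting.push_back(r);
        }
        pending_.swap(still_waiting);
    }

    std::vector<Range> held_;
    std::vector<Range> pending_;
};
```

      A more selective scheme would retry only waiters whose range overlaps the released one; whether that is worth the complexity is the open question raised above.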

      2.5 Releasing all of the transaction's locks

      See PerconaFT/src/ydb_txn.cc:

      static void toku_txn_release_locks(DB_TXN *txn) {
          // Prevent access to the locktree map while releasing.
          // It is possible for lock escalation to attempt to
          // modify this data structure while the txn commits.
          toku_mutex_lock(&db_txn_struct_i(txn)->txn_mutex);
       
          size_t num_ranges = db_txn_struct_i(txn)->lt_map.size();
          for (size_t i = 0; i < num_ranges; i++) {
              txn_lt_key_ranges ranges;
              int r = db_txn_struct_i(txn)->lt_map.fetch(i, &ranges);
              invariant_zero(r);
              toku_db_release_lt_key_ranges(txn, &ranges);
          }
       
          toku_mutex_unlock(&db_txn_struct_i(txn)->txn_mutex);
      }
      


          Activity

            In tabular form

                  rangelocking=ON   rangelocking=OFF   rangelocking-orig
            1     307.74            307.58             306.23
            10    1576.26           1579.74            1565.1
            20    1819.3            1838.34            1811.46
            40    1640.48           1620.53            1611.57
            

            psergei Sergei Petrunia added a comment - The pull request is at https://github.com/facebook/rocksdb/pull/5041

            Got a question about refreshing the iterator.

            Consider a query:

            update t1 set col1=col1+1000 where (pk between 3 and 7) or (pk between 10 and 15);
            

            Suppose the range locking is ON, the table has `PRIMARY KEY(pk)`, and the query is using the PK.

            It will do this:

              trx->get_range_lock([3; 7]);
              iter = trx->get_iterator(); // (1)
              // Use the iter to read the latest committed rows in the [3..7] range
              // (2)
             
              trx->get_range_lock([10; 15]);  // (3)
            

            Now, the iterator we created at point (1) is reading the snapshot of data taken at that moment.

            We need to read the latest committed data. To be precise, we need to see everything that was committed into the 10..15 range before the get_range_lock call marked with (3) was run.

            We should call this:

              iter->Refresh();
            

            In my case, however, the iterator is `rocksdb::BaseDeltaIterator`, which doesn't override Refresh(), so it uses rocksdb::Iterator::Refresh, which is this:

              virtual Status Refresh() {
                return Status::NotSupported("Refresh() is not supported");
              }
            

            Does this mean that

            • the iterator I've got will return the latest data (and NOT the snapshot taken when the iterator was created at point (1)),
              or
            • the iterator I've got doesn't support Refresh(), so I should destroy and re-create it?

            An MTR testcase for iterator refresh:
            https://gist.github.com/spetrunia/7ead10923d40bf2d9baa960740733945

            Result of it:
            https://gist.github.com/spetrunia/915cdeeb033251a288ec88509bb04582#file-range-locking-iterator-refresh-result-sql-L22

            It shows that the iterator sees the row that has been deleted. When it attempts to read the row, we get a "Got error 1 'NotFound:" error.

            Now, let's remove the DELETE statement from the testcase:
            https://gist.github.com/spetrunia/ac3392e8279007eb15411872cbc43241
            The output: https://gist.github.com/spetrunia/33ce1b208109c8b0331fc54768de58ec

            30 5000

            The INSERT'ed row was not updated, so it was not visible to the iterator.

            For the updated rows, the result looks as if the iterator saw the latest data?

            40 5100
            41 5100
            42 5100
            43 5100
            44 5100
            45 5100

            (or is this the result of extra GetForUpdate calls?)


            Ok,

            • The iterator obtained from TransactionDB->NewIterator() has a non-trivial Refresh implementation, ArenaWrappedDBIter::Refresh().
            • The iterator obtained from Transaction->GetIterator() doesn't support Refresh(). It is a BaseDeltaIterator with base_iterator_=ArenaWrappedDBIter and delta_iterator_=WBWIIteratorImpl.
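
            Given these two findings, one possible way to cope is to try Refresh() and fall back to destroying and re-creating the iterator when it reports NotSupported. The sketch below uses mock types only (Status, Iter, RefreshableIter and refresh_or_recreate are hypothetical, loosely modeled on the calls quoted above) and assumes that a newly created iterator reads data as of its creation time, which this thread has not confirmed.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Mock types; these are NOT the RocksDB classes.
struct Status {
    enum Code { kOk, kNotSupported } code;
    bool ok() const { return code == kOk; }
    bool IsNotSupported() const { return code == kNotSupported; }
};

class Iter {
public:
    virtual ~Iter() = default;
    // Default mirrors the base-class behaviour quoted above.
    virtual Status Refresh() { return Status{Status::kNotSupported}; }
};

class RefreshableIter : public Iter {
public:
    Status Refresh() override {
        refreshed = true;
        return Status{Status::kOk};
    }
    bool refreshed = false;
};

// Refresh in place when supported; otherwise replace the iterator with a
// new one produced by make_iter. Returns true if the iterator was re-created.
bool refresh_or_recreate(std::unique_ptr<Iter> &iter,
                         const std::function<std::unique_ptr<Iter>()> &make_iter) {
    if (iter) {
        const Status s = iter->Refresh();
        if (s.ok()) return false;
        assert(s.IsNotSupported());  // the mock has no other failure mode
    }
    iter = make_iter();  // the old iterator is destroyed here
    return true;
}
```

            In the UPDATE example, such a call would go right after the get_range_lock call marked (3) and before positioning the iterator inside the 10..15 range.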

            People

              psergei Sergei Petrunia
              psergei Sergei Petrunia