Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-19670

MyRocks: key lookups into deleted data are very slow

Details

    Description

      This is based on a report from a user.
      Index lookups become very slow after a bulk delete operation.

      Steps to reproduce with a totally synthetic dataset:

      create table ten(a int);
      insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
       
      create table one_k(a int);
      insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
      

      create table t1 (
        pk bigint primary key,
        a bigint,
        filler varchar(32),
        key(a)
      )engine=rocksdb;
       
      set rocksdb_max_row_locks=1000*1000*1024;
      insert into t1 select 
        A.a+ 1000*B.a,  A.a+ 1000*B.a, A.a+ 1000*B.a 
      from 
        one_k A, one_k B;
      set global rocksdb_force_flush_memtable_now=1;
      

      The query (originally it was a join):

      explain 
      select * from t1 
      where 
        a in (
      100,101,102,103,104,105,106,107,108,109,
      110,111,112,113,114,115,116,117,118,119,
      120,121,122,123,124,125,126,127,128,129,
      130,131,132,133,134,135,136,137,138,139,
      140,141,142,143,144,145,146,147,148,149,
      150,151,152,153,154,155,156,157,158,159,
      160,161,162,163,164,165,166,167,168,169,
      170,171,172,173,174,175,176,177,178,179,
      180,181,182,183,184,185,186,187,188,189,
      190,191,192,193,194,195,196,197,198,199,200);
      

      +------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
      | id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                 |
      +------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
      |    1 | SIMPLE      | t1    | range | a             | a    | 9       | NULL |  293 | Using index condition |
      +------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
      

      The query itself returns 100 rows in 0.01 sec.

      Now, let's do the same in the transaction:

      begin;
      delete from t1 where a between 100 and 100+99000;
      # Query OK, 99001 rows affected (0.67 sec)
      delete from t1 where a between 99000 and 200000;
      # Query OK, 100900 rows affected (0.73 sec)
      delete from t1 where a between 20000 and 300000;
      # Query OK, 100000 rows affected (4.11 sec)
      

      Re-running the query:

      explain 
      select * from t1 
      where 
        a in (
      100,101,102,103,104,105,106,107,108,109,
      110,111,112,113,114,115,116,117,118,119,
      120,121,122,123,124,125,126,127,128,129,
      130,131,132,133,134,135,136,137,138,139,
      140,141,142,143,144,145,146,147,148,149,
      150,151,152,153,154,155,156,157,158,159,
      160,161,162,163,164,165,166,167,168,169,
      170,171,172,173,174,175,176,177,178,179,
      180,181,182,183,184,185,186,187,188,189,
      190,191,192,193,194,195,196,197,198,199,200);
      

      The EXPLAIN is the same, but the query now takes 1.25 sec.
      (this is a reduced testcase. In the original report, the join query couldn't finish at all)

      Attachments

        Activity

          Excerpt from perf report:

                      |          myrocks::ha_rocksdb::read_range_first
                      |          myrocks::ha_rocksdb::index_read_map_impl
                      |          myrocks::ha_rocksdb::read_key_exact
                      |          myrocks::ha_rocksdb::read_key_exact
                      |          rocksdb::BaseDeltaIterator::Seek
                      |          rocksdb::BaseDeltaIterator::UpdateCurrent
                      |          |          
                      |          |--56.68%--?? (inlined)
                      |          |          |          
                      |          |          |--55.69%--rocksdb::BaseDeltaIterator::UpdateCurrent
                      |          |          |          
                      |          |           --0.99%--?? (inlined)
                      |          |                     |          
                      |          |                      --0.74%--rocksdb::BaseDeltaIterator::UpdateCurrent
                      |          |          
                      |           --10.38%--rocksdb::BaseDeltaIterator::UpdateCurrent
                      |          
                      |--17.05%--rocksdb::BaseDeltaIterator::UpdateCurrent
                      |          |          
                      |          |--14.82%--rocksdb::BaseDeltaIterator::UpdateCurrent
                      |          |          |          
                      |          |           --13.34%--rocksdb::ReadableWriteBatch::GetEntryFromDataOffset
                      |          |                     |          
                      |          |                     |--8.15%--rocksdb::ReadRecordFromWriteBatch
                      |          |                     |          |          
                      |          |                     |          |--4.94%--rocksdb::ReadRecordFromWriteBatch
                      |          |                     |          |          
          

          psergei Sergei Petrunia added a comment - Excerpt from perf report: | myrocks::ha_rocksdb::read_range_first | myrocks::ha_rocksdb::index_read_map_impl | myrocks::ha_rocksdb::read_key_exact | myrocks::ha_rocksdb::read_key_exact | rocksdb::BaseDeltaIterator::Seek | rocksdb::BaseDeltaIterator::UpdateCurrent | | | |--56.68%--?? (inlined) | | | | | |--55.69%--rocksdb::BaseDeltaIterator::UpdateCurrent | | | | | --0.99%--?? (inlined) | | | | | --0.74%--rocksdb::BaseDeltaIterator::UpdateCurrent | | | --10.38%--rocksdb::BaseDeltaIterator::UpdateCurrent | |--17.05%--rocksdb::BaseDeltaIterator::UpdateCurrent | | | |--14.82%--rocksdb::BaseDeltaIterator::UpdateCurrent | | | | | --13.34%--rocksdb::ReadableWriteBatch::GetEntryFromDataOffset | | | | | |--8.15%--rocksdb::ReadRecordFromWriteBatch | | | | | | | |--4.94%--rocksdb::ReadRecordFromWriteBatch | | | |

          Debugging manually, I see that it spends a lot of time reading tombstones from the WriteBatch.

          RocksDB actually has "Iterator bounds" optimization which allows to avoid enumerating a lot of tombstones. Apparently that optimization only applies to committed data.

          psergei Sergei Petrunia added a comment - Debugging manually, I see that it spends a lot of time reading tombstones from the WriteBatch. RocksDB actually has "Iterator bounds" optimization which allows to avoid enumerating a lot of tombstones. Apparently that optimization only applies to committed data.

          The slowdown happens in BaseDeltaIterator::UpdateCurrent

          There, base_iterator_ reads committed data, delta_iterator_ reads uncommitted changes.

          When an index lookup is made, the following happens:

          • we find a match with base_iterator_.
          • we find its tombstone with delta_iterator_, and don't return it
          • we try to read the next row with base_iterator_. It returns EOF, because the
            next record is outside the iterator's upper bound.
          • but delta_iterator_ doesn't support iterator bounds, and so it continues to enumerate the tombstones.
          psergei Sergei Petrunia added a comment - The slowdown happens in BaseDeltaIterator::UpdateCurrent There, base_iterator_ reads committed data, delta_iterator_ reads uncommitted changes. When an index lookup is made, the following happens: we find a match with base_iterator_. we find its tombstone with delta_iterator_, and don't return it we try to read the next row with base_iterator_. It returns EOF, because the next record is outside the iterator's upper bound. but delta_iterator_ doesn't support iterator bounds, and so it continues to enumerate the tombstones.
          psergei Sergei Petrunia added a comment - A patch against RocksDB that fixes this: https://gist.github.com/spetrunia/919dd95b1943ab52a032f134af8951e4
          psergei Sergei Petrunia added a comment - Draft pull request: https://github.com/facebook/rocksdb/pull/5403

          Between 10.4.12 & 10.4.13, rocksdb was upgraded (no release notes about it) and the fix for this bug seems to be included. Do you confirm?

          jmuchemb Julien Muchembled added a comment - Between 10.4.12 & 10.4.13, rocksdb was upgraded (no release notes about it) and the fix for this bug seems to be included. Do you confirm?

          People

            psergei Sergei Petrunia
            psergei Sergei Petrunia
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.