[MDEV-19670] MyRocks: key lookups into deleted data are very slow Created: 2019-06-01  Updated: 2023-04-27

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Affects Version/s: 10.2, 10.3, 10.4
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 1
Labels: None


 Description   

This is based on a report from a user.
Index lookups become very slow after a bulk delete operation.

Steps to reproduce, using a purely synthetic dataset:

create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
 
create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table t1 (
  pk bigint primary key,
  a bigint,
  filler varchar(32),
  key(a)
)engine=rocksdb;
 
set rocksdb_max_row_locks=1000*1000*1024;
insert into t1 select
  A.a + 1000*B.a, A.a + 1000*B.a, A.a + 1000*B.a
from
  one_k A, one_k B;
set global rocksdb_force_flush_memtable_now=1;

The query (originally it was a join):

explain 
select * from t1 
where 
  a in (
100,101,102,103,104,105,106,107,108,109,
110,111,112,113,114,115,116,117,118,119,
120,121,122,123,124,125,126,127,128,129,
130,131,132,133,134,135,136,137,138,139,
140,141,142,143,144,145,146,147,148,149,
150,151,152,153,154,155,156,157,158,159,
160,161,162,163,164,165,166,167,168,169,
170,171,172,173,174,175,176,177,178,179,
180,181,182,183,184,185,186,187,188,189,
190,191,192,193,194,195,196,197,198,199,200);

+------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                 |
+------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+
|    1 | SIMPLE      | t1    | range | a             | a    | 9       | NULL |  293 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+------+-----------------------+

The query itself returns 100 rows in 0.01 sec.

Now, let's do the same inside a transaction:

begin;
delete from t1 where a between 100 and 100+99000;
# Query OK, 99001 rows affected (0.67 sec)
delete from t1 where a between 99000 and 200000;
# Query OK, 100900 rows affected (0.73 sec)
delete from t1 where a between 20000 and 300000;
# Query OK, 100000 rows affected (4.11 sec)

Re-running the query:

explain 
select * from t1 
where 
  a in (
100,101,102,103,104,105,106,107,108,109,
110,111,112,113,114,115,116,117,118,119,
120,121,122,123,124,125,126,127,128,129,
130,131,132,133,134,135,136,137,138,139,
140,141,142,143,144,145,146,147,148,149,
150,151,152,153,154,155,156,157,158,159,
160,161,162,163,164,165,166,167,168,169,
170,171,172,173,174,175,176,177,178,179,
180,181,182,183,184,185,186,187,188,189,
190,191,192,193,194,195,196,197,198,199,200);

The EXPLAIN output is the same, but the query now takes 1.25 sec.
(This is a reduced test case; in the original report, the join query could not finish at all.)



 Comments   
Comment by Sergei Petrunia [ 2019-06-01 ]

Excerpt from perf report:

            |          myrocks::ha_rocksdb::read_range_first
            |          myrocks::ha_rocksdb::index_read_map_impl
            |          myrocks::ha_rocksdb::read_key_exact
            |          myrocks::ha_rocksdb::read_key_exact
            |          rocksdb::BaseDeltaIterator::Seek
            |          rocksdb::BaseDeltaIterator::UpdateCurrent
            |          |          
            |          |--56.68%--?? (inlined)
            |          |          |          
            |          |          |--55.69%--rocksdb::BaseDeltaIterator::UpdateCurrent
            |          |          |          
            |          |           --0.99%--?? (inlined)
            |          |                     |          
            |          |                      --0.74%--rocksdb::BaseDeltaIterator::UpdateCurrent
            |          |          
            |           --10.38%--rocksdb::BaseDeltaIterator::UpdateCurrent
            |          
            |--17.05%--rocksdb::BaseDeltaIterator::UpdateCurrent
            |          |          
            |          |--14.82%--rocksdb::BaseDeltaIterator::UpdateCurrent
            |          |          |          
            |          |           --13.34%--rocksdb::ReadableWriteBatch::GetEntryFromDataOffset
            |          |                     |          
            |          |                     |--8.15%--rocksdb::ReadRecordFromWriteBatch
            |          |                     |          |          
            |          |                     |          |--4.94%--rocksdb::ReadRecordFromWriteBatch
            |          |                     |          |          

Comment by Sergei Petrunia [ 2019-06-01 ]

Debugging manually, I see that it spends a lot of time reading tombstones from the WriteBatch.

RocksDB actually has an "iterator bounds" optimization which makes it possible to avoid enumerating a lot of tombstones. Apparently, that optimization only applies to committed data.
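A minimal self-contained Python sketch of the "iterator bounds" idea (a hypothetical model, not RocksDB code; the `scan` function and its parameters are invented for illustration, though the real API on the committed-data side is `rocksdb::ReadOptions::iterate_upper_bound`). An iterator that knows its upper bound can stop as soon as the key reaches it, instead of stepping over a long run of tombstoned keys:

```python
def scan(keys, start, upper_bound=None):
    """Count how many entries a seek-and-scan touches (hypothetical model)."""
    touched = 0
    for k in keys:            # keys are sorted, as in an LSM iterator
        if k < start:
            continue          # seek: skip keys before the lookup target
        if upper_bound is not None and k >= upper_bound:
            break             # bound reached: stop without touching the rest
        touched += 1          # a tombstone or live key was examined
    return touched

# A long run of consecutive (mostly tombstoned) keys after the one we look up:
keys = list(range(100, 300_101))
print(scan(keys, 100))        # unbounded: walks the whole run
print(scan(keys, 100, 101))   # bounded point lookup: touches one entry
```

The bounded call models what happens on committed data; the unbounded one models the write-batch side, where no such bound is applied.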

Comment by Sergei Petrunia [ 2019-06-01 ]

The slowdown happens in BaseDeltaIterator::UpdateCurrent.

There, base_iterator_ reads committed data, while delta_iterator_ reads uncommitted changes.

When an index lookup is made, the following happens:

  • we find a match with base_iterator_;
  • we find its tombstone with delta_iterator_, so we do not return the row;
  • we try to read the next row with base_iterator_. It returns EOF, because the
    next record is outside the iterator's upper bound;
  • but delta_iterator_ does not support iterator bounds, so it continues to enumerate the tombstones.

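The sequence above can be modeled with a small self-contained Python sketch (a hypothetical model, not RocksDB code; `seek`, `base`, and `delta` are invented names echoing base_iterator_ and delta_iterator_). It merges a bounded "base" iterator over committed keys with a write-batch "delta" iterator, and counts how many delta entries a single point lookup examines with and without bounds on the delta side:

```python
from bisect import bisect_left

def seek(base, delta, target, upper_bound, delta_has_bounds):
    """Return (matching keys, number of delta entries examined)."""
    out, delta_steps = [], 0
    b = bisect_left(base, target)        # position base_iterator_
    d = bisect_left(delta, (target,))    # position delta_iterator_
    while True:
        base_key = base[b] if b < len(base) and base[b] < upper_bound else None
        delta_key = None
        if d < len(delta):
            k = delta[d][0]
            if not delta_has_bounds or k < upper_bound:
                delta_key = k
        if base_key is None and delta_key is None:
            break                        # both sub-iterators at EOF
        if delta_key is not None and (base_key is None or delta_key <= base_key):
            delta_steps += 1
            if delta[d][1] == 'put':
                out.append(delta_key)
            # a 'del' tombstone hides the committed row with the same key
            if base_key is not None and delta_key == base_key:
                b += 1
            d += 1
        else:
            out.append(base_key)
            b += 1
    return out, delta_steps

base = list(range(1_000_000))                      # committed rows
delta = [(k, 'del') for k in range(100, 300_001)]  # bulk-delete tombstones

# Point lookup for key 100 with upper bound 101:
rows, steps_no_bounds = seek(base, delta, 100, 101, delta_has_bounds=False)
_, steps_bounds = seek(base, delta, 100, 101, delta_has_bounds=True)
print(steps_no_bounds)   # without delta bounds: scans every tombstone
print(steps_bounds)      # with delta bounds: stops right after the match
```

In both runs the lookup correctly returns no row (the key is tombstoned), but without bounds on the delta side the merge walks the entire run of tombstones in the write batch, which is the cost observed in the test case.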
Comment by Sergei Petrunia [ 2019-06-01 ]

A patch against RocksDB that fixes this: https://gist.github.com/spetrunia/919dd95b1943ab52a032f134af8951e4

Comment by Sergei Petrunia [ 2019-06-01 ]

Draft pull request: https://github.com/facebook/rocksdb/pull/5403

Comment by Julien Muchembled [ 2020-05-15 ]

Between 10.4.12 and 10.4.13, RocksDB was upgraded (with no mention in the release notes), and the fix for this bug seems to be included. Can you confirm?

Generated at Thu Feb 08 08:53:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.