[MDEV-20693] ha_rocksdb::records_in_range() vastly overestimates the number of rows in certain ranges Created: 2019-09-27  Updated: 2023-06-22

Status: Confirmed
Project: MariaDB Server
Component/s: Storage Engine - RocksDB
Affects Version/s: 10.2, 10.3, 10.4
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: upstream, upstream-fixed

Attachments: File psergey-mdev20693.diff    
Issue Links:
Blocks
is blocked by MDEV-30610 Update RocksDB to the latest upstream... Open

 Description   

Consider this testcase (not necessarily minimal):

CREATE TABLE obj2 (
  part_id smallint(5) unsigned NOT NULL,
  oid bigint(20) unsigned NOT NULL,
  tid bigint(20) unsigned NOT NULL,
  filler char(32),
  KEY tid (part_id,tid,oid)
) ENGINE=ROCKSDB;

create table ten(a int primary key);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
 
create table one_k(a int primary key);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
 
create table hundred(a int primary key);
insert into hundred select A.a + B.a* 10 from ten A, ten B;
 
set rocksdb_bulk_load_size=10000, rocksdb_commit_in_the_middle=1;
 
insert into obj2
select 
  0, 
  A.a + 1000*B.a + 1000000*C.a,
  A.a + 1000*B.a + 1000000*C.a,
  'filler-data'
from one_k A, one_k B, hundred C;
 
insert into obj2
select 
  1,
  A.a + 1000*B.a + 1000000*C.a,
  A.a + 1000*B.a + 1000000*C.a,
  'filler-data'
from one_k A, one_k B, hundred C;
set global rocksdb_force_flush_memtable_and_lzero_now =1;

Ok, now table obj2 has these rows (each part_id group has 1000*1000*100 = 100,000,000 rows):

part_id=0, tid=0
part_id=0, tid=1
...
part_id=0, tid=99999999
part_id=1, tid=0
part_id=1, tid=1
...
part_id=1, tid=99999999
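As a sanity check on the sizes involved, each of the two INSERT ... SELECT statements above is a 1000 x 1000 x 100 cross join. A quick sketch (plain Python, not part of the testcase) of the expected group size and the largest tid actually inserted:

```python
# Each INSERT selects from one_k A, one_k B, hundred C:
# tid = A.a + 1000*B.a + 1000000*C.a, with A.a, B.a in 0..999 and C.a in 0..99.
rows_per_group = 1000 * 1000 * 100          # rows per part_id value
max_tid = 999 + 1000 * 999 + 1000000 * 99   # largest tid actually inserted

print(rows_per_group)  # 100000000
print(max_tid)         # 99999999
```

Since the largest tid is 99,999,999, the range tid > 100,000,000 used below is empty.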

Reading the rows at the very end of the group with part_id=0:

analyze format=json select * from obj2 force index (tid) where part_id=0 and tid>100000000\G
*************************** 1. row ***************************
ANALYZE: {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 0.2885,
    "table": {
      "table_name": "obj2",
      "access_type": "range",
      "possible_keys": ["tid"],
      "key": "tid",
      "key_length": "10",
      "used_key_parts": ["part_id", "tid"],
      "r_loops": 1,
      "rows": 104100545,
      "r_rows": 0,
      "filtered": 52.05,
      "r_filtered": 100,
      "index_condition": "obj2.part_id = 0 and obj2.tid > 100000000"
    }
  }
}

Note this part:

      "rows": 104100545,
      "r_rows": 0,

There are 0 rows in the range, but the estimate is about 104 million rows — roughly half of the 200-million-row table! Without FORCE INDEX, index tid will not be used and a full table scan will be done instead, which kills performance.



 Comments   
Comment by Sergei Petrunia [ 2019-09-27 ]

The WHERE clause

part_id=0 and tid>100000000

Examining the range bounds that are passed to ha_rocksdb::records_in_range():

(gdb) p *min_key
  $6 = {key = 0x7fa920195730 "", length = 10, keypart_map = 3, flag = HA_READ_AFTER_KEY}
(gdb) p *max_key
  $7 = {key = 0x7fa8e3a2dd60 "", length = 2, keypart_map = 1, flag = HA_READ_AFTER_KEY}

(gdb) x/10x min_key->key
  0x7fa920195730:	0x00	0x00	0x00	0xe1	0xf5	0x05	0x00	0x00
  0x7fa920195738:	0x00	0x00
(gdb) x/2x max_key->key
  0x7fa8e3a2dd60:	0x00	0x00

The lookup constants:
part_id is 2 bytes, 00 00 at the front.

tid is 8 bytes. Converting the lookup value to hex:

100000000 = 0x05F5E100
# as 8 bytes:
00 00 00 00 05 f5 e1 00

(the key buffer stores the value little-endian, hence the 00 e1 f5 05 byte order in the dump above)

Everything is as expected so far.
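For illustration, the min_key byte layout from the gdb dump can be reproduced with a short Python sketch (assuming the usual little-endian storage of the integer key parts in the server-side key buffer; keypart_map=3 covers both parts):

```python
import struct

# part_id: 2-byte unsigned, tid: 8-byte unsigned, both little-endian
# in the server-side key buffer.
min_key = struct.pack('<HQ', 0, 100000000)

print(min_key.hex(' '))  # 00 00 00 e1 f5 05 00 00 00 00
```

This matches the 10 bytes shown by `x/10x min_key->key` above.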

Comment by Sergei Petrunia [ 2019-09-27 ]

Following the execution in DBImpl::GetApproximateSizes(), one sees these endpoints:

(gdb) p range->start
  $12 = {data_ = 0x7fa8d00c35c8 "", size_ = 14}
(gdb) p range->limit
  $13 = {data_ = 0x7fa8d00c14f8 "", size_ = 14}

(gdb) x/14x range->start.data_
  0x00	0x00	0x01	0x00	0x00	0x00	0x00	0x00
  |== index nr=============|   |= part_id =|   |== tid ===...
  0x00	0x00	0x05	0xf5	0xe1	0x01
  |==== ok, this is the lookup value +1 ===|
 
(gdb) x/14x range->limit.data_
  0x00	0x00	0x01	0x00	0x00	0x01	0xff	0xff
  |== index nr=============|   |= part_id? =|   |== tid? ===...
  0xff	0xff	0xff	0xff	0xff	0xff
  |==== we got 0xFF instead of tid? =====|

Comment by Sergei Petrunia [ 2019-09-27 ]

So, the problem is the upper endpoint:

0x00	0x01	0xff ...

The code in ha_rocksdb::records_in_range() pads the upper endpoint with 0xFF bytes, which places it past every key in the part_id=1 group and makes the estimated range cover half of the table!
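To see why the 0xFF padding is so damaging, here is a schematic sketch (plain Python, not the actual storage-engine code) using big-endian mem-comparable keys as in the dumps above; the 4-byte index number is taken from the dump, and the 0x00-padded alternative is only an illustration of a tighter bound, not the actual upstream fix:

```python
import struct

INDEX_NR = bytes.fromhex('00000100')  # 4-byte index number from the dumps

def key(part_id: int, tid: int) -> bytes:
    # Mem-comparable form: big-endian, so bytewise order == numeric order.
    return INDEX_NR + struct.pack('>HQ', part_id, tid)

# Upper endpoint for "part_id=0 AND tid > ...": the prefix successor is part_id=1.
prefix = INDEX_NR + struct.pack('>H', 1)
limit_ff = prefix + b'\xff' * 8   # what records_in_range produced (0xFF padding)
limit_00 = prefix + b'\x00' * 8   # the start of the part_id=1 group

some_part1_row = key(1, 50_000_000)
# The 0xFF-padded limit sorts after every part_id=1 row, so the whole
# 100M-row part_id=1 group falls inside the estimated range:
print(some_part1_row < limit_ff)  # True
# A 0x00-padded limit stops right at the start of the part_id=1 group:
print(some_part1_row < limit_00)  # False
```

The difference between the two limits is exactly the 100M rows of part_id=1, which matches the ~104M-row estimate for an empty range.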

Comment by Sergei Petrunia [ 2019-09-28 ]

On the upstream fb/mysql-5.6, the optimizer will use ref access on the first key component only:

mysql> explain select * from obj2 where part_id=0 and tid>100000000;
+----+-------------+-------+------+---------------+------+---------+-------+----------+-----------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows     | Extra                 |
+----+-------------+-------+------+---------------+------+---------+-------+----------+-----------------------+
|  1 | SIMPLE      | obj2  | ref  | tid           | tid  | 2       | const | 50535761 | Using index condition |
+----+-------------+-------+------+---------------+------+---------+-------+----------+-----------------------+

It's not the best query plan, but it means the issue is not directly reproducible there.

Comment by Sergei Petrunia [ 2019-09-28 ]

... but one can use a single-table DELETE and then the issue is reproducible:

mysql> explain delete from obj2 where part_id=0 and tid>100000000;
+----+-------------+-------+------+---------------+------+---------+------+-----------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+-----------+-------------+
|  1 | SIMPLE      | obj2  | ALL  | tid           | NULL | NULL    | NULL | 100074094 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+-----------+-------------+
1 row in set (0.00 sec)

...

        "analyzing_range_alternatives": {
          "range_scan_alternatives": [
            {
              "index": "tid",
              "ranges": [
                "0 <= part_id <= 0 AND 100000000 < tid"
              ],
              "index_dives_for_eq_ranges": true,
              "rowid_ordered": false,
              "using_mrr": false,
              "index_only": false,
              "rows": 50535761,
              "cost": 6.06e7,
              "chosen": false,
              "cause": "cost"
            }
          ]

Comment by Sergei Petrunia [ 2019-09-29 ]

Filed an issue against the upstream: https://github.com/facebook/mysql-5.6/issues/1052

Filed a pull request against the upstream: https://github.com/facebook/mysql-5.6/pull/1053

Comment by Sergei Petrunia [ 2019-09-29 ]

Patch against MariaDB: psergey-mdev20693.diff

Comment by Sergei Petrunia [ 2019-10-27 ]

Fixed in the upstream by:

commit 2b1e7918066a967b3a48fe486e5687d786aee052
Author: Sergei Petrunia <psergey@askmonty.org>
Date:   Tue Oct 1 15:29:38 2019 -0700
 
    #1052: ha_rocksdb::records_in_range() vastly overestimates #rows (#1053)

Generated at Thu Feb 08 09:01:27 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.