MariaDB Server / MDEV-20693

ha_rocksdb::records_in_range() vastly overestimates the number of rows in certain ranges

Details

    Description

      Consider this testcase (not necessarily minimal):

      CREATE TABLE obj2 (
        part_id smallint(5) unsigned NOT NULL,
        oid bigint(20) unsigned NOT NULL,
        tid bigint(20) unsigned NOT NULL,
        filler char(32),
        KEY tid (part_id,tid,oid)
      ) ENGINE=ROCKSDB;
      

      create table ten(a int primary key);
      insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
       
      create table one_k(a int primary key);
      insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;
       
      create table hundred(a int primary key);
      insert into hundred select A.a + B.a* 10 from ten A, ten B;
       
      set rocksdb_bulk_load_size=10000, rocksdb_commit_in_the_middle=1;
       
      insert into obj2
      select 
        0, 
        A.a + 1000*B.a + 1000000*C.a,
        A.a + 1000*B.a + 1000000*C.a,
        'filler-data'
      from one_k A, one_k B, hundred C;
       
      insert into obj2
      select 
        1,
        A.a + 1000*B.a + 1000000*C.a,
        A.a + 1000*B.a + 1000000*C.a,
        'filler-data'
      from one_k A, one_k B, hundred C;
      set global rocksdb_force_flush_memtable_and_lzero_now=1;
      

      Ok, now table obj2 has these rows (each INSERT above produces 1000 × 1000 × 100 = 100M rows, so 200M rows in total):

      part_id=0, tid=0
      part_id=0, tid=1
      ...
      part_id=0, tid=100M
      part_id=1, tid=0
      part_id=1, tid=1
      ...
      part_id=1, tid=100M
      

      Reading the rows at the very end of the group with part_id=0:

      analyze format=json select * from obj2 force index (tid) where part_id=0 and tid>100000000\G
      *************************** 1. row ***************************
      ANALYZE: {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 0.2885,
          "table": {
            "table_name": "obj2",
            "access_type": "range",
            "possible_keys": ["tid"],
            "key": "tid",
            "key_length": "10",
            "used_key_parts": ["part_id", "tid"],
            "r_loops": 1,
            "rows": 104100545,
            "r_rows": 0,
            "filtered": 52.05,
            "r_filtered": 100,
            "index_condition": "obj2.part_id = 0 and obj2.tid > 100000000"
          }
        }
      }
      

      Note this part:

            "rows": 104100545,
            "r_rows": 0,
      

      There are 0 rows in the range, but the estimate is 100M rows! Without FORCE INDEX, index tid will not be used and a full table scan will be done instead, which kills the performance.

    Activity

            psergei Sergei Petrunia added a comment - edited

            The WHERE clause

            part_id=0 and tid>100000000
            

            Examining the range bounds that are passed to ha_rocksdb::records_in_range:

            (gdb) p *min_key
              $6 = {key = 0x7fa920195730 "", length = 10, keypart_map = 3, flag = HA_READ_AFTER_KEY}
            (gdb) p *max_key
              $7 = {key = 0x7fa8e3a2dd60 "", length = 2, keypart_map = 1, flag = HA_READ_AFTER_KEY}
            

            (gdb) x/10x min_key->key
              0x7fa920195730:	0x00	0x00	0x00	0xe1	0xf5	0x05	0x00	0x00
              0x7fa920195738:	0x00	0x00
            (gdb) x/2x max_key->key
              0x7fa8e3a2dd60:	0x00	0x00
            

            The lookup constants:
            part_id is 2 bytes, 00 00 at the front.

            tid is 8 bytes, converting to hex:

            100000000 = 0x5f5e100
            # big-endian that is:
            00 00 00 00 05 f5 e1 00
            # (the key buffer dumped above stores it little-endian:
            #  00 e1 f5 05 00 00 00 00)

            everything is as expected so far
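
            As an aside on the encodings involved (a sketch, assuming the
            usual integer key handling; to_memcomparable is a name made up
            here, not the MyRocks function): the key_range buffers above hold
            the integers in native little-endian order, but RocksDB orders
            keys with memcmp(), so MyRocks re-encodes each key part big-endian
            ("mem-comparable") before building the RocksDB lookup keys:

            #include <array>
            #include <cstdint>

            // Sketch: big-endian ("mem-comparable") encoding of an unsigned
            // 64-bit key part, so that memcmp() on the bytes agrees with
            // numeric order. Illustrative only, not the MyRocks encoder.
            std::array<uint8_t, 8> to_memcomparable(uint64_t v) {
              std::array<uint8_t, 8> out;
              for (int i = 0; i < 8; ++i)
                out[i] = static_cast<uint8_t>(v >> (8 * (7 - i)));  // MSB first
              return out;
            }

            // to_memcomparable(100000000) gives 00 00 00 00 05 f5 e1 00,
            // exactly the tid bytes seen in the RocksDB keys below.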

            psergei Sergei Petrunia added a comment

            Following the execution in DBImpl::GetApproximateSizes(), one sees these endpoints:

            (gdb) p range->start
              $12 = {data_ = 0x7fa8d00c35c8 "", size_ = 14}
            (gdb) p range->limit
              $13 = {data_ = 0x7fa8d00c14f8 "", size_ = 14}
            

            (gdb) x/14x range->start.data_
              0x00	0x00	0x01	0x00	0x00	0x00	0x00	0x00
              |== index nr=============|   |= part_id =|   |== tid ===...
              0x00	0x00	0x05	0xf5	0xe1	0x01
              |==== ok, this is the lookup value +1 ===|
             
            (gdb) x/14x range->limit.data_
              0x00	0x00	0x01	0x00	0x00	0x01	0xff	0xff
              |== index nr=============|   |= part_id? =|   |== tid? ===...
              0xff	0xff	0xff	0xff	0xff	0xff
              |==== we got 0xFF instead of tid? =====|
            
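
            To make the damage concrete (an illustrative check with two
            hand-built 14-byte keys taken from the dump above; the layout is
            4-byte index nr, 2-byte part_id, 8-byte big-endian tid):

            #include <cassert>
            #include <cstring>

            int main() {
              // The limit produced for GetApproximateSizes(): part_id bumped
              // to 1, then the missing tid bytes padded with 0xFF.
              unsigned char limit[14] = {0x00, 0x00, 0x01, 0x00, 0x00, 0x01,
                                         0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
                                         0xff, 0xff};
              // The very first row of the part_id=1 group: part_id=1, tid=0.
              unsigned char part1_row[14] = {0x00, 0x00, 0x01, 0x00, 0x00, 0x01,
                                             0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                                             0x00, 0x00};
              // Every part_id=1 key compares below the limit, so
              // GetApproximateSizes() counts the whole part_id=1 group
              // (~100M rows) into the estimate for this range.
              assert(memcmp(part1_row, limit, 14) < 0);
              return 0;
            }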

            psergei Sergei Petrunia added a comment

            So, the problem is the upper endpoint:

            0x00	0x01	0xff ...
            

            The code in ha_rocksdb::records_in_range modifies it so that the range includes half of the table: part_id is bumped to the next value and the missing tid bytes are padded with 0xFF!
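
            The correct exclusive upper bound for "everything with part_id=0"
            is the successor of the 6-byte prefix itself, with no padding:
            under RocksDB's bytewise ordering, the short key 00 00 01 00 00 01
            already sorts before every part_id=1 row key. A minimal sketch of
            that idea (prefix_successor is a name made up here; this is not
            the actual patch):

            #include <cstdint>
            #include <string>

            // Shortest key strictly greater than every key that starts with
            // 'prefix': drop trailing 0xFF bytes, then increment the last
            // remaining byte. Returns an empty string for an all-0xFF prefix
            // (no successor exists; the caller must use "end of index").
            std::string prefix_successor(std::string prefix) {
              size_t i = prefix.size();
              while (i > 0 && static_cast<uint8_t>(prefix[i - 1]) == 0xff)
                --i;
              if (i == 0) return {};
              prefix[i - 1] =
                  static_cast<char>(static_cast<uint8_t>(prefix[i - 1]) + 1);
              prefix.resize(i);
              return prefix;
            }

            // On the 6-byte prefix 00 00 01 00 00 00 (index nr + part_id=0)
            // this yields 00 00 01 00 00 01, which excludes all of part_id=1,
            // unlike the 0xFF-padded 14-byte limit above.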

            psergei Sergei Petrunia added a comment

            On the upstream fb/mysql-5.6, the optimizer will use ref access on the first key component only:

            mysql> explain select * from obj2 where part_id=0 and tid>100000000;
            +----+-------------+-------+------+---------------+------+---------+-------+----------+-----------------------+
            | id | select_type | table | type | possible_keys | key  | key_len | ref   | rows     | Extra                 |
            +----+-------------+-------+------+---------------+------+---------+-------+----------+-----------------------+
            |  1 | SIMPLE      | obj2  | ref  | tid           | tid  | 2       | const | 50535761 | Using index condition |
            +----+-------------+-------+------+---------------+------+---------+-------+----------+-----------------------+
            

            It's not the best query plan, but it means the issue is not directly reproducible there with a plain SELECT.

            psergei Sergei Petrunia added a comment

            ... but one can use a single-table DELETE and then the issue is reproducible:

            mysql> explain delete from obj2 where part_id=0 and tid>100000000;
            +----+-------------+-------+------+---------------+------+---------+------+-----------+-------------+
            | id | select_type | table | type | possible_keys | key  | key_len | ref  | rows      | Extra       |
            +----+-------------+-------+------+---------------+------+---------+------+-----------+-------------+
            |  1 | SIMPLE      | obj2  | ALL  | tid           | NULL | NULL    | NULL | 100074094 | Using where |
            +----+-------------+-------+------+---------------+------+---------+------+-----------+-------------+
            1 row in set (0.00 sec)
            

            ...

                    "analyzing_range_alternatives": {
                      "range_scan_alternatives": [
                        {
                          "index": "tid",
                          "ranges": [
                            "0 <= part_id <= 0 AND 100000000 < tid"
                          ],
                          "index_dives_for_eq_ranges": true,
                          "rowid_ordered": false,
                          "using_mrr": false,
                          "index_only": false,
                          "rows": 50535761,
                          "cost": 6.06e7,
                          "chosen": false,
                          "cause": "cost"
                        }
                      ]
            

            psergei Sergei Petrunia added a comment

            Filed an issue against the upstream: https://github.com/facebook/mysql-5.6/issues/1052

            Filed a pull request against the upstream: https://github.com/facebook/mysql-5.6/pull/1053

            psergei Sergei Petrunia added a comment

            Patch against MariaDB: psergey-mdev20693.diff

            psergei Sergei Petrunia added a comment

            Fixed in the upstream by:

            commit 2b1e7918066a967b3a48fe486e5687d786aee052
            Author: Sergei Petrunia <psergey@askmonty.org>
            Date:   Tue Oct 1 15:29:38 2019 -0700
             
                #1052: ha_rocksdb::records_in_range() vastly overestimates #rows (#1053)
            

