[MDEV-16746] Estimates for table greater than records in the table Created: 2018-07-13  Updated: 2023-04-27

Status: Stalled
Project: MariaDB Server
Component/s: Optimizer, Storage Engine - RocksDB
Affects Version/s: 10.2, 10.3, 10.4
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Varun Gupta (Inactive) Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: upstream


 Description   

--source include/have_rocksdb.inc

Dataset

create table t1 (id1 bigint, id2 bigint, c1 bigint, c2 bigint, c3 bigint, c4 bigint, c5 bigint, c6 bigint, c7 bigint, primary key (id1, id2), index i(c1, c2))engine=rocksdb;
 
let $i=0;
while ($i<10000)
{
  inc $i;
  eval insert t1(id1, id2, c1, c2, c3, c4, c5, c6, c7)
          values($i,    0, $i,  0,  0,  0,  0,  0,  0);
}

explain select c1 from t1 where c1 > 5 limit 10;

id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	t1	range	i	i	9	NULL	13300	Using where; Using index

Looking closely, the optimizer predicts reading 13300 rows, but table t1 has only 10000 rows in total. The estimate is clearly incorrect.
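The broken invariant here is simply that a range estimate should never exceed the table's total row count. A minimal sketch of that sanity check (a hypothetical helper for illustration, not code from MyRocks or the fix in this ticket):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sanity cap: a range-scan row estimate should never
// exceed the engine's total row count for the table.
static uint64_t clamp_range_estimate(uint64_t range_rows,
                                     uint64_t table_rows) {
  return std::min(range_rows, table_rows);
}
```

With the numbers from the EXPLAIN above, clamp_range_estimate(13300, 10000) would bring the estimate back down to 10000.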



 Comments   
Comment by Varun Gupta (Inactive) [ 2018-07-16 ]

Running the query on upstream, the estimate is still greater than the number of rows in the table.

explain select c1 from t1 where c1 > 5 limit 10;
id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	t1	range	i	i	9	NULL	13492	Using where; Using index

Comment by Varun Gupta (Inactive) [ 2018-07-16 ]

In the function ha_rocksdb::info there is a call to update_stats, which sets stats.records to the appropriate value when a primary key is available. Here is the code snippet:

for (uint i = 0; i < m_tbl_def->m_key_count; i++) {
    if (is_pk(i, table, m_tbl_def)) {
      stats.data_file_length = m_pk_descr->m_stats.m_actual_disk_size;
      stats.records = m_pk_descr->m_stats.m_rows;
    } else {
      stats.index_file_length += m_key_descr_arr[i]->m_stats.m_actual_disk_size;
    }
  }

So, since our example table has a primary key, stats.records is set from the primary key's statistics.

Further on in ha_rocksdb::info we have:

    if (stats.records == 0 ||
        (rocksdb_force_compute_memtable_stats &&
         rocksdb_debug_optimizer_n_rows == 0))
    {

rocksdb_debug_optimizer_n_rows is 0 by default and rocksdb_force_compute_memtable_stats is TRUE by default, so we enter this IF branch.
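Note that with those defaults the condition is true even when stats.records is already non-zero. A small model of the quoted predicate (argument names mirror the server variables; this is an illustration, not the actual ha_rocksdb code):

```cpp
#include <cstdint>

// Models the quoted condition from ha_rocksdb::info: the branch is
// entered when there are no on-disk row stats, OR when forced memtable
// stats are enabled and the debug row-count override is unset (both of
// which are the defaults noted above).
static bool should_add_memtable_stats(uint64_t stats_records,
                                      bool force_compute_memtable_stats,
                                      uint64_t debug_optimizer_n_rows) {
  return stats_records == 0 ||
         (force_compute_memtable_stats && debug_optimizer_n_rows == 0);
}
```

With the defaults, should_add_memtable_stats(10000, true, 0) is true, so the memtable estimate is added even though stats.records already holds the on-disk count.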

Comment by Varun Gupta (Inactive) [ 2018-07-16 ]

Inside the IF branch we then have:

        rdb->GetApproximateMemTableStats(m_pk_descr->get_cf(), r,
                                         &memtableCount, &memtableSize);

and then we add these estimates to stats.records:

        stats.records += memtableCount;
        stats.data_file_length += memtableSize;

So stats.records was already 10K, and on top of that we add the value of memtableCount returned by GetApproximateMemTableStats.
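The accumulation described above can be sketched as plain arithmetic: the on-disk count and the approximate memtable count are summed with no cross-check against each other, so the result can exceed the true table size (a hypothetical standalone sketch, not the MyRocks code itself):

```cpp
#include <cstdint>

// Hypothetical sketch of the accumulation described above.
struct ApproxStats {
  uint64_t records;
  uint64_t data_file_length;
};

static ApproxStats combine_estimates(uint64_t disk_rows, uint64_t disk_bytes,
                                     uint64_t memtable_rows,
                                     uint64_t memtable_bytes) {
  ApproxStats s;
  // On-disk statistics are set first (from the SST-file stats)...
  s.records = disk_rows;
  s.data_file_length = disk_bytes;
  // ...then the approximate memtable numbers are added on top, with no
  // check that the sum stays within the actual table size.
  s.records += memtable_rows;            // e.g. 10000 + 3300 = 13300
  s.data_file_length += memtable_bytes;
  return s;
}
```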

Comment by Sergei Petrunia [ 2018-07-18 ]

So I have added this printout patch: https://gist.github.com/spetrunia/82c494ad7f26b93e6774bdd875c7fa3b

Ran this test (the testcase from this bug, with some output suppressed) : https://gist.github.com/spetrunia/73a1616827dccfbbae9008381cbb6080

I get this output, with about 20K rows reported:

TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------
worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
create table t1 (id1 bigint, id2 bigint, c1 bigint, c2 bigint, c3 bigint, c4 bigint, c5 bigint, c6 bigint, c7 bigint, primary key (id1, id2), index i(c1, c2))engine=rocksdb;
explain select c1 from t1 where c1 > 5 limit 10;
id      select_type     table   type    possible_keys   key     key_len ref     rows    Extra
1       SIMPLE  t1      range   i       i       9       NULL    19800   Using where; Using index
rocksdb._e                               [ pass ]   3989

However, looking into mysqld.err I see that we got the 20K number from GetApproximateMemTableStats:

AAA: GetApproximateSizes() call #1, value 0
AAA: GetApproximateMemTableStats() call#1 returns 0 
AAA: GetApproximateSizes() call #1, value 0
AAA: GetApproximateMemTableStats() call#1 returns 20000 
AAA: GetApproximateMemTableStats() call#2 returns 20000
AAA: GetApproximateMemTableStats() call#2 returns 20000

Comment by Sergei Petrunia [ 2018-07-18 ]

I changed the unique key to be non-unique, and I see GetApproximateMemTableStats return as many as 30K rows:

AAA: GetApproximateSizes() call #1, value 0
AAA: GetApproximateMemTableStats() call#1 returns 0 
AAA: GetApproximateSizes() call #1, value 0
AAA: GetApproximateMemTableStats() call#1 returns 13103 
AAA: GetApproximateMemTableStats() call#2 returns 30000
AAA: GetApproximateMemTableStats() call#2 returns 30000

This is fairly easy to reproduce. The EXPLAIN output in the test reports either about 14K or about 30K rows:

explain select c1 from t1 where c1 > 5 limit 10;
id      select_type     table   type    possible_keys   key     key_len ref     rows    Extra
1       SIMPLE  t1      range   i       i       9       NULL    29700   Using where; Using index

Comment by Sergei Petrunia [ 2018-07-18 ]

In the comment dated 2018-07-16, Varun wrote:

In the function ha_rocksdb::info we see there is a call made to update_stats, that sets the stats.records to the appropriate value when there is a primary key available. Here is the code snippet

I'm debugging and I don't see anything wrong:

  • ha_rocksdb::info calls ha_rocksdb::update_stats, which sets stats.records from m_pk_descr->m_stats.m_rows. The latter is on-disk data statistics.
  • then, ha_rocksdb::info has this piece:

      if (stats.records == 0) {
        rdb->GetApproximateSizes(m_pk_descr->get_cf(), &r, 1, &sz,
                                 include_flags);

That is, if the on-disk data statistics reported zero, we make a call to get the current estimate.

  • then, we add the memtable estimates (either freshly obtained from GetApproximateMemTableStats, or a cached value).

So far I don't see how either the on-disk or the memtable data estimate could be counted twice.

Comment by Ming Lin [ 2018-08-25 ]

I can't reproduce it with the latest 10.3 branch (commit 7f73f5e4e5bdae80561ff2596f5142965397a1b1):

MariaDB [test]> select count(*) from t1;
+----------+
| count(*) |
+----------+
|    10000 |
+----------+
1 row in set (0.756 sec)
 
MariaDB [test]> explain select c1 from t1 where c1 > 5 limit 10;
+------+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
| id   | select_type | table | type  | possible_keys | key  | key_len | ref  | rows | Extra                    |
+------+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
|    1 | SIMPLE      | t1    | range | i             | i    | 9       | NULL | 9900 | Using where; Using index |
+------+-------------+-------+-------+---------------+------+---------+------+------+--------------------------+
1 row in set (0.001 sec)

Comment by Varun Gupta (Inactive) [ 2018-12-05 ]

I see this failure on buildbot while working on MDEV-15253:

CURRENT_TEST: rocksdb.rocksdb_range2
--- D:/winx64-packages/build/src/storage/rocksdb/mysql-test/rocksdb/r/rocksdb_range2.result	2018-11-14 10:37:57.000000000 +0000
+++ D:\winx64-packages\build\src\storage\rocksdb\mysql-test\rocksdb\r\rocksdb_range2.reject	2018-11-14 15:00:03.924782500 +0000
@@ -8,7 +8,7 @@
 10000
 explain select c1 from t1 where c1 > 5 limit 10;
 id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
-1	SIMPLE	t1	index	i	i	18	NULL	#	Using where; Using index
+1	SIMPLE	t1	range	i	i	9	NULL	#	Using where; Using index
 drop table t1;

https://buildbot.askmonty.org/buildbot/builders/winx64-packages/builds/10524/steps/test/logs/stdio

Generated at Thu Feb 08 08:31:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.