[MDEV-26125] Histograms: limited histogram precision causes imprecise estimates Created: 2021-07-11  Updated: 2023-10-03  Resolved: 2023-10-03

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6
Fix Version/s: 10.8.1

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: histogram-test

Attachments: PNG File screenshot-1.png    
Issue Links:
Relates
relates to MDEV-21130 Histograms: use JSON as on-disk format Closed

 Description   

This is a known limitation but I'm filing a testcase for the record.

Generate a 1-millon row population with column "Country" having value distribution according to the "world" database:

source mysql-test/include/world_schema.inc 
source mysql-test/include/world.inc 
 
create table generated_pop (
  pk int auto_increment primary key,
  Country char(52)
);

set @total_pop= (select Sum(population) from Country) - 1;
 
create table tmp1 as 
SELECT 
  name, 
  SUM(Population) over (order by Name) as cumulative_pop
FROM Country;

create table ten(a int primary key);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
 
create table one_k(a int primary key);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table rand_numbers as
select 
  rand() * @total_pop as col
from 
  one_k A, one_k B;
 
insert into generated_pop (Country)
 select 
  (select name from tmp1 
   where cumulative_pop > col 
   order by cumulative_pop limit 1
  )
from rand_numbers;

analyze table generated_pop persistent for all;

mysql> select count(*) from generated_pop where country='China';
+----------+
| count(*) |
+----------+
|   210306 |
+----------+
1 row in set (3.00 sec)
 
mysql> select count(*) from generated_pop where country='Chile';
+----------+
| count(*) |
+----------+
|     2479 |
+----------+
1 row in set (3.49 sec)

As expected, China has a much larger population than Chile.

But histogram doesn't allow the optimizer to find this out:

mysql> explain extended select  * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.01 sec)
 
mysql> explain extended select  * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.09 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)



 Comments   
Comment by Sergei Petrunia [ 2021-07-11 ]

MySQL 8, for comparison:
(not exactly the same data due to different seed for RAND, real numbers are: 210369 in China and 2509 in Chile)

mysql> explain select  * from generated_pop where country='China'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: generated_pop
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 917136
     filtered: 21.07
        Extra: Using where
1 row in set, 1 warning (0.00 sec)
 
mysql> explain select  * from generated_pop where country='Chile'\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: generated_pop
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 917136
     filtered: 0.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

Comment by Michael Okoko [ 2021-08-23 ]

Using the build from PR-1854, this is the output using JSON histograms:

MariaDB [pop_test]> set histogram_type=json;
Query OK, 0 rows affected (0.000 sec)
 
MariaDB [pop_test]> analyze table generated_pop persistent for all;
+------------------------+---------+----------+-----------------------------------------+
| Table                  | Op      | Msg_type | Msg_text                                |
+------------------------+---------+----------+-----------------------------------------+
| pop_test.generated_pop | analyze | status   | Engine-independent statistics collected |
| pop_test.generated_pop | analyze | status   | OK                                      |
+------------------------+---------+----------+-----------------------------------------+
2 rows in set (6.719 sec)
 
MariaDB [pop_test]> explain extended select  * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |     0.78 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)
 
MariaDB [pop_test]> explain extended select  * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 |    21.18 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.001 sec)
 
MariaDB [pop_test]>

Comment by Sergei Petrunia [ 2022-01-15 ]

Re-trying with the latest 10.8 tree with JSON_HB Histograms:

MariaDB [pop1]> analyze select * from generated_pop where country='China';
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | r_rows     | filtered | r_filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | 1000000.00 |    21.04 |      21.04 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
1 row in set (0.164 sec)

MariaDB [pop1]> analyze select * from generated_pop where country='Chile';
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | r_rows     | filtered | r_filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | 1000000.00 |     0.13 |       0.25 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
1 row in set (0.157 sec)

MariaDB [pop1]> analyze select * from generated_pop where country='Sweden';
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
| id   | select_type | table         | type | possible_keys | key  | key_len | ref  | rows    | r_rows     | filtered | r_filtered | Extra       |
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
|    1 | SIMPLE      | generated_pop | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | 1000000.00 |     0.08 |       0.15 | Using where |
+------+-------------+---------------+------+---------------+------+---------+------+---------+------------+----------+------------+-------------+
1 row in set (0.157 sec)

Comment by Sergei Petrunia [ 2022-01-18 ]

Comment by Sergei Petrunia [ 2023-10-03 ]

Fixed in 10.8 by MDEV-21130, JSON_HB histograms.

Generated at Thu Feb 08 09:42:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.