[MDEV-19424] InnoDB's records_in_range estimates are capped at about 50% - Jira

Sergei Petrunia created issue - 2019-05-09 09:13

Sergei Petrunia made changes - 2019-05-09 09:15

Field	Original Value	New Value
Description	InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this). If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table. Test dataset: {code:sql} create table ten(a int); insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); create table one_k(a int); insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C; create table t1 ( a int, b varchar(32), c int, key(a), key(b) ) engine=innodb; {code} {code:sql} # 100K NULLs insert into t1 select null, null, A.a + 1000B.a from one_k A, one_k B where B.a<100; # 900 K non-NULLs insert into t1 select A.a + 1000B.a, A.a + 1000B.a, A.a + 1000B.a from one_k A, one_k B where B.a>=100; {code} {code} explain select * from t1 force index (a) where a is not null ; +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| id \| select_type \| table \| type \| possible_keys \| key \| key_len \| ref \| rows \| Extra \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| 1 \| SIMPLE \| t1 \| range \| a \| a \| 5 \| NULL \| 494308 \| Using index condition \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ {code} {code} explain select * from t1 force index (b) where b is not null ; +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| id \| select_type \| table \| type \| possible_keys \| key \| key_len \| ref \| rows \| Extra \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| 1 \| SIMPLE \| t1 \| range \| b \| b \| 35 \| NULL \| 494308 \| Using index condition \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ {code}	InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this). If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table. Test dataset: {code:sql} create table ten(a int); insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9); create table one_k(a int); insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C; create table t1 ( a int, b varchar(32), c int, key(a), key(b) ) engine=innodb; {code} {code:sql} # 100K NULLs insert into t1 select null, null, A.a + 1000B.a from one_k A, one_k B where B.a<100; # 900 K non-NULLs insert into t1 select A.a + 1000B.a, A.a + 1000B.a, A.a + 1000B.a from one_k A, one_k B where B.a>=100; {code} Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table. But EXPLAIN will show the estimates of about 50% of the table: {code} explain select * from t1 force index (a) where a is not null ; +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| id \| select_type \| table \| type \| possible_keys \| key \| key_len \| ref \| rows \| Extra \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| 1 \| SIMPLE \| t1 \| range \| a \| a \| 5 \| NULL \| 494308 \| Using index condition \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ {code} {code} explain select * from t1 force index (b) where b is not null ; +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| id \| select_type \| table \| type \| possible_keys \| key \| key_len \| ref \| rows \| Extra \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ \| 1 \| SIMPLE \| t1 \| range \| b \| b \| 35 \| NULL \| 494308 \| Using index condition \| +------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+ {code}

Sergei Petrunia made changes - 2019-05-09 09:18

Description

InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

Test dataset:
{code:sql}
create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table t1 (
  a int,
  b varchar(32),
  c int,
  key(a),
  key(b)
) engine=innodb;
{code}

{code:sql}
# 100K NULLs
insert into t1 select
  null,
  null,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a<100;

# 900 K non-NULLs
insert into t1 select
  A.a + 1000*B.a,
  A.a + 1000*B.a,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a>=100;
{code}

Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table.
But EXPLAIN will show the estimates of about 50% of the table:

{code}
explain select * from t1 force index (a) where a is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

{code}
explain select * from t1 force index (b) where b is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

Test dataset:
{code:sql}
create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table t1 (
  a int,
  b varchar(32),
  c int,
  key(a),
  key(b)
) engine=innodb;
{code}

{code:sql}
# 100K NULLs
insert into t1 select
  null,
  null,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a<100;

# 900 K non-NULLs
insert into t1 select
  A.a + 1000*B.a,
  A.a + 1000*B.a,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a>=100;
{code}

Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table.
But EXPLAIN will show the estimates of about 50% of the table:

{code}
explain select * from t1 force index (a) where a is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

{code}
explain select * from t1 force index (b) where b is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

When the optimizer was simple, this property was not a problem.

A MySQL-5.0 era optimizer

Sergei Petrunia made changes - 2019-05-09 09:26

Description

InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

Test dataset:
{code:sql}
create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table t1 (
  a int,
  b varchar(32),
  c int,
  key(a),
  key(b)
) engine=innodb;
{code}

{code:sql}
# 100K NULLs
insert into t1 select
  null,
  null,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a<100;

# 900 K non-NULLs
insert into t1 select
  A.a + 1000*B.a,
  A.a + 1000*B.a,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a>=100;
{code}

Now, either {{a IS NOT NULL}} or {{b IS NOT NULL}} should match about 50% of the rows in the table.
But EXPLAIN will show the estimates of about 50% of the table:

{code}
explain select * from t1 force index (a) where a is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

{code}
explain select * from t1 force index (b) where b is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

When the optimizer was simple, this property was not a problem.

A MySQL-5.0 era optimizer

InnoDB's {{records_in_range}} estimates seem to be capped at ~50% of the table. (We used to observe this on various occasions before but I haven't been able to find an MDEV for this).

If I pass a range that contains a bigger fraction of the table, the estimated number of rows is still around 50% of the total rows in the table.

Test dataset:
{code:sql}
create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table t1 (
  a int,
  b varchar(32),
  c int,
  key(a),
  key(b)
) engine=innodb;
{code}

{code:sql}
# 100K NULLs
insert into t1 select
  null,
  null,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a<100;

# 900 K non-NULLs
insert into t1 select
  A.a + 1000*B.a,
  A.a + 1000*B.a,
  A.a + 1000*B.a
from
  one_k A,
  one_k B
where
  B.a>=100;
{code}

Now, both {{a IS NOT NULL}} or {{b IS NOT NULL}} match 900K rows (90% of the table).
But EXPLAIN will show the estimates of about 500K rows, which is 50% of the table:

{code}
explain select * from t1 force index (a) where a is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | a | a | 5 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

{code}
explain select * from t1 force index (b) where b is not null ;
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
| 1 | SIMPLE | t1 | range | b | b | 35 | NULL | 494308 | Using index condition |
+------+-------------+-------+-------+---------------+------+---------+------+--------+-----------------------+
{code}

When the optimizer was simple, this property was not a problem.

A simple optimizer would only use range estimates to construct range access. Range access is cheaper than full table if it covers about 30% of the table. Returning 50% of the table instead of 90% was not an issue.

A more advanced optimizer also attempts to use range estimates for condition selectivity, etc. Here, returning 50% selectivity instead of 90% is a problem. (One must take into account that selectivity is computed for multiple indexes. For example, for 5 indexes 0.5^2= 1/32 . 32x under-estimation of selectivity)

Sergei Petrunia made changes - 2019-05-09 09:34

Link

This issue relates to ~~MDEV-15777~~ [ ~~MDEV-15777~~ ]

Sachin Setiya (Inactive) made changes - 2019-05-15 05:17

Assignee

Sachin Setiya [ sachin.setiya.007 ]

Elena Stepanova made changes - 2019-06-28 00:40

Fix Version/s		10.1 [ 16100 ]
Fix Version/s		10.2 [ 14601 ]
Fix Version/s		10.3 [ 22126 ]
Fix Version/s		10.4 [ 22408 ]

Andrei Elkin made changes - 2019-06-28 16:08

Link

This issue relates to ~~MDEV-7140~~ [ ~~MDEV-7140~~ ]

Andrei Elkin made changes - 2019-06-28 18:13

Link

This issue relates to ~~MDEV-7140~~ [ ~~MDEV-7140~~ ]

Sergei Petrunia made changes - 2019-11-23 18:56

Link

This issue relates to ~~MDEV-17111~~ [ ~~MDEV-17111~~ ]

Sergei Petrunia made changes - 2019-11-23 19:06

Labels

records_in_range

Sergei Petrunia made changes - 2019-11-23 19:07

Link

This issue relates to ~~MDEV-21136~~ [ ~~MDEV-21136~~ ]

Marko Mäkelä made changes - 2020-03-09 11:20

Link

This issue relates to MDEV-21895 [ MDEV-21895 ]

Julien Fritsch made changes - 2020-12-01 12:38

Fix Version/s

10.1 [ 16100 ]

Sergei Golubchik made changes - 2021-12-06 21:33

Workflow

MariaDB v3 [ 96631 ]

MariaDB v4 [ 141284 ]

Marko Mäkelä made changes - 2022-02-21 14:35

Assignee

Sachin Setiya [ sachin.setiya.007 ]

Sergei Petrunia [ psergey ]

Ralf Gebhardt made changes - 2022-08-04 08:44

Fix Version/s

10.2 [ 14601 ]

Julien Fritsch made changes - 2023-04-27 14:27

Fix Version/s

10.3 [ 22126 ]

Sergei Petrunia made changes - 2025-01-07 13:57

Fix Version/s		11.0.1 [ 28548 ]
Fix Version/s	10.4(EOL) [ 22408 ]
Resolution		Fixed [ 1 ]
Status	Open [ 1 ]	Closed [ 6 ]

MariaDB Server

InnoDB's records_in_range estimates are capped at about 50%

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration