[MDEV-26849] JSON Histograms: point selectivity estimates are off for non-existent values - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.7(EOL)
Fix Version/s: 10.7.1
Component/s: Optimizer
Labels:
None

Description

Suppose the number of values in the dataset is less than the number of buckets in the histogram.

Which value should be returned when producing estimates for col='non-existant-value' ?

Currently, MariaDB's code produces the average. It uses "inclusion assumption" and assumes that the value that is looked up exists in the database. This is not what other databases do.

The deficiency comes from the fact that when one looks at MariaDB's histograms as they were specified in ~~MDEV-21130~~ and ~~MDEV-26519~~, there is no obvious way to tell whether the histogram has an exhaustive list of all values encountered in the table or not.

(Technically, if one walks through the whole histogram and only sees buckets with ndv=1, this will guarantee that the histogram bucket endpoints are all the values that were encountered... but I won't count this as "obvious").

drop table if exists ten, one_k, t1

create table ten(a int);

insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int);

insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

create table t1 ( col int);

insert into t1 select 100*A.a+100 from ten A, one_k B;

set histogram_type=json_hb;

analyze table t1 persistent for all;

Then run

analyze select * from t1 where $WHERE

and for various kinds of WHERE clause we get this:

cond	real	mariadb	mysql	postgresql
col=0	0	1000	1	1
col=50	0	1000	1	1
col=70	0	1000	1	1
col=100	1000	1000	1000	1000
col=150	0	1000	1	1
col=200	1000	1000	1000	1000

Apparently, MariaDB's assumption differs from that of other databases' and is worse...

Attachments

Issue Links

relates to

MDEV-21130 Histograms: use JSON as on-disk format

Closed

MDEV-26519 JSON Histograms: improve histogram collection

Closed

Activity

Ascending order - Click to sort in descending order

Sergei Petrunia added a comment - 2021-10-20 19:46 - edited

The effect can be seen when the number of values in the table is greater than the number of buckets in the histogram, too.

# 10 constants + long tail

@dataset_cmds= (

  "drop table if exists ten, one_k, t1",

  "create table ten(a int)",

  "insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)",

  "create table one_k(a int)",

  "insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C",

  "create table t1 ( col int)",

# 100 , 200, 300, ...

  "insert into t1 select 100*A.a+100 from ten A, one_k B",

# 10 rows in the middle of each:

#  110..120

#  210..220

#  etc

  "insert into t1 select A.a*100 + 10 + B.a from ten A, ten B, ten D",

# the same but 130-140, 230-240, etc

  "insert into t1 select A.a*100 + 30 + B.a from ten A, ten B, ten D"

);

@where_clauses= (

  "col=80",

  "col=100",

  "col=115",

  "col=185",

  "col=200",

  "col=205",

  "col=215",

  "col=255",

);

cond	real	mariadb	mysql	postgresql
col=80	0	57.1429002	1.2	10
col=100	1000	999.9999618	999.6	1000
col=115	10	8.0000004	9.6	10
col=185	0	57.1429002	1.2	10
col=200	1000	999.9999618	999.6	1000
col=205	0	57.1429002	1.2	10
col=215	10	8.0000004	9.6	10
col=255	0	57.1429002	1.2	10

Sergei Petrunia added a comment - 2021-10-20 19:46 - edited The effect can be seen when the number of values in the table is greater than the number of buckets in the histogram, too. # 10 constants + long tail @dataset_cmds = ( "drop table if exists ten, one_k, t1" , "create table ten(a int)" , "insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)" , "create table one_k(a int)" , "insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C" , "create table t1 ( col int)" , # 100 , 200, 300, ... "insert into t1 select 100*A.a+100 from ten A, one_k B" , # 10 rows in the middle of each: # 110..120 # 210..220 # etc "insert into t1 select A.a*100 + 10 + B.a from ten A, ten B, ten D" , # the same but 130-140, 230-240, etc "insert into t1 select A.a*100 + 30 + B.a from ten A, ten B, ten D" ); @where_clauses = ( "col=80" , "col=100" , "col=115" , "col=185" , "col=200" , "col=205" , "col=215" , "col=255" , ); cond real mariadb mysql postgresql col=80 0 57.1429002 1.2 10 col=100 1000 999.9999618 999.6 1000 col=115 10 8.0000004 9.6 10 col=185 0 57.1429002 1.2 10 col=200 1000 999.9999618 999.6 1000 col=205 0 57.1429002 1.2 10 col=215 10 8.0000004 9.6 10 col=255 0 57.1429002 1.2 10

Sergei Petrunia added a comment - 2021-10-22 16:40

With the patch:

cond	real	mysql	mariadb	mariadb_old	postgresql
col=80	0	1.2	1.00000008	7.12883016	10
col=100	1000	999.6	999.9999618	937.5	1000
col=115	10	9.6	8.0000004	22.09721988	10
col=185	0	1.2	1.00000008	6.40471056	10
col=200	1000	999.6	999.9999618	1031.25	1000
col=205	0	1.2	1.00000008	34.01944044	10
col=215	10	9.6	8.0000004	22.09721988	10
col=255	0	1.2	1.00000008	6.59529264	10

Sergei Petrunia added a comment - 2021-10-22 16:40 With the patch: cond real mysql mariadb mariadb_old postgresql col=80 0 1.2 1.00000008 7.12883016 10 col=100 1000 999.6 999.9999618 937.5 1000 col=115 10 9.6 8.0000004 22.09721988 10 col=185 0 1.2 1.00000008 6.40471056 10 col=200 1000 999.6 999.9999618 1031.25 1000 col=205 0 1.2 1.00000008 34.01944044 10 col=215 10 9.6 8.0000004 22.09721988 10 col=255 0 1.2 1.00000008 6.59529264 10

Sergei Petrunia added a comment - 2021-10-25 07:19

This has already been pushed:

commit a4b4cffb3c9f03d2cea0211c0a6302c779f043aa

Author: Sergei Petrunia <psergey@askmonty.org>

Date:   Fri Oct 22 19:43:19 2021 +0300

    MDEV-26849: JSON Histograms: point selectivity estimates are off

    .. for non-existent values.

    Handle this special case.

Sergei Petrunia added a comment - 2021-10-25 07:19 This has already been pushed: commit a4b4cffb3c9f03d2cea0211c0a6302c779f043aa Author: Sergei Petrunia <psergey@askmonty.org> Date: Fri Oct 22 19:43:19 2021 +0300 MDEV-26849: JSON Histograms: point selectivity estimates are off .. for non-existent values. Handle this special case.

People

Assignee:: Sergei Petrunia

Reporter:: Sergei Petrunia

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2021-10-18 13:49

Updated:: 2022-01-19 15:10

Resolved:: 2021-10-25 07:19

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server