MariaDB Server / MDEV-31067

selectivity_from_histogram >1.0 for col=const on DOUBLE_PREC_HB histogram

Details

    Description

      This is a public part of TODO-3823.

      For certain data distributions, the histogram code can produce selectivity > 1.0 for a column=const equality on a DOUBLE_PREC_HB histogram (I think the same can be achieved with SINGLE_PREC_HB; the point is that it is not a JSON_HB histogram).

      create table t10 (a int);
      -- Fill the table, will provide INSERTs below
      analyze table t10 persistent for all;
      

      set optimizer_trace=1;
      explain select * from t10  where a in (91303);
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      | id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      |    1 | SIMPLE      | t10   | ALL  | NULL          | NULL | NULL    | NULL | 99840 |   100.00 | Using where |
      +------+-------------+-------+------+---------------+------+---------+------+-------+----------+-------------+
      

      select json_detailed(json_extract(trace, '$**.selectivity_for_columns')) from information_schema.optimizer_trace\G
      *************************** 1. row ***************************
      json_detailed(json_extract(trace, '$**.selectivity_for_columns')): [
          [
              {
                  "column_name": "a",
                  "ranges": 
                  ["91303 <= a <= 91303"],
                  "selectivity_from_histogram": 18.28543534
              }
          ]
      ]
      

      INSERTs (synthetic data):

      create table bucket(a int);  # Has as many rows as one histogram bucket holds.
      insert into bucket select 1 from seq_1_to_780;
       
      create table one_third_of_bucket(a int);  # one-third of a bucket
      insert into one_third_of_bucket select 1 from seq_1_to_260;
       
      create table t10 (a int);
      insert into t10 select 0 from bucket, seq_1_to_4;
       
      insert into t10 select 8693 from one_third_of_bucket;
      insert into t10 select 8694 from one_third_of_bucket;
      insert into t10 select 8695 from one_third_of_bucket;
       
       
      insert into t10 select 34783 from one_third_of_bucket;
      insert into t10 select 34784 from one_third_of_bucket;
      insert into t10 select 34785 from one_third_of_bucket;
       
       
      insert into t10 select 34785 from bucket, seq_1_to_8;
       
      insert into t10 select 65214 from one_third_of_bucket;
      insert into t10 select 65215 from one_third_of_bucket;
      insert into t10 select 65216 from one_third_of_bucket;
       
      insert into t10 select 65216 from bucket, seq_1_to_52;
       
      insert into t10 select 65217 from one_third_of_bucket;
      insert into t10 select 65218 from one_third_of_bucket;
      insert into t10 select 65219 from one_third_of_bucket;
       
      insert into t10 select 65219 from bucket;
       
       
      insert into t10 select 73913 from one_third_of_bucket;
      insert into t10 select 73914 from one_third_of_bucket;
      insert into t10 select 73915 from one_third_of_bucket;
       
      insert into t10 select 73915 from bucket, seq_1_to_40;
       
       
      insert into t10 select 78257 from one_third_of_bucket;
      insert into t10 select 78258 from one_third_of_bucket;
      insert into t10 select 78259 from one_third_of_bucket;
       
      insert into t10 select 91300 from one_third_of_bucket;
      insert into t10 select 91301 from one_third_of_bucket;
      insert into t10 select 91302 from one_third_of_bucket;
       
      insert into t10 select 91302 from bucket, seq_1_to_6;
       
      insert into t10 select 91303 from one_third_of_bucket;
      insert into t10 select 91304 from one_third_of_bucket;
      insert into t10 select 91305 from one_third_of_bucket;
       
      insert into t10 select 91305 from bucket, seq_1_to_8;
       
      insert into t10 select  99998 from one_third_of_bucket;
      insert into t10 select  99999 from one_third_of_bucket;
      insert into t10 select 100000 from one_third_of_bucket;
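
      To double-check which histogram the ANALYZE above actually collected (and to force DOUBLE_PREC_HB on versions where another type, e.g. JSON_HB, is the default), something like the following should work:

      set session histogram_type='DOUBLE_PREC_HB';  -- only needed if another type is the default
      analyze table t10 persistent for all;

      select column_name, hist_type, hist_size,
             decode_histogram(hist_type, histogram) as histogram
      from mysql.column_stats
      where db_name=database() and table_name='t10';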
      

          Activity

            psergei Sergei Petrunia created issue -
            psergei Sergei Petrunia added a comment - - edited

            The problem is in Histogram::point_selectivity(), in this part of the code:

                else
                {
                  /* 
                    The value 'pos' fits within one single histogram bucket.
                    ...
                  */
                  HERE! 
            

            In particular, we arrive at a value greater than 1.0 here:

                  /*
                    So:
                    - each bucket has the same #rows 
                    - values are unformly distributed across the [min_value,max_value] domain.
              
                    If a bucket has value range that's N times bigger then average, than
                    each value will have to have N times fewer rows than average.
                  */
                  sel= avg_sel * avg_bucket_width / current_bucket_width;
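
            With the numbers from this testcase, the formula lands at ~18.28 roughly like this (back-of-the-envelope figures derived from the INSERTs, not read out of the server: ~99840 rows, 28 distinct values, a histogram of about 128 buckets over the 0..100000 domain, DOUBLE_PREC_HB endpoint positions stored at a resolution of 1/65535 of the [min,max] range):

            select
              1/28                          as avg_sel,               -- avg_frequency / #rows
              1/128                         as avg_bucket_width,
              1/65535                       as current_bucket_width,  -- the narrow bucket holding 91303..91305
              (1/28) * (1/128) / (1/65535)  as sel;                   -- ~18.285, in line with the trace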
            


            psergei Sergei Petrunia added a comment -

            Introduced in:

            commit ad842b5f058d5342c22cdc86542baa2ae9db5e70
            Author:	Sergey Petrunya <psergey@askmonty.org>  Wed Mar 26 17:55:00 2014
            Committer:	Sergey Petrunya <psergey@askmonty.org>  Wed Mar 26 17:55:00 2014
            Original File:	sql/sql_statistics.h
             
            MDEV-5926: EITS: Histogram estimates for column=least_possible_value are wrong
            [Attempt #2]
            


            psergei Sergei Petrunia added a comment -

            Trying to understand the logic behind the

             sel= avg_sel * avg_bucket_width / current_bucket_width;
            

            So, we are looking to get a #rows estimate for a value X that falls within a single bucket. That is,

              bucket_Y.left_endpoint < position(X) < bucket_Y.right_endpoint
            

            How many rows in the bucket have value X?

            point_selectivity() has an avg_sel parameter, but that is a global per-table average.

            Suppose the histogram for this particular table has two buckets:

            bucketY1 = {left_endp1,  left_endp1 + 0.5} 
            bucketY2 = {left_endp2,  left_endp2 + 0.0001}
            

            Both buckets hold the same number of rows.
            But do they hold the same number of distinct values?

            If we assume that the values are distributed uniformly across the 0.0 (min value) to 1.0 (max value) range, we might guess that bucketY1 has more distinct values in it than bucketY2.

            This is what this code is trying to do. It assumes that buckets spanning a bigger value range have a lower avg_frequency (rows have different values; there are plenty of values to choose from). By contrast, a bucket that spans a small value range has fewer distinct values, so rows in that bucket will tend to have the same value.
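
            For the two-bucket example above, plugging illustrative numbers into the formula shows where it breaks (avg_sel = 0.1 and avg_bucket_width = 0.25 are made up for this sketch):

            select 0.1 * 0.25 / 0.5    as sel_bucketY1,  -- wide bucket:   0.05, plausible
                   0.1 * 0.25 / 0.0001 as sel_bucketY2;  -- narrow bucket: 250, far above 1.0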


            psergei Sergei Petrunia added a comment -

            In the testcase for this MDEV, the bucket has a very narrow value range: its "right_endpoint - left_endpoint" is about 500x smaller than that of the average bucket.

            The code multiplies the estimate by this factor and ends up with a nonsensical value.

            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            psergei Sergei Petrunia made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            psergei Sergei Petrunia made changes -
            Priority Major [ 3 ] Blocker [ 1 ]

            psergei Sergei Petrunia added a comment - bb-10.6-mdev31067
            psergei Sergei Petrunia added a comment - - edited

            Takeaways from the optimizer call:

            Igor would prefer to see the "brave heuristic" removed completely. I'm hesitant as that will cause more changes in the estimates (and so changes in the query plans).

            But let me implement that in a patch and see whether that one gets agreement.


            psergei Sergei Petrunia added a comment -

            The alternative patch implementing suggestions from the above call: https://github.com/MariaDB/server/commits/bb-10.6-mdev31067-variant2. igor please review.


            psergei Sergei Petrunia added a comment - igor please review.
            psergei Sergei Petrunia made changes -
            Assignee Sergei Petrunia [ psergey ] Igor Babaev [ igor ]
            Status Stalled [ 10000 ] In Review [ 10002 ]

            igor Igor Babaev (Inactive) added a comment -

            psergei Sergei,
            1. This is a bug in the 10.0 code. Thus the fix should be applied starting from 10.4.
            2. Let's call a value frequently used if its position in the histogram coincides with the beginning of two adjacent buckets. Whether the probed value is frequently used can easily be checked. The number of buckets whose starting point coincides with the position of the value gives us a good lower bound for the selectivity of such a value. For values that are not frequently used we should calculate the average frequency over those values only, and it will be less than the average frequency over all values.
            Here's an example:
            100 rows with 7 distinct values, and 5 buckets with 20 rows in each bucket. One value is encountered in 40 rows and is the only frequently used value; each of the remaining values is encountered in 10 rows. The average frequency is 100/7 ~ 14.3, so the average selectivity is 14.3%, which is greater than the 10% expected for the non-frequently-used values.
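
            A quick way to build this example and check the arithmetic (a sketch only; the table name t_ex and the sequence tables are just for illustration):

            create table t_ex (a int);
            insert into t_ex select 1 from seq_1_to_40;                             -- the single frequently used value: 40 rows
            insert into t_ex select s1.seq + 1 from seq_1_to_6 s1, seq_1_to_10 s2;  -- 6 more values, 10 rows each

            select count(*) as total_rows,                        -- 100
                   count(distinct a) as ndv,                      -- 7
                   count(*) / count(distinct a) as avg_frequency  -- 100/7 ~ 14.3
            from t_ex;

            With 5 buckets of 20 rows each, the 40-row value occupies two whole buckets, which matches the "frequently used" definition above (its position coincides with the start of two adjacent buckets).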

            igor Igor Babaev (Inactive) made changes -
            Assignee Igor Babaev [ igor ] Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]
            psergei Sergei Petrunia added a comment - https://github.com/MariaDB/server/commits/bb-10.6-mdev31067-variant3 . igor please review.
            psergei Sergei Petrunia made changes -
            Assignee Sergei Petrunia [ psergey ] Igor Babaev [ igor ]
            Status Stalled [ 10000 ] In Review [ 10002 ]
            igor Igor Babaev (Inactive) made changes -
            Assignee Igor Babaev [ igor ] Sergei Petrunia [ psergey ]
            igor Igor Babaev (Inactive) made changes -
            Assignee Sergei Petrunia [ psergey ] Igor Babaev [ igor ]

            igor Igor Babaev (Inactive) added a comment -

            In my opinion the function Histogram::point_selectivity() that the patch has touched still requires some improvements.

            igor Igor Babaev (Inactive) made changes -
            Status In Review [ 10002 ] Stalled [ 10000 ]
            igor Igor Babaev (Inactive) made changes -
            Assignee Igor Babaev [ igor ] Sergei Petrunia [ psergey ]

            psergei Sergei Petrunia added a comment -

            igor,

            Yet I really mind the code that calculates selectivity for popular values: it looks too complicated and your comment does not help to understand it.

            How is this part of this MDEV? Neither the testcase for this MDEV nor any of the fixes touch that part of the code.


            serg Sergei Golubchik added a comment -

            ok to push variant 2. commit https://github.com/MariaDB/server/commit/a87c17ab566def817da2f79fb7badb7318a1e311

            It is a conservative upper-bound estimate, and it delivers lower estimates than the complex magic of popular values in variant 3. Thus I see no reason to use the complex approach if the simple and safe approach produces better estimates.

            psergei Sergei Petrunia made changes -
            Fix Version/s 10.4.29 [ 28510 ]
            Fix Version/s 10.5.20 [ 28512 ]
            Fix Version/s 10.6.13 [ 28514 ]
            Fix Version/s 10.9.6 [ 28520 ]
            Fix Version/s 10.10.4 [ 28522 ]
            Fix Version/s 10.11.3 [ 28524 ]
            Fix Version/s 10.8.8 [ 28518 ]
            Fix Version/s 10.6 [ 24028 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Johnston Rex Johnston made changes -
            Assignee Sergei Petrunia [ psergey ] Rex Johnston [ JIRAUSER52533 ]
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Stalled [ 10000 ]
            Johnston Rex Johnston made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]
            Johnston Rex Johnston made changes -
            Priority Blocker [ 1 ] Major [ 3 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.11 [ 27614 ]
            Fix Version/s 10.4.29 [ 28510 ]
            Fix Version/s 10.5.20 [ 28512 ]
            Fix Version/s 10.6.13 [ 28514 ]
            Fix Version/s 10.8.8 [ 28518 ]
            Fix Version/s 10.9.6 [ 28520 ]
            Fix Version/s 10.10.4 [ 28522 ]
            Fix Version/s 10.11.3 [ 28524 ]
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Critical [ 2 ] Blocker [ 1 ]
            Johnston Rex Johnston made changes -
            Fix Version/s 10.10.4 [ 28522 ]
            Fix Version/s 10.9.6 [ 28520 ]
            Fix Version/s 10.8.8 [ 28518 ]
            Fix Version/s 10.6.13 [ 28514 ]
            Fix Version/s 10.5.20 [ 28512 ]
            Fix Version/s 10.4.29 [ 28510 ]
            Fix Version/s 10.11.3 [ 28524 ]
            Fix Version/s 10.4 [ 22408 ]
            Fix Version/s 10.5 [ 23123 ]
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.10 [ 27530 ]
            Fix Version/s 10.11 [ 27614 ]
            Assignee Rex Johnston [ JIRAUSER52533 ] Sergei Petrunia [ psergey ]
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Closed [ 6 ]
            Johnston Rex Johnston made changes -
            Comment [ Example, with patch 85cc83188059d0cd280aa9f9e290dc8f025a4c3c

            {code:sql}
            drop table if exists bucket, one_third_of_bucket, t10;
            create table bucket(a int); # This holds how many rows we hold in a bucket.
            insert into bucket select 1 from seq_1_to_780;
             
            create table one_third_of_bucket(a int); # one-third of a bucket
            insert into one_third_of_bucket select 1 from seq_1_to_260;
             
            create table t10 (a int);
            insert into t10 select 0 from bucket, seq_1_to_4;
             
            insert into t10 select 86930 from one_third_of_bucket;
            insert into t10 select 86940 from one_third_of_bucket;
            insert into t10 select 86950 from one_third_of_bucket;
             
             
            insert into t10 select 347830 from one_third_of_bucket;
            insert into t10 select 347840 from one_third_of_bucket;
            insert into t10 select 347850 from one_third_of_bucket;
             
             
            insert into t10 select 347850 from bucket, seq_1_to_8;
             
            insert into t10 select 652140 from one_third_of_bucket;
            insert into t10 select 652150 from one_third_of_bucket;
            insert into t10 select 652160 from one_third_of_bucket;
             
            insert into t10 select 652160 from bucket, seq_1_to_52;
             
            insert into t10 select 652170 from one_third_of_bucket;
            insert into t10 select 652180 from one_third_of_bucket;
            insert into t10 select 652190 from one_third_of_bucket;
             
            insert into t10 select 652190 from bucket;
             
             
            insert into t10 select 739130 from one_third_of_bucket;
            insert into t10 select 739140 from one_third_of_bucket;
            insert into t10 select 739150 from one_third_of_bucket;
             
            insert into t10 select 739150 from bucket, seq_1_to_40;
             
             
            insert into t10 select 782570 from one_third_of_bucket;
            insert into t10 select 782580 from one_third_of_bucket;
            insert into t10 select 782590 from one_third_of_bucket;
             
            insert into t10 select 913000 from one_third_of_bucket;
            insert into t10 select 913010 from one_third_of_bucket;
            insert into t10 select 913020 from one_third_of_bucket;
             
            insert into t10 select 913020 from bucket, seq_1_to_6;
             
            insert into t10 select 913030 from one_third_of_bucket;
            insert into t10 select 913040 from one_third_of_bucket;
            insert into t10 select 913050 from one_third_of_bucket;
             
            insert into t10 select 913050 from bucket, seq_1_to_8;
             
            insert into t10 select 999980 from one_third_of_bucket;
            insert into t10 select 999990 from one_third_of_bucket;
            insert into t10 select 1000000 from one_third_of_bucket;

            analyze table t10 persistent for all;
            {code}

            and

            {code:sql}
            set optimizer_trace=1;
            explain select * from t10 where a in (86930, 86931, 86932, 86933, 86934, 86935, 86936, 86937, 86938, 86939, 86940, 86941, 86942, 86943, 86944, 86945, 86946, 86947, 86948, 86949, 86950, 86951, 86952, 86953, 86954, 86955, 86956, 86957, 86958, 86959, 86960, 86961, 86962, 86963, 86964, 86965, 86966, 86967, 86968, 86969, 86960, 86971, 86972, 86973, 86974, 86975, 86976, 86977, 86978, 86979, 86980, 86981, 86982, 86983, 86984, 86985, 86986, 86987, 86988, 86989, 86990, 86991, 86992, 86993, 86994, 86995, 86996, 86997, 86998, 86999, 87000, 87001, 87002, 87003, 87004, 87005, 87006, 87007, 87008, 87009, 87010, 87011, 87012, 87013, 87014, 87015, 87016, 87017, 87018, 87019, 87020, 87021, 87022, 87023, 87024, 87025, 87026, 87027, 87028, 87029, 87030, 87031, 87032, 87033, 87034, 87035, 87036, 87037, 87038, 87039, 87040, 87041, 87042, 87043, 87044, 87045, 87046, 87047, 87048, 87049, 87050, 87051, 87052, 87053, 87054, 87055, 87056, 87057, 87058, 87059);
            select json_detailed(json_extract(trace, '$**.selectivity_for_columns')) from information_schema.optimizer_trace;
            {code}

            produces

            {noformat}
            | [
                [
                    {
                        "column_name": "a",
                        "ranges":
                        [
                            "86930 <= a <= 86930",
                            "86931 <= a <= 86931",
                            "86932 <= a <= 86932",
                            "86933 <= a <= 86933",
                            "86934 <= a <= 86934",
                            "86935 <= a <= 86935",
            <SNIP>
                            "87055 <= a <= 87055",
                            "87056 <= a <= 87056",
                            "87057 <= a <= 87057",
                            "87058 <= a <= 87058",
                            "87059 <= a <= 87059"
                        ],
                        "selectivity_from_histogram": 1.0078
                    }
                ]
            ] |

            {noformat}

            All these values fit into one bucket. The correct selectivity is 0.0078125.
            ]
            Johnston Rex Johnston made changes -
            Comment [ Both of these patches can still produce a selectivity > 1. ]
            Johnston Rex Johnston made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 172089

            People

              Assignee: psergei Sergei Petrunia
              Reporter: psergei Sergei Petrunia
              Votes: 1
              Watchers: 7
