MariaDB Server / MDEV-26886

Estimation for filtered rows less precise with JSON histogram compared to DOUBLE_PREC (#3)

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: N/A
    • Fix Version/s: 10.8.0
    • Component/s: Optimizer
    • Labels: None

    Description

      --source include/have_sequence.inc
       
      # One third are zeros, the rest are unique values
      create or replace table t (a tinyint) as select if(seq%3,seq,0) as a from seq_1_to_100;
      select count(*) from t where a <= 0;
       
      set histogram_type = JSON_HB;
      analyze table t persistent for all;
      explain format=json select * from t where a <= 0;
       
      set histogram_type = DOUBLE_PREC_HB;
      analyze table t persistent for all;
      explain format=json select * from t where a <= 0;
       
      # Cleanup
      drop table t;
      

      Tested on preview-10.7-MDEV-26519-json-histograms, commit 508f5f3f:

      select count(*) from t where a <= 0;
      count(*)
      33
      

      JSON histogram

      {
        "query_block": {
          "select_id": 1,
          "table": {
            "table_name": "t",
            "access_type": "ALL",
            "rows": 100,
            "filtered": 1.470600009,
            "attached_condition": "t.a <= 0"
          }
        }
      }
      

      DOUBLE_PREC histogram

      {
        "query_block": {
          "select_id": 1,
          "table": {
            "table_name": "t",
            "access_type": "ALL",
            "rows": 100,
            "filtered": 32.8125,
            "attached_condition": "t.a <= 0"
          }
        }
      }
      

      So the actual result set contains 33 rows out of 100. DOUBLE_PREC gives filtered=32.8125, which is very close, while the JSON histogram gives filtered=1.47, which is far off.

      Reproducible with InnoDB and MyISAM alike.
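The expected estimate can be verified independently of the server. A minimal Python sketch (not MariaDB code) that reproduces the test case's data distribution and computes the ideal "filtered" percentage:

```python
# Reproduce the test data: for seq in 1..100, the column holds 0 when
# seq is divisible by 3, otherwise seq itself (one third zeros).
rows = [0 if seq % 3 == 0 else seq for seq in range(1, 101)]

matching = sum(1 for a in rows if a <= 0)   # rows satisfying a <= 0
selectivity = 100.0 * matching / len(rows)  # ideal "filtered" percentage

print(matching)     # 33
print(selectivity)  # 33.0
```

This is the 33% that DOUBLE_PREC_HB approximates well (32.8125) and the JSON histogram misses (1.47).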

      Attachments

        Issue Links

          Activity

            elenst Elena Stepanova added a comment:

            Raised to critical for procedural reasons. Feel free to demote and remove from "must-do" scope.

            psergei Sergei Petrunia added a comment:

            Histogram_size=68.

              "histogram_hb_v2": [
                {
                  "start": "0",
                  "size": 0.33,
                  "ndv": 1
                },
                {
                  "start": "1",
                  "size": 0.01,
                  "ndv": 1
                },
                {
                  "start": "2",
                  "size": 0.01,
                  "ndv": 1
                },
                ...
            

            The issue is that the condition

              ... WHERE a <= 0
            

            produces a very small estimate that does not include the first bucket.

            psergei Sergei Petrunia added a comment:

            The computations in Histogram_json_hb::range_selectivity go as follows:

            (gdb) print equal
              $1 = true
            (gdb) print inclusive_endp
              $2 = true
            (gdb) print idx
              $3 = 0
            

            Correct so far.

            Then, we get here:

                double sel= position_in_interval(field, max_key, max_key_len,
                                                 buckets[idx].start_value,
                                                 get_end_value(idx));
            

            In get_end_value(idx), it takes this path:

                return buckets[idx+1].start_value;
            

            and returns "\001". (ISSUE-1)

            Then, position_in_interval() is called with

            position_in_interval(key="\0", left="\0", right="\001")
            

            which returns 0.0 (this is correct).

            With sel=0.0, this formula

                max= left_fract + sel * (buckets[idx].cum_fract - left_fract);
            

            also returns 0.0.
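Plugging the debugger values from the walkthrough above into that formula shows the first bucket contributing nothing (illustrative Python, not the server code). The reported filtered=1.4706 is very close to 100/68, i.e. roughly one bucket's worth out of Histogram_size=68, which presumably comes from a lower bound applied elsewhere; that last point is my assumption, not something stated in the comments:

```python
# Values observed in gdb for WHERE a <= 0 on the first bucket:
left_fract = 0.0   # cumulative fraction before the first bucket
cum_fract  = 0.33  # cumulative fraction at the end of the first bucket
sel        = 0.0   # position_in_interval(key="\0", left="\0", right="\001")

# The formula from Histogram_json_hb::range_selectivity:
max_sel = left_fract + sel * (cum_fract - left_fract)
print(max_sel)  # 0.0 -- the 33%-of-rows bucket contributes nothing
```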

            psergei Sergei Petrunia added a comment:

            ... the problem is marked with ISSUE-1. The code assumes that the values in the first bucket are "spread" between 0 and 1. It ignores the fact that the bucket has ndv=1.
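One possible correction along the lines of that observation can be sketched as follows. This is my illustration of the idea, not the committed MariaDB fix: when a bucket holds a single distinct value (ndv == 1), an inclusive endpoint equal to that value should cover the entire bucket rather than be interpolated within it:

```python
def endpoint_fraction(key, bucket, next_start, inclusive):
    """Fraction of the bucket covered by the range endpoint `key`."""
    if bucket["ndv"] == 1:
        # All rows in the bucket share one value: the bucket is either
        # fully included or fully excluded, never interpolated.
        covered = key > bucket["start"] or (inclusive and key == bucket["start"])
        return 1.0 if covered else 0.0
    # Multiple distinct values: linear interpolation is reasonable.
    return (key - bucket["start"]) / (next_start - bucket["start"])

bucket = {"start": 0, "size": 0.33, "ndv": 1}
frac = endpoint_fraction(0, bucket, 1, inclusive=True)
print(frac)  # 1.0 -> the estimate becomes ~0.33, matching the real data
```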

            People

              psergei Sergei Petrunia
              elenst Elena Stepanova