[MDEV-26886] Estimation for filtered rows less precise with JSON histogram compared to DOUBLE_PREC (#3) Created: 2021-10-22  Updated: 2022-01-19  Resolved: 2021-11-26

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: N/A
Fix Version/s: 10.8.0

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-26519 JSON Histograms: improve histogram co... Closed

 Description   

--source include/have_sequence.inc
 
# One third are zeros, the rest are unique values
create or replace table t (a tinyint) as select if(seq%3,seq,0) as a from seq_1_to_100;
select count(*) from t where a <= 0;
 
set histogram_type = JSON_HB;
analyze table t persistent for all;
explain format=json select * from t where a <= 0;
 
set histogram_type = DOUBLE_PREC_HB;
analyze table t persistent for all;
explain format=json select * from t where a <= 0;
 
# Cleanup
drop table t;

preview-10.7-MDEV-26519-json-histograms 508f5f3f

select count(*) from t where a <= 0;
count(*)
33

JSON histogram

{
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "t",
      "access_type": "ALL",
      "rows": 100,
      "filtered": 1.470600009,
      "attached_condition": "t.a <= 0"
    }
  }
}

DOUBLE_PREC histogram

{
  "query_block": {
    "select_id": 1,
    "table": {
      "table_name": "t",
      "access_type": "ALL",
      "rows": 100,
      "filtered": 32.8125,
      "attached_condition": "t.a <= 0"
    }
  }
}

So, the actual result set contains 33 rows out of 100: DOUBLE_PREC gives filtered=32.8125, which is very close, while the JSON histogram gives filtered=1.47, which is far off.

Reproducible with InnoDB and MyISAM alike.



 Comments   
Comment by Elena Stepanova [ 2021-10-22 ]

Raised to critical for procedural reasons. Feel free to demote and remove from "must-do" scope.

Comment by Sergei Petrunia [ 2021-11-26 ]

Histogram_size=68.

  "histogram_hb_v2": [
    {
      "start": "0",
      "size": 0.33,
      "ndv": 1
    },
    {
      "start": "1",
      "size": 0.01,
      "ndv": 1
    },
    {
      "start": "2",
      "size": 0.01,
      "ndv": 1
    },
    ...

The issue is that a condition

  ... WHERE  a <= 0

produces a very small estimate that doesn't include the first bucket.
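To make the expectation concrete, here is a hedged sketch (illustration only, not MariaDB code) of how the selectivity of `a <= 0` should be read off these buckets: the first bucket has ndv=1 and its single value is 0, so the whole bucket fraction (0.33) should be counted.

```python
# Buckets as shown in the histogram_hb_v2 dump above (first three only).
buckets = [{"start": 0, "size": 0.33, "ndv": 1},
           {"start": 1, "size": 0.01, "ndv": 1},
           {"start": 2, "size": 0.01, "ndv": 1}]

def expected_selectivity_le(key, buckets):
    """Sum the fractions of all buckets whose values are <= key.
    Simplifying assumption: each bucket holds a single value (ndv == 1)."""
    return sum(b["size"] for b in buckets if b["start"] <= key)

print(expected_selectivity_le(0, buckets))  # 0.33, i.e. filtered ~= 33%
```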

Comment by Sergei Petrunia [ 2021-11-26 ]

The computations in Histogram_json_hb::range_selectivity go as follows:

(gdb) print equal
  $1 = true
(gdb) print inclusive_endp
  $2 = true
(gdb) print idx
  $3 = 0

Correct so far.

Then, we get here:

    double sel= position_in_interval(field, max_key, max_key_len,
                                     buckets[idx].start_value,
                                     get_end_value(idx));

In get_end_value(idx), execution takes this path:

    return buckets[idx+1].start_value;

and returns "\001". (ISSUE-1)

Then, position_in_interval() is called with

position_in_interval(key="\0", left="\0", right="\001")

which returns 0.0 (this is correct).

With sel=0.0, this formula

    max= left_fract + sel * (buckets[idx].cum_fract - left_fract);

also returns 0.0.
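The buggy arithmetic can be reproduced in a few lines. This is a hedged sketch whose variable names mirror the gdb walkthrough above, not the actual Histogram_json_hb implementation:

```python
left_fract = 0.0   # cumulative fraction before the first bucket
cum_fract  = 0.33  # buckets[0].cum_fract: the first bucket holds 33% of rows
sel        = 0.0   # position_in_interval(key="\0", left="\0", right="\001")

# The endpoint is interpolated inside the bucket, so sel=0.0 wipes out
# the whole bucket's contribution:
max_sel = left_fract + sel * (cum_fract - left_fract)
print(max_sel)     # 0.0 -- hence the far-too-low "filtered" estimate
```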

Comment by Sergei Petrunia [ 2021-11-26 ]

... the problem is marked with ISSUE-1. The code assumes that the values in the first bucket are "spread" between 0 and 1. It ignores the fact that the bucket has ndv=1.
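A hedged sketch of the idea behind such a fix (an illustration of the principle, not the actual patch): when a bucket has ndv=1, every row in it equals the bucket's start value, so an inclusive endpoint that matches that value should take the bucket's full fraction instead of interpolating within it.

```python
def bucket_fraction(ndv, inclusive_endp, key_equals_start, sel):
    """Return the fraction of the bucket covered by the range endpoint.
    Hypothetical helper for illustration; not a MariaDB function."""
    if ndv == 1 and inclusive_endp and key_equals_start:
        return 1.0   # single-value bucket fully matched: take it whole
    return sel       # otherwise interpolate as before

left_fract, cum_fract = 0.0, 0.33
sel = bucket_fraction(ndv=1, inclusive_endp=True, key_equals_start=True, sel=0.0)
max_sel = left_fract + sel * (cum_fract - left_fract)
print(max_sel)       # 0.33 -> filtered ~= 33%, in line with DOUBLE_PREC
```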

Generated at Thu Feb 08 09:48:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.