[MDEV-27229] Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#5) Created: 2021-12-11  Updated: 2022-01-25  Resolved: 2022-01-08

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: N/A
Fix Version/s: 10.8.1

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-26519 JSON Histograms: improve histogram co... Closed

 Description   

--source include/have_sequence.inc
 
create table t1 (id int, a varchar(8)) engine=MyISAM;
insert into t1 select seq, 'bar' from seq_1_to_100;
insert into t1 select id, 'qux' from t1;
 
set histogram_type=JSON_HB;
analyze table t1 persistent for all;
 
explain format=json select COUNT(*) FROM t1 WHERE a > 'foo';
 
set histogram_type=DOUBLE_PREC_HB;
analyze table t1 persistent for all;
 
explain format=json select COUNT(*) FROM t1 WHERE a > 'foo';
 
# Cleanup
drop table t1;

The condition filters a half of the rows (obviously).

With JSON_HB the estimation is quite far off:

preview-10.8-MDEV-26519-json-histograms 98cb4351

  "query_block": {
    "select_id": 1,
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows": 200,
          "filtered": 86.55463409,
          "attached_condition": "t1.a > 'foo'"
        }
      }
    ]
  }

With DOUBLE_PREC the estimation is in this case precise:

  "query_block": {
    "select_id": 1,
    "nested_loop": [
      {
        "table": {
          "table_name": "t1",
          "access_type": "ALL",
          "rows": 200,
          "filtered": 50,
          "attached_condition": "t1.a > 'foo'"
        }
      }
    ]
  }



 Comments   
Comment by Elena Stepanova [ 2021-12-11 ]

Raised to critical on procedural reasons.

Comment by Sergei Petrunia [ 2021-12-13 ]

Investigation:
The histogram is:

{
  "target_histogram_size": 254,
  "histogram_hb": [
    {
      "start": "bar",
      "size": 0.5,
      "ndv": 1
    },
    {
      "start": "qux",
      "end": "qux",
      "size": 0.5,
      "ndv": 1
    }
  ]
}

We search for

a > 'foo'

'foo' is between 'bar' and 'qux'.

Histogram_json_hb::range_selectivity finds that 'foo' is after buckets[0].start_value and buckets[1].start_value.

It assumes that buckets[0].size values are unformly distributed between buckets[0].start_value and buckets[1].start_value.

It computes the position of 'foo' betwen 'bar' and 'qux', which is 0.26, and then returns the selectivity of 1 - (0.5 * 0.26)=0.86

Generated at Thu Feb 08 09:51:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.