[MDEV-26709] JSON histogram may contain bucketS than histogram_size allows Created: 2021-09-28  Updated: 2022-01-19  Resolved: 2021-10-11

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: N/A
Fix Version/s: 10.7.1

Type: Bug Priority: Critical
Reporter: Elena Stepanova Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-21130 Histograms: use JSON as on-disk format Closed
is caused by MDEV-26519 JSON Histograms: improve histogram co... Closed

 Description   

One of points from our earlier slack discussion was that there can not be more buckets in a JSON histogram than histogram_size. The below example proves the contrary.

It is possible that the conclusion was wrong or too generic and there are allowed exceptions from the rule. These exception will need to be a part of histogram_size description in the KB. The description has to be adjust anyway, as now it says that histogram_size means the number of buckets for JSON histograms, and it's certainly not the case.

--source include/have_sequence.inc
 
create or replace table t (a int);
insert into t values (1),(3),(5),(7);
insert into t select 2 from seq_1_to_25;
insert into t select 4 from seq_1_to_25;
insert into t select 6 from seq_1_to_25;
 
set histogram_size=4, histogram_type=JSON_HB;
analyze table t persistent for all;
 
select histogram from mysql.column_stats where table_name = 't';
 
drop table t;

preview-10.7-MDEV-26519-json-histograms da8bb4b470

select histogram from mysql.column_stats where table_name = 't';
histogram
{
  "histogram_hb_v2": [
    {
      "start": "1",
      "size": 0.240506329,
      "ndv": 2
    },
    {
      "start": "2",
      "size": 0.240506329,
      "ndv": 3
    },
    {
      "start": "4",
      "size": 0.189873418,
      "ndv": 2
    },
    {
      "start": "6",
      "size": 0.316455696,
      "ndv": 1
    },
    {
      "start": "7",
      "end": "7",
      "size": 0.012658228,
      "ndv": 1
    }
  ]
}



 Comments   
Comment by Elena Stepanova [ 2021-09-28 ]

Another case of an extra bucket. This one also shows that this extra bucket is of a "wrong" size.

--source include/have_sequence.inc
 
create or replace table t (a int);
insert into t select seq from seq_10_to_90;
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;
 
set histogram_type= JSON_HB, histogram_size= 5;
analyze table t persistent for all;
 
select histogram from mysql.column_stats where table_name in ('t');
 
# Cleanup
drop table t;

select histogram from mysql.column_stats where table_name in ('t');
histogram
{
  "histogram_hb_v2": [
    {
      "start": "10",
      "size": 0.199074074,
      "ndv": 17
    },
    {
      "start": "26",
      "size": 0.199074074,
      "ndv": 17
    },
    {
      "start": "42",
      "size": 0.199074074,
      "ndv": 17
    },
    {
      "start": "58",
      "size": 0.199074074,
      "ndv": 17
    },
    {
      "start": "74",
      "size": 0.199074074,
      "ndv": 17
    },
    {
      "start": "90",
      "end": "90",
      "size": 0.00462963,
      "ndv": 1
    }
  ]
}

Comment by Sergei Petrunia [ 2021-10-11 ]

... no, it's not about "one more bucket". It can be way more buckets.

Example:

source mysql-test/include/world_schema.inc
source mysql-test/include/world.inc
set histogram_type='JSON_HB';
set histogram_size=50;
analyze table Country persistent for all;
 
select JSON_LENGTH(histogram, '$.histogram_hb_v2') from mysql.column_stats where column_name='Code';
+---------------------------------------------+
| JSON_LENGTH(histogram, '$.histogram_hb_v2') |
+---------------------------------------------+
|                                          60 |
+---------------------------------------------+
1 row in set (0.00 sec)

Comment by Sergei Petrunia [ 2021-10-11 ]

239 rows, 50 buckets...

gives bucket_capacity=4.78 = 4

The current code will put 4 rows in each bucket and will collect 500 buckets after having examined 50*4= 200 rows.
39 rows that are left will need 39/4 = 9.8 rows. This will give 59.8 = 60 buckets.

It looks like the idea that we should compute some integer number for rows_in_bucket is not a good one.
How about allowing fractions in rows_in_bucket ...

Generated at Thu Feb 08 09:47:21 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.