[MDEV-26709] JSON histogram may contain bucketS than histogram_size allows - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: N/A
Fix Version/s: 10.7.1
Component/s: Optimizer
Labels:
None

Description

One of points from our earlier slack discussion was that there can not be more buckets in a JSON histogram than histogram_size. The below example proves the contrary.

It is possible that the conclusion was wrong or too generic and there are allowed exceptions from the rule. These exception will need to be a part of histogram_size description in the KB. The description has to be adjust anyway, as now it says that histogram_size means the number of buckets for JSON histograms, and it's certainly not the case.

--source include/have_sequence.inc

create or replace table t (a int);

insert into t values (1),(3),(5),(7);

insert into t select 2 from seq_1_to_25;

insert into t select 4 from seq_1_to_25;

insert into t select 6 from seq_1_to_25;

set histogram_size=4, histogram_type=JSON_HB;

analyze table t persistent for all;

select histogram from mysql.column_stats where table_name = 't';

drop table t;

preview-10.7-MDEV-26519-json-histograms da8bb4b470
select histogram from mysql.column_stats where table_name = 't';
histogram
{
"histogram_hb_v2": [
{
"start": "1",
"size": 0.240506329,
"ndv": 2
},
{
"start": "2",
"size": 0.240506329,
"ndv": 3
},
{
"start": "4",
"size": 0.189873418,
"ndv": 2
},
{
"start": "6",
"size": 0.316455696,
"ndv": 1
},
{
"start": "7",
"end": "7",
"size": 0.012658228,
"ndv": 1
}
]
}

Attachments

Issue Links

is caused by

MDEV-21130 Histograms: use JSON as on-disk format

Closed

MDEV-26519 JSON Histograms: improve histogram collection

Closed

Activity

Ascending order - Click to sort in descending order

Elena Stepanova added a comment - 2021-09-28 21:49

Another case of an extra bucket. This one also shows that this extra bucket is of a "wrong" size.

--source include/have_sequence.inc

create or replace table t (a int);

insert into t select seq from seq_10_to_90;

insert into t select * from t;

insert into t select * from t;

insert into t select * from t;

set histogram_type= JSON_HB, histogram_size= 5;

analyze table t persistent for all;

select histogram from mysql.column_stats where table_name in ('t');

# Cleanup

drop table t;

select histogram from mysql.column_stats where table_name in ('t');

histogram

  "histogram_hb_v2": [

      "start": "10",

      "size": 0.199074074,

      "ndv": 17

},

      "start": "26",

      "size": 0.199074074,

      "ndv": 17

},

      "start": "42",

      "size": 0.199074074,

      "ndv": 17

},

      "start": "58",

      "size": 0.199074074,

      "ndv": 17

},

      "start": "74",

      "size": 0.199074074,

      "ndv": 17

},

      "start": "90",

      "end": "90",

      "size": 0.00462963,

      "ndv": 1

Elena Stepanova added a comment - 2021-09-28 21:49 Another case of an extra bucket. This one also shows that this extra bucket is of a "wrong" size. --source include/have_sequence.inc create or replace table t (a int ); insert into t select seq from seq_10_to_90; insert into t select * from t; insert into t select * from t; insert into t select * from t; set histogram_type= JSON_HB, histogram_size= 5; analyze table t persistent for all ; select histogram from mysql.column_stats where table_name in ( 't' ); # Cleanup drop table t; select histogram from mysql.column_stats where table_name in ( 't' ); histogram { "histogram_hb_v2" : [ { "start" : "10" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "26" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "42" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "58" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "74" , "size" : 0.199074074, "ndv" : 17 }, { "start" : "90" , "end" : "90" , "size" : 0.00462963, "ndv" : 1 } ] }

Sergei Petrunia added a comment - 2021-10-11 12:12 - edited

... no, it's not about "one more bucket". It can be way more buckets.

Example:

source mysql-test/include/world_schema.inc

source mysql-test/include/world.inc

set histogram_type='JSON_HB';

set histogram_size=50;

analyze table Country persistent for all;

select JSON_LENGTH(histogram, '$.histogram_hb_v2') from mysql.column_stats where column_name='Code';

+---------------------------------------------+

| JSON_LENGTH(histogram, '$.histogram_hb_v2') |

+---------------------------------------------+

|                                          60 |

+---------------------------------------------+

1 row in set (0.00 sec)

Sergei Petrunia added a comment - 2021-10-11 12:12 - edited ... no, it's not about "one more bucket". It can be way more buckets. Example: source mysql-test/include/world_schema.inc source mysql-test/include/world.inc set histogram_type='JSON_HB'; set histogram_size=50; analyze table Country persistent for all; select JSON_LENGTH(histogram, '$.histogram_hb_v2') from mysql.column_stats where column_name='Code'; +---------------------------------------------+ | JSON_LENGTH(histogram, '$.histogram_hb_v2') | +---------------------------------------------+ | 60 | +---------------------------------------------+ 1 row in set (0.00 sec)

Sergei Petrunia added a comment - 2021-10-11 12:21

239 rows, 50 buckets...

gives bucket_capacity=4.78 = 4

The current code will put 4 rows in each bucket and will collect 500 buckets after having examined 50*4= 200 rows.
39 rows that are left will need 39/4 = 9.8 rows. This will give 59.8 = 60 buckets.

It looks like the idea that we should compute some integer number for rows_in_bucket is not a good one.
How about allowing fractions in rows_in_bucket ...

Sergei Petrunia added a comment - 2021-10-11 12:21 239 rows, 50 buckets... gives bucket_capacity=4.78 = 4 The current code will put 4 rows in each bucket and will collect 500 buckets after having examined 50*4= 200 rows. 39 rows that are left will need 39/4 = 9.8 rows. This will give 59.8 = 60 buckets. It looks like the idea that we should compute some integer number for rows_in_bucket is not a good one. How about allowing fractions in rows_in_bucket ...

People

Assignee:: Sergei Petrunia

Reporter:: Elena Stepanova

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2021-09-28 16:41

Updated:: 2022-01-19 15:10

Resolved:: 2021-10-11 14:08

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server