[MDEV-21130] Histograms: use JSON as on-disk format - Jira

Details

Type: Task
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Fix Version/s: 10.8.1
Component/s: Optimizer
Labels:
- eits
- gsoc20
- gsoc21

Description

Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

CREATE TABLE mysql.column_stats (

  min_value varbinary(255) DEFAULT NULL,

  max_value varbinary(255) DEFAULT NULL,

...

  hist_size tinyint unsigned,

  hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),

  histogram varbinary(255),

...

This prevents us from supporting other kinds of histograms.

The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

The idea of this MDEV is to switch to JSON as storage format for histograms.

If we do that, it will:

Improve the histogram precision
Allow the DBAs to examine the histograms
Enable other histogram types to be collected/used.

Milestone-1:

Let histogram_type have another possible value, tentative name "JSON"
when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"

  { "hello":"world"}

that is, the following should work:

set histogram_type='json';

analyze table t1 persisent for all;

select histogram from mysql.column_stats where table_name='t1' ;

this should produce {"hello":"world"}.

Milestone-2: produce JSON with histogram.

- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

  "value1",

  "value2",

...

Milestone-2, part#2: make mysql.column_stats.histogram a blob.

Milestone-3: Parse the JSON back into an array

Figure out how to use the JSON parser.
Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
(Additional input provided on Zulip re parsing valid/invalid JSON histograms)

Milestone-4: Make the code support different kinds of Histograms

Currently, there's only one type of histogram.

smaller issue: histogram lookup functions assume the histogram stores fractions, not values.
bigger issue: memory allocation for histograms is de-coupled from reading the histograms. See alloc_statistics_for_table, read_histograms_for_table.

The histogram object lives in a data structure that is bzero'ed first and then filled later (IIRC there was a bug (fixed) where the optimizer attempted to use bzero'ed histogram)

Can histograms be collected or loaded in parallel by several threads? This was an (unintentional?) possibility but then it was disabled (see TABLE_STATISTICS_CB object and its use)

Step #0: Make Histogram a real class

Here's the commit:
https://github.com/MariaDB/server/commit/3ac32917ab6c42a5a0f9ed817dd8d3c7e20ce34d

Step 1: Separate classes for binary and JSON histograms

Need to introduce

class Histogram -- interface, no data members.

class Histogram_binary : public Histogram

class Histogram_json : public Histogram

and a factory function

Histogram *create_histogram(Histogram_type)

for now, let Histogram_json::point_selectivity() and Histogram_json::range_selectivity() return 0.1 and 0.5, respectively.

Step 2: Demonstrate saving/loading of histograms

Now, the code already can:

collect a JSON histogram and save it.
when loading a histogram, figure from histogram_type column that this is JSON histogram being loaded, create Histogram_json and invoke the parse function.

Parse function at the moment only prints to stderr.
However, we should catch parse errors and make sure they are reported to the client.
The test may look like this:

INSERT INTO mysql.column_stats VALUES('test','t1','column1', .... '[invalid, json, data']);

FLUSH TABLES;

# this should print some descriptive test

--error NNNN

select * from test.t1;

Milestone-5: Parse the JSON data into a structure that allows lookups.

The structure is

std::vector<std::string>

and it holds the data in KeyTupleFormat (See the comments for reasoning. There was a suggestion to use in_vector (This is what IN subqueries use) but it didn't work out)

Milestone 5.1 (aka Milestone 44)

Make a function to estimate selectivity using the data structure specified in previous milestone.

Make range_selectivity() accept key_range parameters.

(currently, they accept fractions, which is only suitable for binary histograms)
This means Histogram_binary will need to have access to min_value and max_value to compute the fractions.

Attachments

Issue Links

causes

MDEV-26709 JSON histogram may contain bucketS than histogram_size allows

Closed

MDEV-26710 Histogram field in mysql.column_stats is too short, JSON histograms get corrupt

Closed

MDEV-26711 JSON histograms become invalid due to not properly quoted values

Closed

MDEV-26718 Query with: Cross join, Non-indexed column comparison, Column with histogram: speed varies 20x depending on histogram type

Closed

MDEV-26724 Endless loop in json_escape_to_string upon collecting JSON histograms with empty string in a column

Closed

MDEV-26737 Outdated VARIABLE_COMMENT for HISTOGRAM_TYPE in I_S.SYSTEM_VARIABLES and help

Closed

MDEV-26750 Estimation for filtered rows is far off with JSON_HB histogram

Closed

MDEV-26751 Deficiencies in MTR coverage for JSON histograms

Closed

MDEV-26801 Valgrind/MSAN errors in Column_statistics_collected::finish, main.statistics_json fails

Closed

MDEV-30097 Assertion `low == (int)buckets.size()-1 || field->key_cmp((uchar*)buckets[low+1].start_value.data(), lookup_val)> 0' failed in Histogram_json_hb::find_bucket

Confirmed

includes

MDEV-26519 JSON Histograms: improve histogram collection

Closed

relates to

MDEV-26125 Histograms: limited histogram precision causes imprecise estimates

Closed

MDEV-26589 Assertion failure upon DECODE_HISTOGRAM with NULLs in first column

Closed

MDEV-26590 Stack smashing/buffer overflow in Histogram_json_hb::parse upon UPDATE on table with long VARCHAR

Closed

MDEV-26592 Assertion failure in set_warning_truncated_wrong_value / Column_stat::load_histogram after failed constraint creation

Closed

MDEV-26595 ASAN use-after-poison my_strnxfrm_simple_internal / Histogram_json_hb::range_selectivity with CHAR column

Closed

MDEV-27062 Make histogram_type=JSON_HB the new default

Closed

MDEV-27566 Benchmark-like test for JSON_HB histograms

Open

MDEV-21131 Histograms: Most-Common-Values histograms

Open

MDEV-26849 JSON Histograms: point selectivity estimates are off for non-existent values

Closed

(5 causes, 1 includes, 9 relates to)

Activity

Ascending order - Click to sort in descending order

View 16 older comments

Sergei Petrunia added a comment - 2021-08-31 15:04 - edited

elenst, documentation for testing:

histogram_type variable is used to specify what kind of histogram is collected.

After this patch, it can be set to a new value: JSON_HB.

Then, EITS code will create JSON histograms. Old histogram types remain supported, the optimizer will use whatever histogram is available.

The histograms are kept in mysql.column_stats, the columns of interest are:

hist_type (SINGLE_PREC_HB, DOUBLE_PREC_HB, now also JSON_HB)
histogram. for JSON_HB, this is a JSON document.

the function DECODE_HISTOGRAM also "supports" JSON_HB by returning the histogram unmodified. That is, one can do this:

select DECODE_HISTOGRAM(hist_type, histgram) from mysql.column_stats

and get a readable representation of histogram for any kind of histogram.

Things to test

Collection of histograms. ANALYZE ... PERSISTENT FOR... statements overlapping with other workload. Histogram memory management has changed significantly (This also affects old, binary histograms).
Use by the optimizer. You can look into the MTR tests and find examples like this:

analyze select * from tbl where col1= const

analyze select * from tbl where col1< const

analyze select * from tbl where col1 between const1 and const2

There is only one column col1 which is not indexed. The interesting parts of the output are filtered (prediction from the histogram) and r_filtered (actual value).

There will be differences in estimates with binary histograms. The estimates should be generally better. Rounding-error-level regressions are acceptable. An estimate that is significantly worse than one from the binary histogram warrants an investigation.

Sergei Petrunia added a comment - 2021-08-31 15:04 - edited elenst , documentation for testing: histogram_type variable is used to specify what kind of histogram is collected. After this patch, it can be set to a new value: JSON_HB. Then, EITS code will create JSON histograms. Old histogram types remain supported, the optimizer will use whatever histogram is available. The histograms are kept in mysql.column_stats, the columns of interest are: hist_type (SINGLE_PREC_HB, DOUBLE_PREC_HB, now also JSON_HB) histogram. for JSON_HB, this is a JSON document. the function DECODE_HISTOGRAM also "supports" JSON_HB by returning the histogram unmodified. That is, one can do this: select DECODE_HISTOGRAM(hist_type, histgram) from mysql.column_stats and get a readable representation of histogram for any kind of histogram. Things to test Collection of histograms. ANALYZE ... PERSISTENT FOR... statements overlapping with other workload. Histogram memory management has changed significantly (This also affects old, binary histograms). Use by the optimizer. You can look into the MTR tests and find examples like this: analyze select * from tbl where col1= const analyze select * from tbl where col1< const analyze select * from tbl where col1 between const1 and const2 There is only one column col1 which is not indexed. The interesting parts of the output are filtered (prediction from the histogram) and r_filtered (actual value). There will be differences in estimates with binary histograms. The estimates should be generally better. Rounding-error-level regressions are acceptable. An estimate that is significantly worse than one from the binary histogram warrants an investigation.

Sergei Petrunia added a comment - 2021-09-01 12:40

The issue in MySQL that I was mentioning: https://bugs.mysql.com/bug.php?id=104789

Sergei Petrunia added a comment - 2021-09-01 12:40 The issue in MySQL that I was mentioning: https://bugs.mysql.com/bug.php?id=104789

Sergei Petrunia added a comment - 2021-09-05 08:52

Pushed more cleanup patches to https://github.com/MariaDB/server/tree/bb-10.7-mdev21130 . Buildbot is clean.

Sergei Petrunia added a comment - 2021-09-05 08:52 Pushed more cleanup patches to https://github.com/MariaDB/server/tree/bb-10.7-mdev21130 . Buildbot is clean.

Ian Gilfillan added a comment - 2021-09-22 10:46

Descriptions in the server for system variables need updating as well, for example histogram_type still states "Specifies type of the histograms created by ANALYZE. Possible values are: SINGLE_PREC_HB - single precision height-balanced, DOUBLE_PREC_HB - double precision height-balanced." even though JSON_HB is valid as well.

Ian Gilfillan added a comment - 2021-09-22 10:46 Descriptions in the server for system variables need updating as well, for example histogram_type still states "Specifies type of the histograms created by ANALYZE. Possible values are: SINGLE_PREC_HB - single precision height-balanced, DOUBLE_PREC_HB - double precision height-balanced." even though JSON_HB is valid as well.

Sergei Petrunia added a comment - 2022-01-21 13:28

Pushed into 10.8. JSON_HB histograms are not enabled by default.

Sergei Petrunia added a comment - 2022-01-21 13:28 Pushed into 10.8. JSON_HB histograms are not enabled by default.

People

Assignee:: Sergei Petrunia

Reporter:: Sergei Petrunia

Votes:: 0 Vote for this issue

Watchers:: 8 Start watching this issue

Dates

Created:: 2019-11-23 11:48

Updated:: 2023-03-21 15:35

Resolved:: 2022-01-21 13:28

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server