MariaDB Server / MDEV-21130

Histograms: use JSON as on-disk format

Details

    Description

      Currently, histograms are stored as an array of 1-byte bucket bounds (SINGLE_PREC_HB) or 2-byte bucket bounds (DOUBLE_PREC_HB).

      The table storing the histograms supports different histogram formats but limits them to 255 bytes (hist_size is tinyint unsigned, and the histogram column is varbinary(255)).

      CREATE TABLE mysql.column_stats (
        min_value varbinary(255) DEFAULT NULL, 
        max_value varbinary(255) DEFAULT NULL, 
        ...
        hist_size tinyint unsigned, 
        hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'), 
        histogram varbinary(255), 
        ...
      

      This prevents us from supporting other kinds of histograms.

      The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

      The idea of this MDEV is to switch to JSON as the storage format for histograms.

      If we do that, it will:

      • Improve the histogram precision
      • Allow the DBAs to examine the histograms
      • Enable other histogram types to be collected/used.

      Milestone-1:

      Let histogram_type have another possible value, tentatively named "JSON".
      When it is set, let ANALYZE TABLE collect a dummy JSON "histogram":

        { "hello":"world"}
      

      that is, the following should work:

      set histogram_type='json';
      analyze table t1 persistent for all;
      select histogram from mysql.column_stats where table_name='t1' ;
      

      this should produce {"hello":"world"}.

      Milestone-2: produce JSON with the histogram.

      The exact format is not specified. For now, print the bucket endpoints and produce output like this:

      [
        "value1",
        "value2",
        ...
      ]
      
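The endpoint-dump step can be sketched as a stand-alone helper. This is illustrative only: function names are hypothetical, and the real server code would use its own String/JSON machinery rather than std::string.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: append a value as a JSON string, escaping
// quotes and backslashes.
static void append_json_string(std::string *out, const std::string &v)
{
  out->append("\"");
  for (char c : v)
  {
    if (c == '"' || c == '\\')
      out->push_back('\\');
    out->push_back(c);
  }
  out->append("\"");
}

// Serialize bucket endpoints as a JSON array of strings,
// e.g. ["value1","value2"].
std::string histogram_to_json(const std::vector<std::string> &bounds)
{
  std::string out= "[";
  for (size_t i= 0; i < bounds.size(); i++)
  {
    if (i)
      out.append(",");
    append_json_string(&out, bounds[i]);
  }
  out.append("]");
  return out;
}
```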

      Milestone-2, part#2: make mysql.column_stats.histogram a blob.

      Milestone-3: Parse the JSON back into an array

      Figure out how to use the JSON parser.
      Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
      (Additional input was provided on Zulip regarding parsing valid/invalid JSON histograms.)
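A self-contained sketch of this parse step, using a minimal hand-rolled scanner for a flat JSON array of strings in place of the server's own JSON parser. As described above, it prints the parsed values to stderr; it returns false on malformed input so the caller can report the error.

```cpp
#include <cctype>
#include <cstdio>
#include <string>
#include <vector>

// Illustrative parser for a flat JSON array of strings such as
// ["value1","value2"]. Prints each parsed value to stderr and
// returns false on a parse error.
bool parse_histogram_json(const std::string &json,
                          std::vector<std::string> *out)
{
  size_t i= 0;
  auto skip_ws= [&]() {
    while (i < json.size() && isspace((unsigned char) json[i])) i++;
  };
  skip_ws();
  if (i >= json.size() || json[i++] != '[')
    return false;                          // expected '['
  skip_ws();
  if (i < json.size() && json[i] == ']')
    return true;                           // empty array
  for (;;)
  {
    skip_ws();
    if (i >= json.size() || json[i++] != '"')
      return false;                        // expected a string
    std::string val;
    while (i < json.size() && json[i] != '"')
    {
      if (json[i] == '\\' && i + 1 < json.size())
        i++;                               // unescape \" and \\
      val.push_back(json[i++]);
    }
    if (i >= json.size())
      return false;                        // unterminated string
    i++;                                   // consume closing '"'
    fprintf(stderr, "parsed value: %s\n", val.c_str());
    out->push_back(val);
    skip_ws();
    if (i < json.size() && json[i] == ',') { i++; continue; }
    if (i < json.size() && json[i] == ']') return true;
    return false;                          // expected ',' or ']'
  }
}
```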

      Milestone-4: Make the code support different kinds of Histograms

      Currently, there's only one type of histogram.

      Smaller issue: histogram lookup functions assume the histogram stores fractions, not values.
      Bigger issue: memory allocation for histograms is decoupled from reading the histograms. See alloc_statistics_for_table, read_histograms_for_table.

      The histogram object lives in a data structure that is bzero'ed first and then filled later (IIRC there was a bug, since fixed, where the optimizer attempted to use a bzero'ed histogram).

      Can histograms be collected or loaded in parallel by several threads? This was an (unintentional?) possibility, but it was later disabled (see the TABLE_STATISTICS_CB object and its use).

      Step #0: Make Histogram a real class

      Here's the commit:
      https://github.com/MariaDB/server/commit/3ac32917ab6c42a5a0f9ed817dd8d3c7e20ce34d

      Step 1: Separate classes for binary and JSON histograms

      Need to introduce

      class Histogram -- interface, no data members.
      class Histogram_binary : public Histogram
      class Histogram_json : public Histogram
      

      and a factory function

      Histogram *create_histogram(Histogram_type)
      

      For now, let Histogram_json::point_selectivity() and Histogram_json::range_selectivity() return 0.1 and 0.5, respectively.
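The class split and factory described in Step 1 might look like the sketch below. The class and function names follow the ticket; the method signatures and the enum are hypothetical placeholders, and only the JSON stubs' return values (0.1 and 0.5) come from the spec above.

```cpp
// Hypothetical histogram-type tag; the real server enum differs.
enum Histogram_type { SINGLE_PREC_HB, DOUBLE_PREC_HB, JSON_HB };

class Histogram                 // interface, no data members
{
public:
  virtual double point_selectivity(double pos, double avg_sel)= 0;
  virtual double range_selectivity(double min_pos, double max_pos)= 0;
  virtual ~Histogram() {}
};

class Histogram_binary : public Histogram
{
public:
  // Placeholders standing in for the existing fraction-based logic.
  double point_selectivity(double pos, double avg_sel) override
  { return avg_sel; }
  double range_selectivity(double min_pos, double max_pos) override
  { return max_pos - min_pos; }
};

class Histogram_json : public Histogram
{
public:
  // Fixed stub values per the ticket, until real estimation exists.
  double point_selectivity(double, double) override { return 0.1; }
  double range_selectivity(double, double) override { return 0.5; }
};

// Factory: choose the concrete class from the stored histogram type.
Histogram *create_histogram(Histogram_type type)
{
  if (type == JSON_HB)
    return new Histogram_json();
  return new Histogram_binary();
}
```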

      Step 2: Demonstrate saving/loading of histograms

      At this point, the code can already:

      • collect a JSON histogram and save it;
      • when loading a histogram, determine from the histogram_type column that a JSON histogram is being loaded, create a Histogram_json object, and invoke the parse function.

      The parse function currently only prints to stderr.
      However, we should catch parse errors and make sure they are reported to the client.
      The test may look like this:

      INSERT INTO mysql.column_stats VALUES('test','t1','column1', .... '[invalid, json, data']);
      FLUSH TABLES;
      # this should print some descriptive text
      --error NNNN
      select * from test.t1;
      

      Milestone-5: Parse the JSON data into a structure that allows lookups.

      The structure is

      std::vector<std::string>
      

      and it holds the data in KeyTupleFormat (see the code comments for the reasoning; there was a suggestion to use in_vector, which is what IN subqueries use, but it didn't work out).
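A lookup over such a sorted vector can be sketched with std::lower_bound. Here plain std::string values stand in for KeyTupleFormat key images; the function name is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// bounds: bucket bounds, sorted in key order (KeyTupleFormat images
// in the server; plain strings here). Returns the index of the
// bucket the lookup key falls into: the first bound >= key.
size_t find_bucket(const std::vector<std::string> &bounds,
                   const std::string &key)
{
  return std::lower_bound(bounds.begin(), bounds.end(), key) -
         bounds.begin();
}
```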

      Milestone 5.1 (aka Milestone 44)

      Make a function to estimate selectivity using the data structure specified in the previous milestone.

      Make range_selectivity() accept key_range parameters.

      (Currently it accepts fractions, which is only suitable for binary histograms.)
      This means Histogram_binary will need access to min_value and max_value to compute the fractions.
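Under an equal-height-buckets assumption (each bucket holds roughly 1/N of the rows), the endpoint-based estimate might be sketched as below. The strings are simplified stand-ins for the key_range endpoints, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Estimate the fraction of rows in [min_key, max_key], assuming
// equal-height buckets: each endpoint's position is the fraction of
// buckets below it, and the range selectivity is the difference.
double range_selectivity(const std::vector<std::string> &bounds,
                         const std::string &min_key,
                         const std::string &max_key)
{
  if (bounds.empty())
    return 1.0;                 // no histogram: assume everything
  double n= (double) bounds.size();
  double min_pos=
    (std::lower_bound(bounds.begin(), bounds.end(), min_key) -
     bounds.begin()) / n;
  double max_pos=
    (std::upper_bound(bounds.begin(), bounds.end(), max_key) -
     bounds.begin()) / n;
  return std::max(0.0, max_pos - min_pos);
}
```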


          Activity

            psergei Sergei Petrunia created issue -

            Dhruv123 Dhruv Aggarwal added a comment - Hi there! I would like to work on this issue and then submit my proposal. How can I configure my system to work on this?
            serg Sergei Golubchik added a comment - Did you see https://mariadb.org/get-involved/getting-started-for-developers/get-code-build-test/ ?

            psergei Sergei Petrunia added a comment - Ok this task is selected for Google Summer of Code 2021. The student is Michael Okoko, the mentor is me.
            psergei Sergei Petrunia added a comment - Discussion: https://lists.launchpad.net/maria-developers/msg12743.html https://lists.launchpad.net/maria-developers/msg12757.html
            psergei Sergei Petrunia added a comment - Code for Milestone-1 is here: https://github.com/MariaDB/server/pull/1854 . It still fails the tests.
            psergei Sergei Petrunia added a comment - Code for Milestone-2: https://github.com/MariaDB/server/pull/1871 . It has a memory leak in class Histogram but this will be fixed in the subsequent milestones.

            psergei Sergei Petrunia added a comment - On the question of "How does 'col IN (const1,const2,...)' store its lookup array":

            It is in item_cmpfunc.h: class in_vector, with type-specific descendants like in_string, in_longlong, in_timestamp. It is created through this call:

            array= m_comparator.type_handler()->make_in_vector(thd, this, arg_count - 1);
            

            The storage format is specific to each datatype and is not explicitly defined.

            The interface for populating the array and making lookups uses Item objects:

              virtual void set(uint pos,Item *item)=0;
              bool find(Item *item);
            
            


            psergei Sergei Petrunia added a comment - One thing it doesn't have is that there's no way to get the values back from the array.
            psergei Sergei Petrunia made changes -
            Description Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back
            (DRAFT)
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 into something that allows binary search:

            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            h2. Milestone-4
            TODO
            Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
            psergei Sergei Petrunia made changes -
            psergei Sergei Petrunia made changes -
            Description Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON"
            when that is set, let ANALYZE TABLE syntax collect a JSON "histogram"
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persisent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*)- the exact format is not specified, for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
            Currently, histograms are stored as array of 1-byte bucket bounds (SINGLE_PREC_HB) or or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

Let histogram_type have another possible value, tentatively named "JSON".
When it is set, let the ANALYZE TABLE syntax collect a JSON "histogram":
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
analyze table t1 persistent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

(*) the exact format is not specified; for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
Parse back the JSON data produced in Milestone-2. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Make the code support different kinds of Histograms
            Currently, there's only one type of histogram.

Smaller issue: histogram lookup functions assume the histogram stores fractions,
not values.

Bigger issue: memory allocation for histograms is decoupled from reading
the histograms. See alloc_statistics_for_table, read_histograms_for_table.

The histogram object lives in a data structure that is bzero'ed first and
then filled later (IIRC there was a bug (fixed) where the optimizer attempted
to use a bzero'ed histogram).

Can histograms be collected or loaded in parallel by several threads?
This was an (unintentional?) possibility, but then it was disabled (see the
TABLE_STATISTICS_CB object and its use).


            h2. Milestone-5: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
psergei Sergei Petrunia made changes -
Labels: eits gsoc20 gsoc21 → RM_107_OPTIMIZER_V1 eits gsoc20 gsoc21
psergei Sergei Petrunia made changes -
Labels: RM_107_OPTIMIZER_V1 eits gsoc20 gsoc21 → eits gsoc20 gsoc21
serg Sergei Golubchik made changes -
Fix Version/s: 10.7 [ 24805 ]
serg Sergei Golubchik made changes -
Priority: Major [ 3 ] → Critical [ 2 ]
ralf.gebhardt Ralf Gebhardt made changes -
Due Date: 2021-09-14

            Some notes from debugging

            === Histogram lifecycle ===

=== Loading histogram from mysql.column_stats ===

            alloc_statistics_for_table_share() - allocates stats data structures but
            doesn't read them.

            read_statistics_for_table() - loads the statistics.

            Note that this is done in two steps: first, column stats are loaded, including
            histogram_type and histogram_size. Then, the histogram itself is loaded.
            The second step even does a separate read from mysql.column_stats!

read_histograms_for_table() - called after read_statistics_for_table(), reads the
histograms.

The loading is synchronized through TABLE_STATISTICS_CB (only one thread does
the loading).

            Also, TABLE_STATISTICS_CB ensures that histograms are only read once (if there
            are two TABLE objects, they'll share the same field->read_stats and so
            Histogram object).

            === Collecting histogram ===

            alloc_statistics_for_table() is used to allocate statistical data structures.
            allocation is done on TABLE's MEM_ROOT.

            collect_statistics_for_table() - is used to collect the statistics.

            (Note: it looks like it's possible for different threads to collect statistics
            for different fields simultaneously?)

Then, Column_stat::store_stat_fields() saves the data into the table.

            Then, the histogram is read back from disk before it is used.

            psergei Sergei Petrunia added a comment - Patch for Milestone-3: https://github.com/MariaDB/server/pull/1875
            psergei Sergei Petrunia made changes -
Description
Currently, histograms are stored as an array of 1-byte bucket bounds (SINGLE_PREC_HB) or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

Let histogram_type have another possible value, tentatively named "JSON".
When it is set, let the ANALYZE TABLE syntax collect a JSON "histogram":
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
analyze table t1 persistent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

(*) the exact format is not specified; for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
Parse back the JSON data produced in Milestone-2. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Make the code support different kinds of Histograms
            Currently, there's only one type of histogram.

Smaller issue: histogram lookup functions assume the histogram stores fractions, not values.
Bigger issue: memory allocation for histograms is decoupled from reading the histograms. See alloc_statistics_for_table, read_histograms_for_table.

The histogram object lives in a data structure that is bzero'ed first and then filled later (IIRC there was a bug (fixed) where the optimizer attempted to use a bzero'ed histogram).

Can histograms be collected or loaded in parallel by several threads? This was an (unintentional?) possibility, but then it was disabled (see the TABLE_STATISTICS_CB object and its use).

            h3. Step #0: Make Histogram a real class
            Here's the commit:
            https://github.com/MariaDB/server/commit/3ac32917ab6c42a5a0f9ed817dd8d3c7e20ce34d

            h3. Step 1: Separate classes for binary and JSON histograms

            Need to introduce
            {code}
            class Histogram -- interface, no data members.
            class Histogram_binary : public Histogram
            class Histogram_json : public Histogram
            {code}
            and a factory function
            {code}
            Histogram *create_histogram(Histogram_type)
            {code}

            for now, let Histogram_json::point_selectivity() and Histogram_json::range_selectivity() return 0.1 and 0.5, respectively.

h3. Step 2: Make point_selectivity and range_selectivity accept endpoints as parameters.

            (currently, they accept fractions, which is only suitable for binary histograms)
            This means Histogram_binary will need to have access to min_value and max_value to compute the fractions.

            h3. Step 3: Separate histogram construction from usage


            h2. Milestone-5: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
            psergei Sergei Petrunia added a comment - Patch for Milestone-4.1: https://github.com/idoqo/server/pull/2/
            psergei Sergei Petrunia made changes -
Description
Currently, histograms are stored as an array of 1-byte bucket bounds (SINGLE_PREC_HB) or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 256 bytes (hist_size is tinyint).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

Let histogram_type have another possible value, tentatively named "JSON".
When it is set, let the ANALYZE TABLE syntax collect a JSON "histogram":
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
analyze table t1 persistent for all;
            select histogram from mysql.column_stats where table_name='t1' ;
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

(*) the exact format is not specified; for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
Parse back the JSON data produced in Milestone-2. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)

            h2. Milestone-4: Make the code support different kinds of Histograms
            Currently, there's only one type of histogram.

Smaller issue: histogram lookup functions assume the histogram stores fractions, not values.
Bigger issue: memory allocation for histograms is decoupled from reading the histograms. See alloc_statistics_for_table, read_histograms_for_table.

The histogram object lives in a data structure that is bzero'ed first and then filled later (IIRC there was a bug (fixed) where the optimizer attempted to use a bzero'ed histogram).

Can histograms be collected or loaded in parallel by several threads? This was an (unintentional?) possibility, but then it was disabled (see the TABLE_STATISTICS_CB object and its use).

            h3. Step #0: Make Histogram a real class
            Here's the commit:
            https://github.com/MariaDB/server/commit/3ac32917ab6c42a5a0f9ed817dd8d3c7e20ce34d

            h3. Step 1: Separate classes for binary and JSON histograms

            Need to introduce
            {code}
            class Histogram -- interface, no data members.
            class Histogram_binary : public Histogram
            class Histogram_json : public Histogram
            {code}
            and a factory function
            {code}
            Histogram *create_histogram(Histogram_type)
            {code}

            for now, let Histogram_json::point_selectivity() and Histogram_json::range_selectivity() return 0.1 and 0.5, respectively.

            h3. Step 2: Demonstrate saving/loading of histograms
            Now, the code already can:
            - collect a JSON histogram and save it.
- when loading a histogram, determine from the {{histogram_type}} column that a JSON histogram is being loaded, create a Histogram_json object, and invoke the parse function.

The parse function at the moment only prints to stderr.
            However, we should catch parse errors and make sure they are reported to the client.
            The test may look like this:
            {code}
            INSERT INTO mysql.column_stats VALUES('test','t1','column1', .... '[invalid, json, data']);
            FLUSH TABLES;
# this should print some descriptive text
            --error NNNN
            select * from test.t1;
            {code}

h3. Step 3: Make point_selectivity and range_selectivity accept endpoints as parameters.

            (currently, they accept fractions, which is only suitable for binary histograms)
            This means Histogram_binary will need to have access to min_value and max_value to compute the fractions.



            h3. Step 4: Separate histogram construction from usage (?)
            TODO

            h2. Milestone-5: Parse the JSON data into a structure that allows lookups.

            Candidates:
            {code}
            std::vector<std::string>
            std::vector<String>
            {code}

            Input from Igor at optimizer call: check out how IN-subqueries store their lookup arrays. They use a format that's neither KeyTupleFormat nor table->record format. Could/Should we use that format?
            TODO
psergei Sergei Petrunia made changes -
Status: Open [ 1 ] → In Progress [ 3 ]
            psergei Sergei Petrunia added a comment - - edited

            More on the in-memory data structure for lookups.

            It doesn't have to have anything to do with collection-time data structure.

It is basically an ordered array (either of bucket endpoints or of some
structures that contain endpoints plus something else).

            The ordered array will be used to make lookups into it, so that we can find the
            buckets that overlap with the range we're computing the estimate for (denote as
            $SEARCH_RANGE).

            It is likely that $SEARCH_RANGE partially overlaps with a bucket, that is

            $BUCKET_MIN < $SEARCH_ENDP < $BUCKET_MAX.

            In this case, we'll follow existing histograms and assume that $SEARCH_RANGE
            occupies a "proportional" part of the bucket.

Currently, the fraction is calculated as follows:

{code}
store_key_image_to_rec(field, (uchar *) $SEARCH_ENDP,
                       field->key_length());
pos = field->pos_in_interval(min_value,
                             max_value);
{code}

            Here, the min_value and max_value have the bounds.

They are defined as
{code}
Field * Column_statistics::min_value
{code}

and pos_in_interval is defined as:
{code}
double pos_in_interval(Field *min, Field *max);
{code}

            There are two implementations:
            pos_in_interval_val_real(), which gets min/max values by calling min/max->val_real() and then operating on the numbers it has got.

pos_in_interval_val_str(), which gets min and max values by calling
{code}
strxfrm(..., target_size=8, min->ptr + data_offset, min->data_length())
{code}
and then operating on 8-byte prefixes as integers.

            psergei Sergei Petrunia added a comment - - edited

            Arguments against in_vector data structure:

            1. Its methods to populate/make lookups use "Item *item" as parameters. This will need to change to also allow Field*.
            2. The find() method currently only supports equality lookups. We need [X;Y) open/closed ranges.
3. The values are stored in some custom format. The format allows comparing values, but it is unclear how one could implement an analog of Field::pos_in_interval() for it.
            4. in_vector is not flexible - there's no way to store any extra data for a bucket.

            Arguments for using it: we will be able to reuse
            1. Code that allocates the array
            2. Encoding functions

            Not much.


Leaning towards using a std::vector<KeyTupleFormat> data structure.

            Here is a patch https://github.com/MariaDB/server/commit/4d3c434028649a0f14471ed4a1a4c10b97b78eb8 that shows how to do all needed operations:

- JSON text -> binary conversion
- Comparing two values (so one can find the bucket)
- Computing pos_in_interval(A,B,X) - for A <= X <= B, return a fraction specifying how close X is to the endpoints of the [A,B] interval.

            Patch usage: https://gist.github.com/spetrunia/a09bd451f650737ba5ea7a6e51bc797a

            psergei Sergei Petrunia made changes -
            Currently, histograms are stored as an array of 1-byte bucket bounds (SINGLE_PREC_HB) or 2-byte bucket bounds (DOUBLE_PREC_HB).

            The table storing the histograms supports different histogram formats but limits them to 255 bytes (hist_size is a tinyint and histogram is a varbinary(255)).

            {code:sql}
            CREATE TABLE mysql.column_stats (
              min_value varbinary(255) DEFAULT NULL,
              max_value varbinary(255) DEFAULT NULL,
              ...
              hist_size tinyint unsigned,
              hist_type enum('SINGLE_PREC_HB','DOUBLE_PREC_HB'),
              histogram varbinary(255),
              ...
            {code}

            This prevents us from supporting other kinds of histograms.

            The first low-hanging fruit would be to store the histogram bucket bounds precisely (like MySQL and PostgreSQL do, for example).

            The idea of this MDEV is to switch to JSON as storage format for histograms.

            If we do that, it will:
            - Improve the histogram precision
            - Allow the DBAs to examine the histograms
            - Enable other histogram types to be collected/used.

            h2. Milestone-1:

            Let histogram_type have another possible value, tentative name "JSON".
            When that is set, let the ANALYZE TABLE syntax collect a JSON "histogram":
            {code}
              { "hello":"world"}
            {code}
            that is, the following should work:

            {code:sql}
            set histogram_type='json';
            analyze table t1 persistent for all;
            select histogram from mysql.column_stats where table_name='t1';
            {code}
            this should produce \{"hello":"world"\}.

            h2. Milestone-2: produce JSON with histogram(*).

            (*) The exact format is not specified; for now, print the bucket endpoints and produce output like this:

            {code}
            [
              "value1",
              "value2",
              ...
            ]
            {code}
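Producing that output is just serializing the collected endpoint strings with JSON escaping. A minimal sketch (the server would use its own JSON writer; the helper names here are hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Escape the two characters that must be escaped inside a JSON string.
static std::string escape_json(const std::string &s)
{
  std::string out;
  for (char c : s)
  {
    if (c == '"' || c == '\\')
      out += '\\';
    out += c;
  }
  return out;
}

// Serialize bucket endpoints into the JSON array format shown above.
static std::string endpoints_to_json(const std::vector<std::string> &bounds)
{
  std::string out= "[\n";
  for (size_t i= 0; i < bounds.size(); i++)
  {
    out += "  \"" + escape_json(bounds[i]) + "\"";
    if (i + 1 < bounds.size())
      out += ",";
    out += "\n";
  }
  out += "]";
  return out;
}
```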

            Milestone-2, part#2: make mysql.column_stats.histogram a blob.

            h2. Milestone-3: Parse the JSON back into an array
            Figure out how to use the JSON parser.
            Parse the JSON data produced in Milestone-2 back. For now, just print the parsed values to stderr.
            (Additional input provided on Zulip re parsing valid/invalid JSON histograms)
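To show what "parse the JSON back" amounts to, here is a hand-rolled sketch that only accepts the flat ["v1","v2",...] format from Milestone-2 and rejects everything else. The real implementation would use the server's bundled JSON parser rather than this toy scanner.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Parse a flat JSON array of strings; return false on malformed input.
static bool parse_bounds_json(const std::string &in,
                              std::vector<std::string> *out)
{
  size_t i= 0;
  auto skip_ws= [&]() {
    while (i < in.size() && isspace((unsigned char) in[i])) i++;
  };
  skip_ws();
  if (i >= in.size() || in[i++] != '[')
    return false;
  skip_ws();
  if (i < in.size() && in[i] == ']')
    return ++i, true;                      // empty array
  for (;;)
  {
    skip_ws();
    if (i >= in.size() || in[i++] != '"')
      return false;
    std::string val;
    while (i < in.size() && in[i] != '"')
    {
      if (in[i] == '\\' && i + 1 < in.size())
        i++;                               // unescape \" and \\
      val += in[i++];
    }
    if (i >= in.size())
      return false;                        // unterminated string
    i++;                                   // closing quote
    out->push_back(val);
    skip_ws();
    if (i < in.size() && in[i] == ',') { i++; continue; }
    break;
  }
  skip_ws();
  return i < in.size() && in[i] == ']';
}
```

The "reject everything else" path is what Milestone-4 Step 2 below relies on: a parse failure must surface as an error to the client, not a crash.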

            h2. Milestone-4: Make the code support different kinds of Histograms
            Currently, there's only one type of histogram.

            Smaller issue: histogram lookup functions assume the histogram stores fractions, not values.
            Bigger issue: memory allocation for histograms is decoupled from reading the histograms; see alloc_statistics_for_table and read_histograms_for_table.

            The histogram object lives in a data structure that is bzero'ed first and filled later. (IIRC there was a bug, since fixed, where the optimizer attempted to use a bzero'ed histogram.)

            Can histograms be collected or loaded in parallel by several threads? This was an (unintentional?) possibility, but it was later disabled (see the TABLE_STATISTICS_CB object and its use).

            h3. Step #0: Make Histogram a real class
            Here's the commit:
            https://github.com/MariaDB/server/commit/3ac32917ab6c42a5a0f9ed817dd8d3c7e20ce34d

            h3. Step 1: Separate classes for binary and JSON histograms

            Need to introduce
            {code}
            class Histogram -- interface, no data members.
            class Histogram_binary : public Histogram
            class Histogram_json : public Histogram
            {code}
            and a factory function
            {code}
            Histogram *create_histogram(Histogram_type)
            {code}

            For now, let Histogram_json::point_selectivity() and Histogram_json::range_selectivity() return 0.1 and 0.5, respectively.
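The split described above can be sketched as follows. This is a simplified stand-in: the method set and signatures of the real server classes differ, and the selectivity bodies are only the placeholder values named in Step 1.

```cpp
#include <cassert>
#include <memory>

enum Histogram_type { SINGLE_PREC_HB, DOUBLE_PREC_HB, JSON_HB };

// Interface with no data members, per Step 1.
class Histogram
{
public:
  virtual double point_selectivity(double pos) const= 0;
  virtual double range_selectivity(double min_pos, double max_pos) const= 0;
  virtual ~Histogram()= default;
};

class Histogram_binary : public Histogram
{
public:
  double point_selectivity(double) const override
  { return 0.0; /* placeholder; real code interpolates over bucket bounds */ }
  double range_selectivity(double min_pos, double max_pos) const override
  { return max_pos - min_pos; /* fraction-based estimate */ }
};

class Histogram_json : public Histogram
{
public:
  // Fixed placeholder estimates, as specified for Step 1.
  double point_selectivity(double) const override { return 0.1; }
  double range_selectivity(double, double) const override { return 0.5; }
};

// Factory function selecting the concrete class from the histogram type.
std::unique_ptr<Histogram> create_histogram(Histogram_type type)
{
  if (type == JSON_HB)
    return std::unique_ptr<Histogram>(new Histogram_json());
  return std::unique_ptr<Histogram>(new Histogram_binary());
}
```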

            h3. Step 2: Demonstrate saving/loading of histograms
            Now, the code already can:
            - collect a JSON histogram and save it.
            - when loading a histogram, determine from the {{histogram_type}} column that a JSON histogram is being loaded, create a Histogram_json, and invoke the parse function.

            The parse function at the moment only prints to stderr.
            However, we should catch parse errors and make sure they are reported to the client.
            The test may look like this:
            {code}
            INSERT INTO mysql.column_stats VALUES('test','t1','column1', .... '[invalid, json, data']);
            FLUSH TABLES;
            # this should print some descriptive text
            --error NNNN
            select * from test.t1;
            {code}

            h2. Milestone-5: Parse the JSON data into a structure that allows lookups.
            The structure is
            {code}
            std::vector<std::string>
            {code}
            and it holds the data in KeyTupleFormat (see the comments for the reasoning; there was a suggestion to use {{in_vector}}, which is what IN subqueries use, but it didn't work out).

            h2. Milestone 5.1 (aka Milestone 44)
            Make a function to estimate selectivity using the data structure specified in previous milestone.
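A sketch of the lookup side: with endpoints kept sorted in a std::vector&lt;std::string&gt;, finding a value's bucket is a binary search, and an equi-height point estimate is a division. Plain std::string ordering stands in for the real KeyTupleFormat comparisons here, and the function names are illustrative, not the server's.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// endpoints[i] is the upper bound of bucket i; endpoints are sorted.
static size_t find_bucket(const std::vector<std::string> &endpoints,
                          const std::string &value)
{
  // First bucket whose upper bound is >= value.
  auto it= std::lower_bound(endpoints.begin(), endpoints.end(), value);
  if (it == endpoints.end())
    return endpoints.size() - 1;           // value above all bounds
  return (size_t) (it - endpoints.begin());
}

// Equi-height assumption: each bucket holds ~1/n of the rows, so a
// point estimate for one bucket is just that bucket's fraction.
static double point_selectivity_est(const std::vector<std::string> &endpoints)
{
  return endpoints.empty() ? 0.0 : 1.0 / (double) endpoints.size();
}
```

A range estimate would combine whole buckets between the two edge buckets with pos_in_interval-style interpolation inside the edge buckets.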

            h2. Make range_selectivity() accept key_range parameters.

            (Currently, it accepts fractions, which are only suitable for binary histograms.)
            This means Histogram_binary will need to have access to min_value and max_value to compute the fractions.
            rxrog Michael Okoko added a comment -

            The progress so far is documented as part of the GSoC final report at https://gist.github.com/idoqo/ac520943e53f64034beaed4258b62ba5

            psergei Sergei Petrunia added a comment - The candidate code is at: https://github.com/MariaDB/server/tree/bb-10.7-mdev21130
            psergei Sergei Petrunia added a comment - - edited

            elenst, documentation for testing:

            The histogram_type variable is used to specify what kind of histogram is collected.

            After this patch, it can be set to a new value: JSON_HB.

            Then, the EITS code will create JSON histograms. Old histogram types remain supported; the optimizer will use whatever histogram is available.

            The histograms are kept in mysql.column_stats; the columns of interest are:

            • hist_type (SINGLE_PREC_HB, DOUBLE_PREC_HB, now also JSON_HB)
            • histogram: for JSON_HB, this is a JSON document.

            The function DECODE_HISTOGRAM also "supports" JSON_HB by returning the histogram unmodified. That is, one can do this:

            select DECODE_HISTOGRAM(hist_type, histogram) from mysql.column_stats
            

            and get a readable representation for any kind of histogram.

            Things to test

            • Collection of histograms. ANALYZE ... PERSISTENT FOR... statements overlapping with other workload. Histogram memory management has changed significantly (This also affects old, binary histograms).
            • Use by the optimizer. You can look into the MTR tests and find examples like this:

            analyze select * from tbl where col1= const
            analyze select * from tbl where col1< const
            analyze select * from tbl where col1 between const1 and const2 
            

            There is only one column, col1, which is not indexed. The interesting parts of the output are filtered (the prediction from the histogram) and r_filtered (the actual value).

            There will be differences from the estimates produced with binary histograms. The new estimates should generally be better; rounding-error-level regressions are acceptable, but an estimate that is significantly worse than one from the binary histogram warrants an investigation.


            The issue in MySQL that I was mentioning: https://bugs.mysql.com/bug.php?id=104789


            Pushed more cleanup patches to https://github.com/MariaDB/server/tree/bb-10.7-mdev21130 . Buildbot is clean.

            greenman Ian Gilfillan added a comment -

            Descriptions in the server for system variables need updating as well. For example, histogram_type still states "Specifies type of the histograms created by ANALYZE. Possible values are: SINGLE_PREC_HB - single precision height-balanced, DOUBLE_PREC_HB - double precision height-balanced." even though JSON_HB is now valid as well.


            Pushed into 10.8. JSON_HB histograms are not enabled by default.

            psergei Sergei Petrunia made changes -
            Component/s Optimizer [ 10200 ]
            Fix Version/s 10.8.1 [ 26815 ]
            Fix Version/s 10.8 [ 26121 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]