MariaDB Server / MDEV-26519

JSON Histograms: improve histogram collection

Details

    Description

      The histograms as defined in MDEV-21130 follow the approach used by SINGLE_PREC_HB/DOUBLE_PREC_HB histograms and put the bucket bound exactly after
      a certain fraction of rows.

      1. Popular values should have their own buckets

      For example, consider a table with these values (each character is one row):

        aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
        |     |     |     |     |     |     |     |     |
        0     1     2     3     4     5     6     7     8
      

      A histogram with 8 buckets will have these buckets:

      1. a-a
      2. a-a
      3. a-a
      4. a-a
      5. a-b
      6. b-f
      7. f-i
      8. i-q
      

      The estimation function will be able to see that 'a' is a popular value, but it will only determine its frequency with bucket_size precision.

      However, a better approach would be to put all 'a' values in one bucket:

        aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
        |                           |  |  |  |  |  |  | |
        0                           1  2  3  4  5  6  7 8
      

      This will give a more precise estimate for the value 'a'.
      It will also leave more buckets for the other values.
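
      A minimal sketch of this collection strategy in Python (illustrative only, not the server's actual code; the helper name and the equal-fraction bucket size are assumptions):

        from collections import Counter

        def build_buckets(sorted_values, n_buckets):
            # Equi-height bucketing, except that a value whose frequency
            # exceeds one bucket's share of rows gets a dedicated bucket.
            total = len(sorted_values)
            bucket_size = total / n_buckets
            freq = Counter(sorted_values)
            buckets = []                      # (low, high) bounds
            cur_low, cur_count = None, 0
            i = 0
            while i < total:
                v = sorted_values[i]
                if freq[v] > bucket_size:     # popular value: own bucket
                    if cur_count:             # flush partially filled bucket
                        buckets.append((cur_low, sorted_values[i - 1]))
                        cur_low, cur_count = None, 0
                    buckets.append((v, v))
                    i += freq[v]
                else:
                    if cur_count == 0:
                        cur_low = v
                    cur_count += 1
                    if cur_count >= bucket_size:
                        buckets.append((cur_low, v))
                        cur_low, cur_count = None, 0
                    i += 1
            if cur_count:
                buckets.append((cur_low, sorted_values[-1]))
            return buckets

        # The example above: 28 'a' rows plus the tail, 8 requested buckets.
        data = sorted("a" * 28 + "bbbbccdefghijkilmnopq")
        print(build_buckets(data, 8))   # 'a' occupies a single (a, a) bucket

      With this layout the estimate for 'a' becomes exact (28/49 of the rows), and the buckets no longer spent on 'a' are available for the tail values.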

      Other databases

      • MySQL does something like this.
      • PostgreSQL collects MCVs (Most Common Values) and removes them from consideration before building the histogram.

      2. Store the number of distinct values in each bucket

      Again, following MySQL here: store the number of distinct values in each bucket.
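
      For illustration, a bucket carrying a distinct-value count lets the estimator refine equality selectivity within the bucket (the key names below are approximate JSON_HB-style names, not a specification of the on-disk format):

        # One bucket covering 20% of the rows with 4 distinct values in it
        # ("start"/"size"/"ndv" are approximate key names):
        bucket = {"start": "b", "size": 0.20, "ndv": 4}

        # Average selectivity of "col = const" for a const inside the bucket:
        eq_selectivity = bucket["size"] / bucket["ndv"]   # 0.05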

      For strings, store only a prefix

      Store only a 40-character prefix.
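
      A sketch of the intent (the 40-character constant comes from this task; the helper below is hypothetical): bucket bounds are only used for ordering comparisons, so truncating to a prefix only risks mis-binning values that agree on the first 40 characters.

        PREFIX_LEN = 40   # per this task; the real constant lives in the server

        def bound_for(value: str) -> str:
            # Hypothetical helper: bounds only need to order values, so values
            # that differ beyond PREFIX_LEN may land in a neighboring bucket,
            # costing a little precision but much less storage.
            return value[:PREFIX_LEN]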

      Activity

            psergei Sergei Petrunia created issue -

            psergei Sergei Petrunia added a comment -

            Note for those who were asking "why is it not done like in MySQL 8": MySQL 8 has some properties that I doubt we will want to have: https://bugs.mysql.com/bug.php?id=104789

            psergei Sergei Petrunia added a comment -

            Suggestion #1:

            "Let each value that's larger than one bucket be in its own bucket"

            However, this may cause us to generate close to 2*N buckets (where N is the
            intended number of buckets). The idea is that popular values are separated
            by a small number of unpopular values.

            0123456789 0123456789 0123456789 0123456789  --  numbers, just for reference
             AAAAAAAAb bCCCCCCCCd dEEEEEEEEf fGGGGGGG..  --  data (one character is one row)
            

            • suppose bucket_size=5
            • capital letters are popular values (each takes >5 rows)
            • small letters are non-popular values

            Make each "popular" value take a bit more than bucket_size rows. The table can
            have slightly less than N such "popular" values (let it be N1)

            Between every two consecutive popular values, put a few rows with unpopular
            values. Then each popular value gets a bucket and each gap gets a bucket,
            which gives 2*N1 buckets.
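
            A sketch reproducing this construction under the naive rule (illustrative constants; naive_bucket_count is a hypothetical model of Suggestion #1, not the server code):

              from collections import Counter

              def naive_bucket_count(sorted_values, bucket_size):
                  # Naive rule from Suggestion #1: any value with more than
                  # bucket_size rows gets its own bucket; remaining rows fill
                  # equi-height buckets of bucket_size rows.
                  freq = Counter(sorted_values)
                  buckets, cur, i = 0, 0, 0
                  while i < len(sorted_values):
                      v = sorted_values[i]
                      if freq[v] > bucket_size:
                          if cur:               # flush the partially filled bucket
                              buckets, cur = buckets + 1, 0
                          buckets += 1          # dedicated bucket for popular value
                          i += freq[v]
                      else:
                          cur += 1
                          if cur >= bucket_size:
                              buckets, cur = buckets + 1, 0
                          i += 1
                  return buckets + (1 if cur else 0)

              # 8 popular values of 6 rows each, each followed by a 2-row
              # unpopular gap, as in the AAAAAAAAb bCCCCCCCCd ... picture
              # (the list is constructed already in sorted order):
              data = []
              for k in range(8):
                  data += [chr(ord('A') + 2 * k)] * 6       # popular value
                  data += [chr(ord('A') + 2 * k + 1)] * 2   # unpopular gap
              print(naive_bucket_count(data, 5))            # 16 buckets = 2 * N1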


            psergei Sergei Petrunia added a comment -

            Suggestion #2:
            Walk through the values and put a value into a separate bucket only if it
            covers the span from the current position through the whole next bucket:

            Here, D is not put into a separate bucket:

             |===============|===============|
              aaabbccDDDDDDDDDDDDDDDDDDDDDDDee 
            

            Here F is put into a separate bucket:

             |===============|===============|
              aaabbccFFFFFFFFFFFFFFFFFFFFFFFFFFF
            

            For values that take between 1 and 2 buckets, getting their own bucket
            is a matter of luck.

            A popular value that is put into its own bucket "saves" buckets, since one
            bucket then covers many rows. When this happens, the generated histogram
            will have fewer buckets than requested.

            (The alternative would be to make the subsequent buckets hold fewer rows,
            but we are not doing that in this MDEV.)
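
            A sketch of this rule in Python (my reading of the suggestion, not the actual patch; the bucket bounds and the equal-fraction bucket size are simplifications):

              from itertools import groupby

              def build_buckets(sorted_values, n_buckets):
                  # A value gets a dedicated bucket only if its run of rows
                  # covers the rest of the current bucket plus the entire
                  # next bucket.
                  total = len(sorted_values)
                  bucket_size = total / n_buckets
                  buckets = []                              # (low, high) bounds
                  cur_low, cur_high, cur_count = None, None, 0.0
                  for v, grp in groupby(sorted_values):
                      run = len(list(grp))
                      if run >= (bucket_size - cur_count) + bucket_size:
                          if cur_count:                     # close partial bucket
                              buckets.append((cur_low, cur_high))
                          buckets.append((v, v))            # dedicated bucket
                          cur_low, cur_high, cur_count = None, None, 0.0
                      else:
                          if cur_count == 0:
                              cur_low = v
                          cur_high = v
                          cur_count += run
                          while cur_count >= bucket_size:   # long runs may fill
                              buckets.append((cur_low, v))  # several buckets
                              cur_count -= bucket_size
                              cur_low = v
                  if cur_count:
                      buckets.append((cur_low, cur_high))
                  return buckets

              # 'd' does not reach the end of the next bucket: stays inline.
              print(build_buckets(sorted("aaabbcc" + "d" * 23 + "ee"), 2))
              # [('a', 'd'), ('d', 'e')]
              # 'f' covers the rest of the current bucket plus the whole next one:
              print(build_buckets(sorted("aaabbcc" + "f" * 27), 2))
              # [('a', 'c'), ('f', 'f')]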

            psergei Sergei Petrunia added a comment -

            https://github.com/MariaDB/server/tree/bb-10.7-mdev26519

            psergei Sergei Petrunia added a comment -

            The patch implements Suggestion #2 described above.
            greenman Ian Gilfillan added a comment -

            For anyone wanting to preview the feature, see https://mariadb.org/10-7-preview-feature-json-histograms/


            elenst Elena Stepanova added a comment -

            The 10.8 branch to test is preview-10.8-MDEV-26519-json-histograms
            elenst Elena Stepanova added a comment - edited

            Here are the acceptance criteria I am planning to apply to this preview (minor modifications are possible):

            1) At least 100x5min test runs for estimation precision checks should pass
            Note: this criterion is made up to stand in for absent formal requirements. The constants below were chosen arbitrarily and can be adjusted if there is demand and the interested parties agree.
            Note: "Special case" or "corner case" exceptions will be accepted, but only if they are officially documented in a form understandable to end users.
            The test is performed on a non-debug build as follows:

            • two servers with mildly randomized identical configurations are run using the same binaries from the preview branch, the test server running with JSON_HB and the baseline with DOUBLE_PREC_HB, histogram_size default;
            • the servers execute randomized single-table ANALYZE FORMAT=JSON SELECT statements on randomized table structures and data, with table sizes mostly ranging from 20 to 5000 rows;
            • only queries with a rows value >= 20 are looked at;
            • only queries for which the precision of the filtered estimation on the baseline server is within 5% are looked at;
            • an error is reported if the precision of the filtered estimation on the test server is more than 20% off (this accept/flag rule is sketched below).
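
            A sketch of that accept/flag rule (assuming the 5%/20% thresholds are relative to the actual filtered value, which is my interpretation of the criterion):

              def classify_query(rows, baseline_est, test_est, actual):
                  # Per-query rule; thresholds are relative errors against the
                  # actual 'filtered' value (an interpretation, not a spec).
                  if rows < 20:
                      return "skip"               # query is not looked at
                  if abs(baseline_est - actual) > 0.05 * actual:
                      return "skip"               # baseline not precise enough
                  if abs(test_est - actual) > 0.20 * actual:
                      return "error"              # test server too far off
                  return "pass"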

            2) At least 100x5min test runs for histogram health checks and basic server stability should show no feature-specific problems or regressions compared to vanilla 10.8.
            The stress test is performed on a debug ASAN build with heavily randomized configurations as follows:

            • the standard generic regression test combinations (random DML+DDL) are used, with the addition of histogram_type=JSON_HB and frequent ANALYZE TABLE .. PERSISTENT FOR ALL;
            • periodically, mysql.column_stats.histogram is checked for existing JSON histograms (a sketch of these checks follows this list), verifying that
              • JSON_VALID is true,
              • the actual histogram size is less than or equal to the histogram_size value,
              • the sizes of all buckets sum up to 1.
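
            A sketch of these health checks in Python (the "histogram_hb" key name and reading histogram_size as a byte limit on the stored JSON are my assumptions about the JSON_HB layout):

              import json

              def check_histogram(histogram_text, histogram_size):
                  # One mysql.column_stats.histogram value, checked as above.
                  try:
                      h = json.loads(histogram_text)        # JSON_VALID must hold
                  except ValueError:
                      return "not valid JSON"
                  if len(histogram_text) > histogram_size:  # assumed byte limit
                      return "histogram exceeds histogram_size"
                  hb = h.get("histogram_hb", []) if isinstance(h, dict) else []
                  sizes = [b["size"] for b in hb if "size" in b]
                  if abs(sum(sizes) - 1.0) > 1e-6:
                      return "bucket sizes do not sum up to 1"
                  return None                               # healthy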

            3) MTR test runs (subject to general skip lists) on MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, with --mysqld=--histogram-type=JSON_HB, may only return legacy failures and understandable mismatch/test failures caused by the non-default option.

            4) MTR test runs (subject to general skip lists) on MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, may only return legacy failures.

            Legend: as above

            Test                                         Updated     Last tested commit  Notes
            Estimation precision                         2022-01-16  348d3788            A few failures similar to those already ruled out as unrelated issues
            Health/stability                             2022-01-16  348d3788            18% of tests failed for unrelated reasons
            ASAN/MSAN/UBSAN big/nm+ps MTR                2022-01-16  746fc9e5            A few unrelated sporadic failures
            ASAN/MSAN/UBSAN big/nm+ps MTR with JSON_HB   2022-01-16  746fc9e5            A few unrelated sporadic failures

            elenst Elena Stepanova added a comment -

            I think preview-10.8-MDEV-26519-json-histograms as of commit 746fc9e5 can be pushed into 10.8 main and released with 10.8.1.

            The difference between 746fc9e5 and 348d3788 is in test files only, so tests performed on 348d3788 do not need to be re-run.

            Please note that testing of the feature was limited to its technical functionality (histogram precision) and the absence of functional/stability regressions. Due to the specifics of the feature, this does not reflect the end-user experience: users don't use histograms directly, so their precision as such doesn't concern them; what matters is the final performance of query execution, and that was not part of the feature testing, having been ruled out at early stages as irrelevant to the task scope. This is reasonable for the feature itself (MDEV-21130/MDEV-26519), but before enabling it by default (MDEV-27062) I would recommend some comparative performance testing, at least with widely recognized workloads, possibly with table structures tuned to extend histogram usage.

            psergei Sergei Petrunia added a comment - edited

            Pushed into 10.8. JSON_HB histograms are NOT enabled by default.

