Details
-
Task
-
Status: Closed (View Workflow)
-
Blocker
-
Resolution: Fixed
Description
The histograms as defined in MDEV-21130 follow the approach used by SINGLE_PREC_HB/DOUBLE_PREC_HB histograms and put the bucket bound exactly after
a certain fraction of rows.
1. Popular values should have their own buckets
For example, consider a table with these values (each character is one row):
aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
|
| | | | | | | | |
|
0 1 2 3 4 5 6 7 8
|
A histogram with 8 buckets will have these buckets:
1. a-a
|
2. a-a
|
3. a-a
|
3. a-a
|
5. a-b
|
6. b-f
|
7. f-i
|
8. i-q
|
The estimation function will be able to see that 'a' is a popular value, it will determine its frequency with bucket_size precision.
However, a better approach would be to put all 'a' values in one bucket:
aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
|
| | | | | | | | |
|
0 1 2 3 4 5 6 7 8
|
This will give a more precise number for table 'a'.
It will also provide more buckets for other values.
Other databases
- MySQL does something like this.
- PostgreSQL collects MCVs (Most Common Values) and removes them from consideration before building the histogram.
2. Store the number of distinct values in each bucket
Again, following MySQL here: store number of distinct values in each bucket.
For strings, store only prefix
Store only a 40-character prefix.
Attachments
Issue Links
- causes
-
MDEV-26709 JSON histogram may contain bucketS than histogram_size allows
- Closed
-
MDEV-26710 Histogram field in mysql.column_stats is too short, JSON histograms get corrupt
- Closed
-
MDEV-26711 JSON histograms become invalid due to not properly quoted values
- Closed
-
MDEV-26718 Query with: Cross join, Non-indexed column comparison, Column with histogram: speed varies 20x depending on histogram type
- Closed
-
MDEV-26724 Endless loop in json_escape_to_string upon collecting JSON histograms with empty string in a column
- Closed
-
MDEV-26751 Deficiencies in MTR coverage for JSON histograms
- Closed
-
MDEV-26801 Valgrind/MSAN errors in Column_statistics_collected::finish, main.statistics_json fails
- Closed
-
MDEV-26885 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#2)
- Closed
-
MDEV-26886 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#3)
- Closed
-
MDEV-26892 JSON histograms become invalid with a specific (corrupt) value in table
- Closed
-
MDEV-26901 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#4)
- Closed
-
MDEV-26911 Unexpected ER_DUP_KEY, ASAN errors, double free detected in tcache with JSON_HB histogram
- Closed
-
MDEV-27203 Valgrind / MSAN errors in Histogram_json_hb::parse_bucket, main.statistics_json fails
- Closed
-
MDEV-27229 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#5)
- Closed
-
MDEV-27230 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#6)
- Closed
-
MDEV-27243 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#7) upon TIME comparison
- Closed
-
MDEV-27492 DOUBLE_PREC_HB histogram has poor estimates for BIT columns
- Stalled
-
MDEV-27497 DOUBLE_PREC_HB histogram produces wrong estimates for COMPRESSED columns
- Stalled
-
MDEV-30097 Assertion `low == (int)buckets.size()-1 || field->key_cmp((uchar*)buckets[low+1].start_value.data(), lookup_val)> 0' failed in Histogram_json_hb::find_bucket
- Confirmed
- is part of
-
MDEV-21130 Histograms: use JSON as on-disk format
- Closed
-
MDEV-27373 Q1 2022 release merge
- Closed
- relates to
-
MDEV-27472 ANALYZE: r_filtered=100 may look confusing when r_rows=0
- Stalled
-
MDEV-27495 Less precise filtered estimation with JSON histogram (#9) or wrong value of r_filtered
- Closed
-
MDEV-26764 JSON_HB Histograms: handle BINARY and unassigned characters
- Closed
-
MDEV-26849 JSON Histograms: point selectivity estimates are off for non-existent values
- Closed
-
MDEV-27062 Make histogram_type=JSON_HB the new default
- Closed