Details
- Type: Task
- Status: Closed
- Priority: Blocker
- Resolution: Fixed
Description
The histograms as defined in MDEV-21130 follow the approach used by SINGLE_PREC_HB/DOUBLE_PREC_HB histograms and put the bucket bound exactly after
a certain fraction of rows.
1. Popular values should have their own buckets
For example, consider a table with these values (each character is one row):
{code}
aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
|     |     |     |     |     |     |     |     |
0     1     2     3     4     5     6     7     8
{code}
A histogram with 8 buckets will have these buckets:
{code}
1. a-a
2. a-a
3. a-a
4. a-a
5. a-b
6. b-f
7. f-i
8. i-q
{code}
The estimation function will be able to see that 'a' is a popular value, but it can only determine its frequency with bucket_size precision.
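The construction described above can be sketched as follows. This is a simplified illustration, not MariaDB's actual implementation; exact bucket bounds depend on rounding, so the last few buckets come out slightly different from the listing above, but the effect is the same: several buckets contain only 'a'.

```python
# Sketch (not the server's code): equi-height bucket bounds are placed
# after every n/num_buckets rows of the sorted column.
def equi_height_buckets(sorted_values, num_buckets):
    n = len(sorted_values)
    buckets = []
    for i in range(num_buckets):
        lo = i * n // num_buckets            # first row of this bucket
        hi = (i + 1) * n // num_buckets - 1  # last row of this bucket
        buckets.append((sorted_values[lo], sorted_values[hi]))
    return buckets

data = sorted("aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq")
print(equi_height_buckets(data, 8))  # first four buckets are all ('a', 'a')
```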
However, a better approach would be to put all 'a' values in one bucket:
{code}
aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
|     |     |     |     |     |     |     |     |
0     1     2     3     4     5     6     7     8
{code}
This will give a more precise estimate for the value 'a'.
It will also provide more buckets for other values.
Other databases
- MySQL does something like this.
- PostgreSQL collects MCVs (Most Common Values) and removes them from consideration before building the histogram.
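The PostgreSQL-style approach can be illustrated like this. A simplified sketch, not PostgreSQL's actual ANALYZE code: pull out the most common values (MCVs) with their exact frequencies first, then build the equi-height histogram on the remaining rows only.

```python
# Sketch: MCV extraction before histogram construction.
from collections import Counter

def mcv_then_histogram(values, num_mcv, num_buckets):
    counts = Counter(values)
    # Keep exact counts for the most common values...
    mcvs = {v for v, _ in counts.most_common(num_mcv)}
    # ...and build the histogram only over the remaining rows.
    rest = sorted(v for v in values if v not in mcvs)
    n = len(rest)
    buckets = []
    for i in range(num_buckets):
        lo, hi = i * n // num_buckets, (i + 1) * n // num_buckets - 1
        buckets.append((rest[lo], rest[hi]))
    return {v: counts[v] for v in mcvs}, buckets

mcv, hist = mcv_then_histogram(
    list("aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq"), 1, 8)
print(mcv)   # 'a' gets an exact count; all 8 buckets go to the other values
print(hist)
```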
2. Store the number of distinct values in each bucket
Again, following MySQL here: store the number of distinct values in each bucket.
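A sketch of why a per-bucket distinct-value count (ndv) helps — hypothetical helper, not the actual server code: assuming uniformity among the distinct values inside a bucket, an equality lookup that falls into the bucket can be estimated as rows_in_bucket / ndv.

```python
# Sketch: buckets annotated with (start, end, row_count, ndv).
def bucket_with_ndv(sorted_values, num_buckets):
    n = len(sorted_values)
    out = []
    for i in range(num_buckets):
        chunk = sorted_values[i * n // num_buckets:(i + 1) * n // num_buckets]
        out.append((chunk[0], chunk[-1], len(chunk), len(set(chunk))))
    return out

for start, end, size, ndv in bucket_with_ndv(sorted("abbbcccddddeeeee"), 4):
    # Equality estimate inside the bucket: rows / distinct values.
    print(start, end, size, ndv, "-> est. rows for '=':", size / ndv)
```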
For strings, store only a prefix
Store only a 40-character prefix.
Attachments
Issue Links
- causes
  - MDEV-26709 JSON histogram may contain more buckets than histogram_size allows (Closed)
  - MDEV-26710 Histogram field in mysql.column_stats is too short, JSON histograms get corrupt (Closed)
  - MDEV-26711 JSON histograms become invalid due to not properly quoted values (Closed)
  - MDEV-26718 Query with: Cross join, Non-indexed column comparison, Column with histogram: speed varies 20x depending on histogram type (Closed)
  - MDEV-26724 Endless loop in json_escape_to_string upon collecting JSON histograms with empty string in a column (Closed)
  - MDEV-26751 Deficiencies in MTR coverage for JSON histograms (Closed)
  - MDEV-26801 Valgrind/MSAN errors in Column_statistics_collected::finish, main.statistics_json fails (Closed)
  - MDEV-26885 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#2) (Closed)
  - MDEV-26886 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#3) (Closed)
  - MDEV-26892 JSON histograms become invalid with a specific (corrupt) value in table (Closed)
  - MDEV-26901 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#4) (Closed)
  - MDEV-26911 Unexpected ER_DUP_KEY, ASAN errors, double free detected in tcache with JSON_HB histogram (Closed)
  - MDEV-27203 Valgrind / MSAN errors in Histogram_json_hb::parse_bucket, main.statistics_json fails (Closed)
  - MDEV-27229 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#5) (Closed)
  - MDEV-27230 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#6) (Closed)
  - MDEV-27243 Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#7) upon TIME comparison (Closed)
  - MDEV-27492 DOUBLE_PREC_HB histogram has poor estimates for BIT columns (Stalled)
  - MDEV-27497 DOUBLE_PREC_HB histogram produces wrong estimates for COMPRESSED columns (Stalled)
  - MDEV-30097 Assertion `low == (int)buckets.size()-1 || field->key_cmp((uchar*)buckets[low+1].start_value.data(), lookup_val)> 0' failed in Histogram_json_hb::find_bucket (Confirmed)
- is part of
  - MDEV-21130 Histograms: use JSON as on-disk format (Closed)
  - MDEV-27373 Q1 2022 release merge (Closed)
- relates to
  - MDEV-27472 ANALYZE: r_filtered=100 may look confusing when r_rows=0 (Stalled)
  - MDEV-27495 Less precise filtered estimation with JSON histogram (#9) or wrong value of r_filtered (Closed)
  - MDEV-26764 JSON_HB Histograms: handle BINARY and unassigned characters (Closed)
  - MDEV-26849 JSON Histograms: point selectivity estimates are off for non-existent values (Closed)
  - MDEV-27062 Make histogram_type=JSON_HB the new default (Closed)
Activity
Field | Original Value | New Value
---|---|---
Link | | This issue is part of
Summary | JSON Histograms: popular values should have their own buckets | JSON Histograms: improve histogram collection
Status | Open [ 1 ] | In Progress [ 3 ]
Priority | Major [ 3 ] | Critical [ 2 ]
Labels | | eits
Priority | Critical [ 2 ] | Major [ 3 ]
Link | This issue relates to TODO-3118 [ TODO-3118 ] |
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | | This issue relates to
Link | | This issue causes
Link | | This issue relates to
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Fix Version/s | 10.8 [ 26121 ] |
Fix Version/s | 10.7 [ 24805 ] |
Priority | Major [ 3 ] | Critical [ 2 ]
Link | | This issue relates to
Link | This issue relates to TODO-3118 [ TODO-3118 ] |
Link | This issue relates to TODO-3253 [ TODO-3253 ] |
Workflow | MariaDB v3 [ 124750 ] | MariaDB v4 [ 131851 ]
Status | In Progress [ 3 ] | In Testing [ 10301 ]
Assignee | Sergei Petrunia [ psergey ] | Elena Stepanova [ elenst ]
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | | This issue causes
Link | This issue relates to MDEV-27472 [ MDEV-27472 ] |
Link | This issue causes MDEV-27492 [ MDEV-27492 ] |
Link | | This issue relates to
Link | This issue causes MDEV-27497 [ MDEV-27497 ] |
Assignee | Elena Stepanova [ elenst ] | Sergei Golubchik [ serg ]
Status | In Testing [ 10301 ] | Stalled [ 10000 ]
Assignee | Sergei Golubchik [ serg ] | Sergei Petrunia [ psergey ]
Priority | Critical [ 2 ] | Blocker [ 1 ]
Fix Version/s | 10.8.1 [ 26815 ] |
Fix Version/s | 10.8 [ 26121 ] |
Resolution | Fixed [ 1 ] |
Status | Stalled [ 10000 ] | Closed [ 6 ]
Link | | This issue is part of
Link | This issue causes MDEV-30097 [ MDEV-30097 ] |
Labels | eits | Preview_10.8 eits
Labels | Preview_10.8 eits | Preview_10.7 Preview_10.8 eits
Note for those who were asking "why is it not done like in MySQL 8": MySQL 8 has some properties that I doubt we will want to have: https://bugs.mysql.com/bug.php?id=104789