The histograms as defined in MDEV-21130 follow the approach used by SINGLE_PREC_HB/DOUBLE_PREC_HB histograms and put the bucket bound exactly after
a certain fraction of rows.
1. Popular values should have their own buckets
For example, consider a table with these values (each character is one row):
aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
| | | | | | | | |
0 1 2 3 4 5 6 7 8
A histogram with 8 buckets will have these buckets:
1. a-a
2. a-a
3. a-a
3. a-a
5. a-b
6. b-f
7. f-i
8. i-q
The estimation function will be able to see that 'a' is a popular value, it will determine its frequency with bucket_size precision.
However, a better approach would be to put all 'a' values in one bucket:
aaaaaaaaaaaaaaaaaaaaaaaaaaaabbbbccdefghijkilmnopq
| | | | | | | | |
0 1 2 3 4 5 6 7 8
This will give a more precise number for table 'a'.
It will also provide more buckets for other values.
Other databases
MySQL does something like this.
PostgreSQL collects MCVs (Most Common Values) and removes them from consideration before building the histogram.
2. Store the number of distinct values in each bucket
Again, following MySQL here: store number of distinct values in each bucket.
For strings, store only prefix
Store only a 40-character prefix.
Attachments
Issue Links
causes
MDEV-26709JSON histogram may contain bucketS than histogram_size allows
Closed
MDEV-26710Histogram field in mysql.column_stats is too short, JSON histograms get corrupt
Closed
MDEV-26711JSON histograms become invalid due to not properly quoted values
Closed
MDEV-26718Query with: Cross join, Non-indexed column comparison, Column with histogram: speed varies 20x depending on histogram type
Closed
MDEV-26724Endless loop in json_escape_to_string upon collecting JSON histograms with empty string in a column
Closed
MDEV-26751Deficiencies in MTR coverage for JSON histograms
Closed
MDEV-26801Valgrind/MSAN errors in Column_statistics_collected::finish, main.statistics_json fails
Closed
MDEV-26885Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#2)
Closed
MDEV-26886Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#3)
Closed
MDEV-26892JSON histograms become invalid with a specific (corrupt) value in table
Closed
MDEV-26901Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#4)
Closed
MDEV-26911Unexpected ER_DUP_KEY, ASAN errors, double free detected in tcache with JSON_HB histogram
Closed
MDEV-27203Valgrind / MSAN errors in Histogram_json_hb::parse_bucket, main.statistics_json fails
Closed
MDEV-27229Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#5)
Closed
MDEV-27230Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#6)
Closed
MDEV-27243Estimation for filtered rows less precise with JSON histogram comparing to DOUBLE_PREC (#7) upon TIME comparison
Closed
MDEV-27492DOUBLE_PREC_HB histogram has poor estimates for BIT columns
Stalled
MDEV-27497DOUBLE_PREC_HB histogram produces wrong estimates for COMPRESSED columns
Here are the acceptance criteria I am planning to apply to this preview (minor modifications are possible):
1) At least 100x5min test runs for estimation precision checks should pass Note: This criterion is made-up to replace the absent requirements. Constants below were chosen in an arbitrary fashion and can be adjusted if there is a demand and interested parties agree. Note: "Special case" or "corner case" exceptions will be accepted, but only if they are officially documented in a form understandable to end users.
The test is performed on a non-debug build as follows:
two servers with mildly randomized identical configurations are run using the same binaries from the preview branch, test server running with JSON_HB and the baseline with DOUBLE_PREC_HB, histogram_size default;
the servers execute randomized single-table ANALYZE FORMAT=JSON SELECT statements on randomized table structures and data, with table size mostly ranging from 20 to 5000 rows;
only queries with rows value >= 20 are looked at;
only queries for which on the baseline server the precision of filtered estimation is within 5% are looked at;
an error is returned if the precision of filtered estimation on the test server is more than 20% off.
2) At least 100x5min test runs for histogram health checks and basic server stability should show no feature-specific problems or regressions comparing to vanilla 10.8.
The stress test is performed on a debug ASAN build with heavily randomized configurations as follows:
the standard generic regression test combinations (random DML+DDL) are used, with the addition of histogram_type=JSON_HB and frequent ANALYZE TABLE .. PERSISTENT FOR ALL;
periodically mysql.column_stats.histogram for existing JSON histograms is checked so that
JSON_VALID is true,
actual histogram size is less or equal than histogram_size value,
the total size of all buckets sums up to 1.
3) MTR test runs (subject to general skip lists) on an MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, with --mysqld=--histogram-type=JSON_HB, may only return legacy failures and understandable mismatch/test failures caused by the non-default option.
4) MTR test runs (subject to general skip lists) on an MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, may only return legacy failures.
Legend: as above
Test
Status
Updated
Last tested commit
Notes
Estimation precision
2022-01-16
348d3788
A few failures similar to those already ruled out as unrelated issues
Health/stability
2022-01-16
348d3788
18% tests failed on unrelated reasons
ASAN/MSAN/UBSAN big/nm+ps MTR
2022-01-16
746fc9e5
a few unrelated sporadic failures
ASAN/MSAN/UBSAN big/nm+ps MTR with JSON_HB
2022-01-16
746fc9e5
a few unrelated sporadic failures
Elena Stepanova
added a comment - - edited Here are the acceptance criteria I am planning to apply to this preview (minor modifications are possible):
1) At least 100x5min test runs for estimation precision checks should pass
Note: This criterion is made-up to replace the absent requirements. Constants below were chosen in an arbitrary fashion and can be adjusted if there is a demand and interested parties agree.
Note: "Special case" or "corner case" exceptions will be accepted, but only if they are officially documented in a form understandable to end users.
The test is performed on a non-debug build as follows:
two servers with mildly randomized identical configurations are run using the same binaries from the preview branch, test server running with JSON_HB and the baseline with DOUBLE_PREC_HB, histogram_size default;
the servers execute randomized single-table ANALYZE FORMAT=JSON SELECT statements on randomized table structures and data, with table size mostly ranging from 20 to 5000 rows;
only queries with rows value >= 20 are looked at;
only queries for which on the baseline server the precision of filtered estimation is within 5% are looked at;
an error is returned if the precision of filtered estimation on the test server is more than 20% off.
2) At least 100x5min test runs for histogram health checks and basic server stability should show no feature-specific problems or regressions comparing to vanilla 10.8.
The stress test is performed on a debug ASAN build with heavily randomized configurations as follows:
the standard generic regression test combinations (random DML+DDL) are used, with the addition of histogram_type=JSON_HB and frequent ANALYZE TABLE .. PERSISTENT FOR ALL;
periodically mysql.column_stats.histogram for existing JSON histograms is checked so that
JSON_VALID is true,
actual histogram size is less or equal than histogram_size value,
the total size of all buckets sums up to 1.
3) MTR test runs (subject to general skip lists) on an MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, with --mysqld=--histogram-type=JSON_HB , may only return legacy failures and understandable mismatch/test failures caused by the non-default option.
4) MTR test runs (subject to general skip lists) on an MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, may only return legacy failures.
Legend: as above
Test
Status
Updated
Last tested commit
Notes
Estimation precision
2022-01-16
348d3788
A few failures similar to those already ruled out as unrelated issues
Health/stability
2022-01-16
348d3788
18% tests failed on unrelated reasons
ASAN/MSAN/UBSAN big/nm+ps MTR
2022-01-16
746fc9e5
a few unrelated sporadic failures
ASAN/MSAN/UBSAN big/nm+ps MTR with JSON_HB
2022-01-16
746fc9e5
a few unrelated sporadic failures
I think preview-10.8-MDEV-26519-json-histograms as of commit 746fc9e5 can be pushed into 10.8 main and released with 10.8.1.
The difference between 746fc9e5 and 348d3788 is in test files only, so tests performed on 348d3788 do not need to be re-run.
Please note that testing of the feature was limited to its technical functionality (histogram precision) and absence of functional/stability regression. Due to the specifics of the feature, it doesn't reflect end user experience – users don't use histograms directly and their precision as such doesn't concern them, what's important is the final performance of query execution, and it wasn't a part of the feature testing as it was ruled out at early stages as irrelevant to the task scope. It is reasonable for the feature itself (MDEV-21130/MDEV-26519), but for enabling it by default (MDEV-27062) I would recommend some comparative performance testing, at least with widely-recognized workloads, possibly with the table structures tuned to extend histogram usage.
Elena Stepanova
added a comment - I think preview-10.8- MDEV-26519 -json-histograms as of commit 746fc9e5 can be pushed into 10.8 main and released with 10.8.1.
The difference between 746fc9e5 and 348d3788 is in test files only, so tests performed on 348d3788 do not need to be re-run.
Please note that testing of the feature was limited to its technical functionality (histogram precision) and absence of functional/stability regression. Due to the specifics of the feature, it doesn't reflect end user experience – users don't use histograms directly and their precision as such doesn't concern them, what's important is the final performance of query execution, and it wasn't a part of the feature testing as it was ruled out at early stages as irrelevant to the task scope. It is reasonable for the feature itself ( MDEV-21130 / MDEV-26519 ), but for enabling it by default ( MDEV-27062 ) I would recommend some comparative performance testing, at least with widely-recognized workloads, possibly with the table structures tuned to extend histogram usage.
For anyone wanting to preview the feature, see https://mariadb.org/10-7-preview-feature-json-histograms/