[MDEV-26519] JSON Histograms: improve histogram collection Created: 2021-09-01 Updated: 2023-03-21 Resolved: 2022-01-21 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Optimizer |
| Fix Version/s: | 10.8.1 |
| Type: | Task | Priority: | Blocker |
| Reporter: | Sergei Petrunia | Assignee: | Sergei Petrunia |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | Preview_10.7, Preview_10.8, eits | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
The histograms as defined in 1. Popular values should have their own bucketsFor example, consider a table with these values (each character is one row):
A histogram with 8 buckets will have these buckets:
The estimation function will be able to see that 'a' is a popular value, it will determine its frequency with bucket_size precision. However, a better approach would be to put all 'a' values in one bucket:
This will give a more precise number for table 'a'. Other databases
2. Store the number of distinct values in each bucketAgain, following MySQL here: store number of distinct values in each bucket. For strings, store only prefixStore only a 40-character prefix. |
| Comments |
| Comment by Sergei Petrunia [ 2021-09-07 ] | |||||||||||||||||||||||||
|
Note for those who were asking "why it is not done like in MySQL-8" : MySQL 8 has some properties that I doubt we will want to have: https://bugs.mysql.com/bug.php?id=104789 | |||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-09-10 ] | |||||||||||||||||||||||||
|
Suggestion #1: "Let each value that's larger than one bucket be in its own bucket" However, this may cause us to generate close to 2*N buckets (where N is the
Make each "popular" value take a bit more than bucket_size rows. The table can Between each two consequent popular values, put a few rows with unpopular | |||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-09-10 ] | |||||||||||||||||||||||||
|
Suggestion #2: Here, D is not put into a separate bucket:
Here F is put into a separate bucket:
For values that take between 1 and 2 buckets, getting their own bucket becomes A popular value that is put into a larger bucket "saves" buckets by using fewer buckets. (The alternative is to make the next buckets have fewer rows but we are not | |||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-09-10 ] | |||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2021-09-10 ] | |||||||||||||||||||||||||
|
The patch implements Suggestion #2 described above. | |||||||||||||||||||||||||
| Comment by Ian Gilfillan [ 2021-09-22 ] | |||||||||||||||||||||||||
|
For anyone wanting to preview the feature, see https://mariadb.org/10-7-preview-feature-json-histograms/ | |||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2021-12-08 ] | |||||||||||||||||||||||||
|
The 10.8 branch to test is preview-10.8- | |||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2021-12-14 ] | |||||||||||||||||||||||||
|
Here are the acceptance criteria I am planning to apply to this preview (minor modifications are possible):
2) At least 100x5min test runs for histogram health checks and basic server stability should show no feature-specific problems or regressions comparing to vanilla 10.8.
3) MTR test runs (subject to general skip lists) on an MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, with --mysqld=--histogram-type=JSON_HB, may only return legacy failures and understandable mismatch/test failures caused by the non-default option. 4) MTR test runs (subject to general skip lists) on an MSAN/ASAN/UBSAN debug builds, big+nm and big+ps, may only return legacy failures. Legend: as above
| |||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2022-01-17 ] | |||||||||||||||||||||||||
|
I think preview-10.8- The difference between 746fc9e5 and 348d3788 is in test files only, so tests performed on 348d3788 do not need to be re-run. Please note that testing of the feature was limited to its technical functionality (histogram precision) and absence of functional/stability regression. Due to the specifics of the feature, it doesn't reflect end user experience – users don't use histograms directly and their precision as such doesn't concern them, what's important is the final performance of query execution, and it wasn't a part of the feature testing as it was ruled out at early stages as irrelevant to the task scope. It is reasonable for the feature itself ( | |||||||||||||||||||||||||
| Comment by Sergei Petrunia [ 2022-01-21 ] | |||||||||||||||||||||||||
|
Pushed into 10.8. JSON_HB histograms are NOT enabled by default. |