[MDEV-12313] Histograms with equal-width bins in MariaDB Created: 2017-03-20  Updated: 2020-03-31

Status: Open
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: None

Type: Task Priority: Major
Reporter: Igor Babaev Assignee: Vicențiu Ciorbaru
Resolution: Unresolved Votes: 0
Labels: eits, gsoc17, gsoc18, gsoc19


 Description   

Histograms with equal-width bins are easy to construct using samples. For this it's enough
to look through the given sample set and for each value from it to figure out what bin this value can be placed in. Each bin requires only one counter.
Let f be a column of a table with N rows and n be the number of samples by which the equal-width histogram of k bins for this column is constructed. Let after looking through all sample
rows the counters created for the histogram bins contain numbers c[1],..,c[k]. Then
m[i]= c[i]/n * 100 is the percentage of the rows whose values of f are expected to be in the interval

 (max(f)-min(f))/k *(i-1), max(f)-min(f))/k *i-1).

It means that if the sample rows have been chosen randomly the expected number of rows with the values of f from this interval can be approximated by the number m[i]*/100 * N.

To collect such statistics it is suggested to use the following variant of the ANALYZE TABLE command:

ANALYZE FAST TABLE tbl [ WITH n ROWS ] [SAMPLING p PERCENTS ]
   PERSISTENT FOR COLUMNS (col1 [IN RANGE r] [WITH k INTERVALS],...)

Here:

  • 'WITH n ROWS' provides an estimate for the number of rows in the table in the case when this estimate cannot be obtained from statistical data.
  • 'SAMPLING p PERCENTS' provides the percentage of sample rows to collect statistics.
    If this is omitted the number is taken from the system variable samples_ratio.
  • 'IN RANGE r' sets the range of equal-width bins of the histogram built for the column col1. If this is omitted then and min and max values for the column can be read from statistical data
    then the histogram is built for the range [min(col1), max(col1)]. Otherwise the range [MIN_type(col1), MAX_type(col1) is considered]. The values beyond the given range, if any, are also is taken into account in two additional bins.
  • WITH k INTERVALS says how many bins are included in the histogram. If it is omitted this value is taken from the system variable histogram_size.


 Comments   
Comment by Teodor-Vicentiu Niculescu [ 2018-05-29 ]

Project Update

All changes can be seen on my github page:
https://github.com/teodorniculescu/server

What I have done so far:

  • implemented grammar for the new query ( sql/sql_yacc.yy )

ANALYZE FAST TABLE table [ WITH n ROWS ] [ SAMPLING p PERCENT ] [ LASTS s SECONDS ]
PERSISTENT FOR COLUMNS (column_1 [IN RANGE r] [WITH k INTERVALS],...)

  • written error cases ( sql/share/errmsg-utf8.txt )
  • created a test for the new query ( mysql-test/t/analyze_fast.test && mysql-test/r/analyze_fast.result )

Follow-up:

  • create a composite data type that stores the equal-width bins
  • create the basic structure of the function which stores the equal-width bins, based on collect_statistics_for_table ( sql/sql_statistics.cc )
  • modify the function in order to store samples
Comment by Teodor-Vicentiu Niculescu [ 2018-07-12 ]

Project Update

All changes are on the MariaDB 10.3 fork on my GitHub page:
https://github.com/teodorniculescu/server

Commit hash:
8cb57d3b3f7e2fea02e3851e8f67a94845d549ff

What is new:

  • created a new function 'collect_fast_statistics_for_table' which creates Equal-Width Histograms (sql/sql_statistics.cc);
  • the function is called by the 'ANALYZE FAST' query;
  • the Equal-Width Histogram is stored in the same data structure as the equal-height histogram (uchar* values);
  • the Equal-Width Histogram is constructed using a full table scan;
  • works with both Single and Double precision (as shown in the test case mysql-test/t/analyze_fast.test);

Follow-up:

  • change the function in order to samples values;
  • add a lot of values (~10.000) to the test case in order to show that it works as intended;
  • toggle sampling according to the number of entries in the table (if there are too few entries, there is no need for sampling);

Any advice, criticism or questions will be greatly appreciated!

Comment by Sergei Petrunia [ 2018-10-05 ]

Notes from discussion with Igor

  • How will the code compute the estimated number of distinct values from the sample? (there's a formula for that. are we using it?)
  • What is the overhead of doing a full table scan as opposed to doing sampling (need to experiment)?
  • Using a secondary index to drive sampling as opposed to using a primary: this is beneficial because the secondary index is smaller.
  • PG's approach is not "sample X% of the table". They sample ~30K rows regardless of table size (there's a paper that says sufficient sample size is ~log(table_rows)).
  • mysql.column_stats.histogram. Do we keep it VARCHAR(255) or change to a bigger datatype?
Generated at Thu Feb 08 07:56:45 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.