Details
Description
Histograms with equal-width bins are easy to construct using samples. For this it's enough
to look through the given sample set and for each value from it to figure out what bin this value can be placed in. Each bin requires only one counter.
Let f be a column of a table with N rows and n be the number of samples by which the equal-width histogram of k bins for this column is constructed. Let after looking through all sample
rows the counters created for the histogram bins contain numbers c[1],..,c[k]. Then
m[i]= c[i]/n * 100 is the percentage of the rows whose values of f are expected to be in the interval
(max(f)-min(f))/k *(i-1), max(f)-min(f))/k *i-1).
|
It means that if the sample rows have been chosen randomly the expected number of rows with the values of f from this interval can be approximated by the number m[i]*/100 * N.
To collect such statistics it is suggested to use the following variant of the ANALYZE TABLE command:
ANALYZE FAST TABLE tbl [ WITH n ROWS ] [SAMPLING p PERCENTS ]
|
PERSISTENT FOR COLUMNS (col1 [IN RANGE r] [WITH k INTERVALS],...)
|
Here:
- 'WITH n ROWS' provides an estimate for the number of rows in the table in the case when this estimate cannot be obtained from statistical data.
- 'SAMPLING p PERCENTS' provides the percentage of sample rows to collect statistics.
If this is omitted the number is taken from the system variable samples_ratio. - 'IN RANGE r' sets the range of equal-width bins of the histogram built for the column col1. If this is omitted then and min and max values for the column can be read from statistical data
then the histogram is built for the range [min(col1), max(col1)]. Otherwise the range [MIN_type(col1), MAX_type(col1) is considered]. The values beyond the given range, if any, are also is taken into account in two additional bins. - WITH k INTERVALS says how many bins are included in the histogram. If it is omitted this value is taken from the system variable histogram_size.