|
One can have an example with non "RQG data", also:
MariaDB [dbt3sf10]> select c_mktsegment, count(*)/@total from customer group by c_mktsegment;
|
+--------------+-----------------+
|
| c_mktsegment | count(*)/@total |
|
+--------------+-----------------+
|
| AUTOMOBILE | 0.2000 |
|
| BUILDING | 0.2002 |
|
| FURNITURE | 0.1997 |
|
| HOUSEHOLD | 0.1998 |
|
| MACHINERY | 0.2003 |
|
+--------------+-----------------+
|
5 rows in set (0.64 sec)
|
MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='AUTOMOBILE';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 100.00 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='BUILDING';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 48.33 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='FURNITURE';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 11.94 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 23.60 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='MACHINERY';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 9.57 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
MariaDB [dbt3sf10]> select *, hex(histogram) from mysql.column_stats where column_name='c_mktsegment' and table_name='customer'\G
|
*************************** 1. row ***************************
|
db_name: dbt3sf10
|
table_name: customer
|
column_name: c_mktsegment
|
min_value: AUTOMOBILE
|
max_value: MACHINERY
|
nulls_ratio: 0.0000
|
avg_length: 8.9998
|
avg_frequency: 300000.0000
|
hist_size: 200
|
hist_type:
|
histogram: ����������������������������������������jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj��������������������������������������������������������������������������������
|
hex(histogram): 00000000000000000000000000000000000000000000000000000000000000000000000000000000151515151515151515151515151515151515151515151515151515151515151515151515151515156A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A6A95959595959595959595959595959595959595959595959595959595959595959595959595959595FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
|
|
|
Note that: for
c_mktsegment ='AUTOMOBILE'
selectivity is 100%.
histogram shows 'AUTOMOBILE' is the minimum value.
Histogram has 200 buckets. first 40 of which are at 0%.. The remaining 160 buckets start at 0x15=20%.
|
|
Trying with the latest code, revid:psergey@askmonty.org-20140327090800-l3je8eqjrfx21iki. Selectivity for all above queries like
MariaDB [dbt3sf10]> explain extended select count(*) from customer where c_mktsegment ='HOUSEHOLD';
|
is 19.90%. THis is good.
One can also see that
mysql> explain extended select count(*) from customer where c_mktsegment ='NO-SUCH-MARKET';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 19.90 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
is still 19.90, but this seems to be a limitation of this type of histogram.
|
|
After MDEV-5962 was fixed:
mysql> explain extended select count(*) from customer where c_mktsegment ='ZZZZZZZZZZ';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 19.90 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
mysql> explain extended select count(*) from customer where c_mktsegment ='DDDDDDD';
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
| 1 | SIMPLE | customer | ALL | NULL | NULL | NULL | NULL | 1500000 | 0.30 | Using where |
|
+------+-------------+----------+------+---------------+------+---------+------+---------+----------+-------------+
|
That is, we will get incorrect estimates for values that
- are not within the min...max range
- are sufficiently close to the popular values (so much that they have the same position. value position has precision of 1/256th or 1/65536th for SINGLE/DOUBLE_PREC_HB).
for other non-existent values, we're able to return low selecitivity.
|
|
As for the original testcase:
if I run
[explain extended] select count(*), from t5 where col2=$i;
|
I get this:
'col1=... #rows filtered%
|
160 1 0.01
|
161 1 0.01
|
162 1 0.01
|
163 1 0.01
|
164 1 0.01
|
165 1 0.01
|
166 1 0.01
|
167 1 0.01
|
168 1 0.01
|
169 1 0.01
|
170 1 0.01
|
171 1 0.01
|
172 1 0.01
|
173 1 0.01
|
174 1 0.01
|
175 1 32.5
|
176 1 32.5
|
177 1 32.5
|
178 3301 32.5
|
179 1 32.5
|
180 1 32.5
|
181 1 32.5
|
182 1 32.5
|
183 1 32.5
|
184 1 0.07
|
185 1 0.07
|
186 1 0.07
|
187 1 0.07
|
188 1 0.07
|
189 1 0.07
|
190 1 0.07
|
That is, big number of values for value of 178 spreads over to its neighbors. One could expect that.
|
|
If one changes the histogram to use DOUBLE_PREC_HB, then the situation improves. 30% is only returned for the col2=178:
const #rows filtered% filtered%,
|
single_prec_hb double_prec_hb
|
160 1 0.01 0.01
|
161 1 0.01 0.01
|
162 1 0.01 0.01
|
163 1 0.01 0.01
|
164 1 0.01 0.01
|
165 1 0.01 0.01
|
166 1 0.01 0.01
|
167 1 0.01 0.01
|
168 1 0.01 0.01
|
169 1 0.01 0.01
|
170 1 0.01 0.01
|
171 1 0.01 0.01
|
172 1 0.01 0.01
|
173 1 0.01 0.11
|
174 1 0.01 0.11
|
175 1 32.5 0.11
|
176 1 32.5 0.11
|
177 1 32.5 0.11
|
178 3301 32.5 32.50
|
179 1 32.5 0.06
|
180 1 32.5 0.06
|
181 1 32.5 0.06
|
182 1 32.5 0.06
|
183 1 32.5 0.06
|
184 1 0.07 0.06
|
185 1 0.07 0.06
|
186 1 0.07 0.06
|
187 1 0.07 0.06
|
188 1 0.07 0.06
|
189 1 0.07 0.06
|
190 1 0.07 0.01
|
|
|
Closing as this is inherent problem of this type of histogram.
|