|
Why would that be a bug? Without ANALYZE TABLE the statistics can be off, it's expected.
|
|
The statistics for idx_obs_daily_2 are so far off the 2-8K range observed on other nodes which have the same data. 2.7M is no where near that.
Its highly unlikely that a statistical sampling of idx_obs_daily_2(dtDate) sampling would exactly match idx_obs_daily_1(sLocType).
Something in the automated updated statistics caused these to be the same in error.
innodb_stats_sample_pages = 8 as per default.
MariaDB [weather]> select count(*) from obs_daily;
|
+----------+
|
| count(*) |
|
+----------+
|
| 23570427 |
|
+----------+
|
|
|
MariaDB [weather]> select count(*) from obs_daily group by dtDate;
|
most values where between 2-8K, a few a 17K and a few at 23K
|
3784 rows in set (57.58 sec)
|
|
|
MariaDB [weather]> select sLocType,count(*) from obs_daily group by sLocType;
|
+----------+----------+
|
| sLocType | count(*) |
|
+----------+----------+
|
| XXXX | 3722119 |
|
| YYYY | 86833 |
|
| ZZZZ | 30591 |
|
| AAAA | 19730887 |
|
+----------+----------+
|
4 rows in set (40.76 sec)
|
|
|
|
It's fairly easy to reproduce, see below.
I have a gut feeling that there is a decent reason for this; but I'm nowhere near the level of expertise in XtraDB required to provide the actual explanation, so I'm assigning it to Jan for that.
drop table if exists obs_daily;
|
CREATE TABLE `obs_daily` ( `iId` int(11) NOT NULL AUTO_INCREMENT, `sLocType` varchar(10) NOT NULL DEFAULT '', `sLocCode` varchar(30) NOT NULL DEFAULT '', `dtDate` date NOT NULL DEFAULT '0000-00-00', PRIMARY KEY (`iId`), UNIQUE KEY `idx_obs_daily_1` (`sLocType`,`sLocCode`,`dtDate`), KEY `idx_obs_daily_2` (`dtDate`) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
|
insert into obs_daily values (null, 'A', 'c1', '2014-11-11'),(null, 'A', 'c2', '2014-11-11'),(null, 'A', 'c3', '2014-11-11'),(null, 'A', 'c4', '2014-11-11'),(null, 'A', 'c5', '2014-11-11'),(null, 'A', 'c6', '2014-11-11'),(null, 'A', 'c7', '2014-11-11'),(null, 'A', 'c8', '2014-11-11');
|
insert into obs_daily select null, 'A', sLocCode, '2014-08-12' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-13' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-12' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-11' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-10' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-16' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-19' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-22' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-21' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-31' from obs_daily;
|
insert into obs_daily select null, 'A', concat(sLocCode,iId), '2014-08-30' from obs_daily;
|
insert into obs_daily select null, 'A', concat('code-1-',iId), '2014-08-31' from obs_daily;
|
insert into obs_daily select null, 'A', concat('code-2-',iId), '2014-08-31' from obs_daily;
|
insert into obs_daily select null, 'A', concat('code-2-',iId), '2014-08-29' from obs_daily;
|
insert into obs_daily select null, 'A', concat('code-3-',iId), '2014-08-09' from obs_daily;
|
insert into obs_daily select null, 'A', concat('code-4-',iId), '2014-08-05' from obs_daily;
|
insert into obs_daily select null, 'A', concat('code-5-',iId), '2014-08-03' from obs_daily;
|
MariaDB [test]> select count(*) from obs_daily;
|
+----------+
|
| count(*) |
|
+----------+
|
| 1048576 |
|
+----------+
|
1 row in set (2.22 sec)
|
|
MariaDB [test]> select count(distinct sLocType) from obs_daily;
|
+--------------------------+
|
| count(distinct sLocType) |
|
+--------------------------+
|
| 1 |
|
+--------------------------+
|
1 row in set (0.01 sec)
|
|
MariaDB [test]> select count(distinct sLocCode) from obs_daily;
|
+--------------------------+
|
| count(distinct sLocCode) |
|
+--------------------------+
|
| 1007624 |
|
+--------------------------+
|
1 row in set (5.65 sec)
|
|
MariaDB [test]> select count(distinct dtDate) from obs_daily;
|
+------------------------+
|
| count(distinct dtDate) |
|
+------------------------+
|
| 15 |
|
+------------------------+
|
1 row in set (0.00 sec)
|
MariaDB [test]> select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily';
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2788 | 2780 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 116494, 0, 0 | 3564 | 3076 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 116494, 1 | 801 | 796 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
MariaDB [test]> set global innodb_stats_sample_pages = 8;
|
Query OK, 0 rows affected (0.00 sec)
|
|
MariaDB [test]> analyze table obs_daily;
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.01 sec)
|
|
MariaDB [test]> select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily';
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 115617, 1, 1 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 9459, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
MariaDB [test]> set global innodb_stats_sample_pages = 3000;
|
Query OK, 0 rows affected (0.00 sec)
|
|
MariaDB [test]> analyze table obs_daily;
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (3.61 sec)
|
|
MariaDB [test]> select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily';
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 1048023, 1, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 131002, 1 | 865 | 814 |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
MariaDB [test]> set global innodb_stats_sample_pages = 8;
|
Query OK, 0 rows affected (0.00 sec)
|
|
MariaDB [test]> analyze table obs_daily;
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.02 sec)
|
|
MariaDB [test]> select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily';
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 115183, 0, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 115183, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
MariaDB [test]> analyze table obs_daily;
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.02 sec)
|
|
MariaDB [test]> select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily';
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 116842, 1, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 116842, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
MariaDB [test]> analyze table obs_daily;
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.02 sec)
|
|
MariaDB [test]> select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily';
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 116407, 1, 1 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 116407, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
|
I do not think this is a bug, this is how the table statictics calculation is designed to work. If you set the number of analyzed pages to very low number compared to actual pages on that table/index it randomly pics those pages (default 8 pages), this leads to fact that query after analyze table returns different results. If the index tree is small, smaller than 10 * n_sample_pages + total_external_size, then the estimate is ok. For bigger index trees it is common that we do not see any borders between key values in the few pages we pick. But still there may be n_sample_pages different key values, or even more. And it just tries to approximate to n_sample_pages (8) this leads to fact that we calculate n_rows / n_sample_pages+estimate = 116407.
|
|
In both Serg's reply and Jan's later analysis there is a problem of technical correctness versus usability.
We can argue technical correctness until we're blue in the face, but that doesn't make it usable for clients.
The objective of the table stats is to enable the optimiser to make sensible decisions about index usage. If that objective is not attained, the system doesn't work - so if the current system works as designed, then the design is inadequate for current user needs and needs adjustment or replacement.
I agree that stats can change when ANALYZE TABLE is run. That's not a debate.
I think we can reduce the issue to the fact that the number of sample pages is static. Historically it was even a constant, but either way the reason was that InnoDB didn't know how many rows or pages it had in a table, thus dynamically scaling the # of sample pages was not possible.
However, with persistent stats tables InnoDB does have an indication of how many index and leaf pages it has, and thus we can derive a basis for dynamic scaling of the # of sample pages, appropriate to the size of the table.
Any production deployment will have a variety of table sizes, so just increasing the # of sample pages globally is not sensible. It incurs a performance impact and the benefit for many tables will be negligible. I fully agree with Jan that for a larger table, the # of sample pages should be larger in order to be able to get a sensible estimate. Let's make it so. InnoDB now has sufficient info to make that decision. So let's update the logic design.
Thanks
|
|
I agree with Arjen, for analyze table we could safely use number of pages for sample size we have time. But for INSERT, DELETE, and UPDATE selecting a good sample size is not that easy. Consider case where table has only a few distict values on particular index, to find out that in worst case we would need to sample almost all pages and if there is a lot of pages this would have significant negative effect to performance. For example in above case if we pick 10% pages as sample size we end up analyzing about 300 pages from idx_obs_daily_1 and about 80 pages from idx_obs_daily_2.
In current implementation above insert times:
Query OK, 512 rows affected (0.29 sec)
|
Query OK, 1024 rows affected (0.66 sec)
|
Query OK, 2048 rows affected (2.06 sec)
|
Query OK, 4096 rows affected (4.10 sec)
|
Query OK, 8192 rows affected, 52 warnings (2.85 sec)
|
Query OK, 16384 rows affected (1.31 sec)
|
Query OK, 32768 rows affected (3.53 sec)
|
Query OK, 65536 rows affected (7.47 sec)
|
Query OK, 131072 rows affected (13.30 sec)
|
Query OK, 262144 rows affected (28.45 sec)
|
Query OK, 524288 rows affected (58.24 sec)
|
Lets change the number of analyzed pages as follows:
if (srv_stats_sample_pages > index->stat_index_size) {
|
if (index->stat_index_size > 0) {
|
n_sample_pages = index->stat_index_size;
|
} else {
|
n_sample_pages = 1;
|
}
|
} else {
|
if (index->stat_index_size <= 20) {
|
n_sample_pages = index->stat_index_size;
|
} else {
|
/* If we have enough pages, analyze at least 10% of them */
|
n_sample_pages = index->stat_index_size * 0.1;
|
}
|
}
|
With this change insert times are:
Query OK, 512 rows affected (0.31 sec)
|
Query OK, 1024 rows affected (0.78 sec)
|
Query OK, 2048 rows affected (1.97 sec)
|
Query OK, 4096 rows affected (4.01 sec)
|
Query OK, 8192 rows affected, 52 warnings (2.52 sec)
|
Query OK, 16384 rows affected (1.24 sec)
|
Query OK, 32768 rows affected (3.48 sec)
|
Query OK, 65536 rows affected (7.10 sec)
|
Query OK, 131072 rows affected (13.14 sec)
|
Query OK, 262144 rows affected (26.14 sec)
|
Query OK, 524288 rows affected (1 min 2.64 sec)
|
However, 10% is not yet enough in this case always:
set global innodb_stats_sample_pages = 8
|
|
analyze table obs_daily
|
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.03 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 1050943, 1, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 1050943, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
analyze table obs_daily
|
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.02 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 1049616, 1, 1 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 104961, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
|
|
For calculating a sample size, you wouldn't want a %/fraction, but some logarithmic scale - otherwise indeed for big tables it goes silly - with no statistical advantage.
How about sample_pages = max((total_pages > 4 ? 4 : total_pages),log2(total_pages));
That should provide at least a sane statistical basis for ANALYZE.
If you want to make it a bit more agressive, you could make it log2(total_pages) * 2 or something like that.
Obviously innodb_stats_sample_pages variable becomes obsolete with this new approach; if you want, you could make it the multiplier. That is, log2(total_pages) * innodb_stats_sample_pages then people can tweak it but you still have a logarithmic scale.
Can you fill in me in on what sampling is required for INSERT/UPDATE/DELETE and why a larger sample size slows down the inserts? Are you referring to updating the statistics after writes, or something else? If it's stats updates, then again the the logarithmic formula should work much better.
Can we try that and see how it looks? It won't be perfect, but I reckon it's likely to be much better than what we have now. Not necessarily on this particular table, but in general. It's no use trying to optimise for this one case at the expense of everything and everyone else.
thanks
|
|
Setting value to 30% insert times are:
Query OK, 65536 rows affected (8.49 sec)
|
Query OK, 131072 rows affected (14.64 sec)
|
Query OK, 262144 rows affected (37.52 sec)
|
Query OK, 524288 rows affected (1 min 4.73 sec)
|
and we finally start seeing about correct estimates:
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.06 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 1050261, 1, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 65641, 1 | 865 | 814 |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.06 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 1051542, 1, 1 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 105154, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+---------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
|
When Does InnoDB Update Table Statistics? MySQL and InnoDB plugin’s manual,
-
- Metadata commands like SHOW INDEX, SHOW TABLE STATUS and SHOW [FULL] TABLES (or their corresponding queries from INFORMATION_SCHEMA.TABLES and INFORMATION_SCHEMA.STATISTICS)
- When 1 / 16th of the table or 2Billion rows has been modified, whichever comes first. ./row/row0mysql.c:row_update_statistics_if_needed
thus if INSERT,UPDATE or DELETE modifies 1/16th of the table we do sampling again.
|
|
I modified the calculation to be:
/* It makes no sense to test more pages than are contained
|
in the index, thus we lower the number if it is too high */
|
if (srv_stats_sample_pages > index->stat_index_size) {
|
if (index->stat_index_size > 0) {
|
n_sample_pages = index->stat_index_size;
|
} else {
|
n_sample_pages = 1;
|
}
|
} else {
|
n_sample_pages = max((index->stat_index_size > srv_stats_sample_pages ? srv_stats_sample_pages : index->stat_index_size),
|
log2(index->stat_index_size)*srv_stats_sample_pages);
|
}
|
In above case that means that we analyze about 98 pages from index idx_obs_daily1 and about 78 pages from index idx_obs_daily2 when srv_stats_sample_pages = 8. Insert speed:
Query OK, 65536 rows affected (7.46 sec)
|
Query OK, 131072 rows affected (15.38 sec)
|
Query OK, 262144 rows affected (28.98 se
|
Query OK, 524288 rows affected (56.72 sec)
|
and actual statistics:
set global innodb_stats_sample_pages = 8
|
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.01 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 262258, 1, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 87419, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.01 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 262914, 0, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 87638, 1 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.01 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 262719, 1, 1 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 47767, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
I will see if I can instrument code to know is this analyze table (do full table analysis) or normal case (do simple analysis).
|
|
Hi Jan
Thanks for that.
Sorry can I please improve my own code suggestion? 
From
n_sample_pages = max((index->stat_index_size > srv_stats_sample_pages ? srv_stats_sample_pages : index->stat_index_size),
log2(index->stat_index_size)*srv_stats_sample_pages);
to
n_sample_pages = max(min(srv_stats_sample_pages,index->stat_index_size),log2(index->stat_index_size)*srv_stats_sample_pages);
Regarding the 1 / 16th of the table or 2Billion rows modified, perhaps that should have a lower upper bound.
1/16th can be a lot, when dealing with a big table. Thus the stats wouldn't yet get updated even though the data profile could be completely different long before.
For instance, in the current code, for 16 million rows, a million rows would have to change for the updating to happen.
Perhaps that's worth a new system variable as people can decide for their situation... but as a suggestion to test, 1/16th or 100,000 rows modified, whichever comes first?
As a rule of thumb from the real world - when that many rows get modified, the index stats are very likely to be different at least for some of the indexes; it's definitely worthwhile recalculating as the small time penalty for that exercise is offset by the improved performance due to the optimiser making better choices.
Thanks
|
|
With that sample we get:
set global innodb_stats_sample_pages = 8
|
Query OK, 0 rows affected (0.00 sec)
|
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.01 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 262383, 1, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 524766, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.01 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 263219, 1, 0 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 526438, 1 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
|
analyze table obs_daily
|
+----------------+---------+----------+----------+
|
| Table | Op | Msg_type | Msg_text |
|
+----------------+---------+----------+----------+
|
| test.obs_daily | analyze | status | OK |
|
+----------------+---------+----------+----------+
|
1 row in set (0.02 sec)
|
|
select * from information_schema.INNODB_INDEX_STATS where table_name='obs_daily'
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| table_schema | table_name | index_name | fields | rows_per_key | index_total_pages | index_leaf_pages |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
| test | obs_daily | PRIMARY | 1 | 1 | 2852 | 2844 |
|
| test | obs_daily | idx_obs_daily_1 | 3 | 260039, 1, 1 | 3628 | 3138 |
|
| test | obs_daily | idx_obs_daily_2 | 2 | 86679, 0 | 865 | 814 |
|
+--------------+------------+-----------------+--------+--------------+-------------------+------------------+
|
3 rows in set (0.00 sec)
|
Desided not to touch too much GA release, but maybe I can introduce a new configuration variable srv_stats_modified_counter:
/**
|
|
If user has provided upper bound for how many rows needs to be updated
|
before we calculate new statistics we use minimum of provided value
|
and 1/16 of table every 16th round. If no upper bound is provided
|
(srv_stats_modified_counter = 0, default) then calculate new statistics
|
if 1 / 16 of table has been modified
|
since the last time a statistics batch was run.
|
We calculate statistics at most every 16th round, since we may have
|
a counter table which is very small and updated very often.
|
@param t table
|
@return true if the table has changed too much and stats need to be
|
recalculated
|
*/
|
#define DICT_TABLE_CHANGED_TOO_MUCH(t) \
|
((ib_int64_t) (t)->stat_modified_counter > (srv_stats_modified_counter ? \
|
ut_min(srv_stats_modified_counter, (16 + (t)->stat_n_rows / 16)) : \
|
16 + (t)->stat_n_rows / 16))
|
|
|
In my opinion analyze table should calculate these statistics by doing a full table/index scan that would then lead to correct values. But that change is far too big for GA releases. Above changes are safe for GA-releases with a addition configration variable innodb_stats_sample_traditional boolean default false where we use new logaritmic number of sampled pages and true where we use only configured number of sampled pages. This to offer users a way to keep current behavior.
|
|
revno: 4368
committer: Jan Lindström <jplindst@mariadb.org>
branch nick: 5.5
timestamp: Wed 2014-11-19 20:27:34 +0200
message:
MDEV-7084: innodb index stats inadequate using constant
innodb_stats_sample_pages
Analysis: If you set the number of analyzed pages
to very low number compared to actual pages on
that table/index it randomly pics those pages
(default 8 pages), this leads to fact that query
after analyze table returns different results. If
the index tree is small, smaller than 10 *
n_sample_pages + total_external_size, then the
estimate is ok. For bigger index trees it is
common that we do not see any borders between
key values in the few pages we pick. But still
there may be n_sample_pages different key values,
or even more. And it just tries to
approximate to n_sample_pages (8).
Fix: (1) Introduced new dynamic configuration variable
innodb_stats_sample_traditional that retains
the current design. Default false.
(2) If traditional sample is not used we use
n_sample_pages = max(min(srv_stats_sample_pages,
index->stat_index_size),
log2(index->stat_index_size)*
srv_stats_sample_pages);
(3) Introduced new dynamic configuration variable
stat_modified_counter (default = 0) if set
sets lower bound for row updates when statistics is re-estimated.
If user has provided upper bound for how many rows needs to be updated
before we calculate new statistics we use minimum of provided value
and 1/16 of table every 16th round. If no upper bound is provided
(srv_stats_modified_counter = 0, default) then calculate new statistics
if 1 / 16 of table has been modified
since the last time a statistics batch was run.
We calculate statistics at most every 16th round, since we may have
a counter table which is very small and updated very often.
@param t table
@return true if the table has changed too much and stats need to be
recalculated
*/
#define DICT_TABLE_CHANGED_TOO_MUCH(t) \
((ib_int64_t) (t)->stat_modified_counter > (srv_stats_modified_counter ?
\
ut_min(srv_stats_modified_counter, (16 + (t)->stat_n_rows / 16)) : \
16 + (t)->stat_n_rows / 16))
|
|
Excellent, thanks for that Jan.
A few remarks
Now, stat_modified_counter defaults to 0 according to your commit docu.
But did you completely replace the previous static upper bound of 2 billion rows?
In that case, wouldn't it be easier (and backward consistent) to make stat_modified_counter default to 2 billion? That is, really replace the constant with that dynamic variable.
Typo note: the description is correct but the (3) heading introduction says "lower bound" instead of "upper bound".
Did you remove the whole if construct and replace it with the new single line? Because as far as I can see, it does the same thing. The following can be removed:
if (srv_stats_sample_pages > index->stat_index_size) {
if (index->stat_index_size > 0)
{
n_sample_pages = index->stat_index_size;
}
else
{
n_sample_pages = 1;
}
} else {
I see you put this into the 5.5 branch - that's excellent as some of our clients are still on 5.5.
I trust it'll also be in 10.0 ?
Can you give an indication of which minor releases of 5.5 and 10.0 this patch will be in?
Will you also submit these patches upstream to Oracle and/or Percona?
thanks
|
|
Hi,
I could not find that 2B rule from code, I did not by default change this 1/16 upper bound because 5.5 is GA, I only provided a way to bypass that. I replaced the sample page calculation with:
if (srv_stats_sample_traditional) {
|
/* It makes no sense to test more pages than are contained
|
in the index, thus we lower the number if it is too high */
|
if (srv_stats_sample_pages > index->stat_index_size) {
|
if (index->stat_index_size > 0) {
|
n_sample_pages = index->stat_index_size;
|
} else {
|
n_sample_pages = 1;
|
}
|
} else {
|
n_sample_pages = srv_stats_sample_pages;
|
}
|
} else {
|
/* New logaritmic number of pages that are estimated. We
|
first pick minimun from srv_stats_sample_pages and number of
|
pages on index. Then we pick maximum from previous number of
|
pages and log2(number of index pages) * srv_stats_sample_pages. */
|
if (index->stat_index_size > 0) {
|
n_sample_pages = ut_max(ut_min(srv_stats_sample_pages, index->stat_index_size),
|
log2(index->stat_index_size)*srv_stats_sample_pages);
|
} else {
|
n_sample_pages = 1;
|
}
|
}
|
New method is used by default but I provided a way to revert to original method. This patch is most likely released on 5.5.41 and 10.0.16. I will provide patches to Percona (https://bugs.launchpad.net/percona-server/+bug/1394483) and Oracle (http://bugs.mysql.com/bug.php?id=74935).
|
|
Decided to turn off by default the new statistics calculation i.e. innodb_stats_traditional=true, this is because there could be regression to customer databases when table is opened.
------------------------------------------------------------
revno: 4374
committer: Jan Lindström <jplindst@mariadb.org>
branch nick: 5.5
timestamp: Fri 2014-11-21 13:32:53 +0200
message:
Forgot to add test file.
------------------------------------------------------------
revno: 4372
committer: Jan Lindström <jplindst@mariadb.org>
branch nick: 5.5
timestamp: Fri 2014-11-21 13:27:36 +0200
message:
MDEV-7084: innodb index stats inadequate using constant innodb_stats_sample_pages
Use traditional statistics estimation by default (innodb-stats-traditional=true).
There could be performance regression for customers if there is a lot of
open table operations.
|
|
Thanks Jan for the quick implementation.
Unless I'm mistaken, `log2(index->stat_index_size)*srv_stats_sample_pages)` could be larger than index->stat_index_size (e.g stat_index_size =srv_stats_sample_pages = 8) -> s_sample_pages=log2(8)*8 = 24 so in this case 24 pages are sampled when there are only 8.
so suggest the following to constrain n_sample pages to between srv_stats_sample_pages and stat_index_size:
n_sample_pages = ut_min(index->stat_index_size, ut_max(srv_stats_sample_pages, log2(index->stat_index_size)*srv_stats_sample_pages))
|
Technically only a stat_index_size of 1 or 2 (log2(2)=1) will result in ut_max() returning srv_stats_sample_pages being returned as a maximum. so then entire srv_stats_sample_traditional==false case is:
n_sample_pages = (index->stat_index_size<=1) ? 1 : ut_min(index->stat_index_size, log2(index->stat_index_size)*srv_stats_sample_pages);
|
|
|
I think you are correct and because srv_stats_sample_pages can be > 10 we should have:
if (srv_stats_sample_traditional) {
|
/* It makes no sense to test more pages than are contained
|
in the index, thus we lower the number if it is too high */
|
if (srv_stats_sample_pages > index->stat_index_size) {
|
if (index->stat_index_size > 0) {
|
n_sample_pages = index->stat_index_size;
|
} else {
|
n_sample_pages = 1;
|
}
|
} else {
|
n_sample_pages = srv_stats_sample_pages;
|
}
|
} else {
|
/* New logaritmic number of pages that are estimated. We
|
first pick minimun from srv_stats_sample_pages and number of
|
pages on index. Then we pick maximum from previous number of
|
pages and log2(number of index pages) * srv_stats_sample_pages. */
|
if (index->stat_index_size > 0) {
|
n_sample_pages = ut_min(index->stat_index_size,
|
ut_max(ut_min(srv_stats_sample_pages, index->stat_index_size),
|
log2(index->stat_index_size)*srv_stats_sample_pages));
|
} else {
|
n_sample_pages = 1;
|
}
|
}
|
|
|
Maths makes some of these paths unreachable.
So taking index size as I and sample as S and log(I)*S as L
so the current expression is min(I, max( min(S,I), L)
case 1: assume S < I
min(I, max( min(S,I), L) becomes
min(I , max( S, L))
but since L=LOG2(I)*S and log2(I) >=1 L>S always so max(S,L) = L.
so we have:
min(I , L)
case 2: assume I < S
min(I, max( min(S,I), L) becomes
min(I, max( I, L))
case 2a: L > I
min(I, max( I, L))
becomes
min(I, L)
becomes I
case 2b: when L < I
min(I, max( I, L))
becomes
min(I, I )
becomes I
so our expression is:
n_pages = S < I ? min(I,L) : I
hence the simple version:
} else {
|
/* New logarithmic number of pages that are estimated.
|
If we have only 0 or 1 index pages then we can only take 1 sample.
|
|
So taking index size as I and sample as S and log(I)*S as L
|
|
requirement 1) we want the out limit of the expression to not exceed I;
|
requirement 2) we want the ideal pages to be at least S;
|
so the current expression is min(I, max( min(S,I), L)
|
|
looking for simplifications:
|
|
case 1: assume S < I
|
min(I, max( min(S,I), L) -> min(I , max( S, L))
|
|
but since L=LOG2(I)*S and log2(I) >=1 L>S always so max(S,L) = L.
|
|
so we have: min(I , L)
|
|
case 2: assume I < S
|
min(I, max( min(S,I), L) -> min(I, max( I, L))
|
|
case 2a: L > I
|
min(I, max( I, L)) -> min(I, L) -> I
|
|
case 2b: when L < I
|
min(I, max( I, L)) -> min(I, I ) -> I
|
|
so taking all case2 paths is I, our expression is:
|
n_pages = S < I? min(I,L) : I
|
*/
|
if (index->stat_index_size > 1) {
|
n_sample_pages = (srv_stats_sample_pages < index->stat_index_size) ? ut_min(index->stat_index_size, log2(index->stat_index_size)*srv_stats_sample_pages) : index->stat_index_size;
|
}
|
else
|
{
|
n_sample_pages = 1;
|
}
|
}
|
edited: to fix a transposition in the comments
|
|
Thanks, I really should educate myself more on math.
|
|
You're welcome. If you hadn't come back with a longer form I woudn't have done the proper proof and the slight improvement on my original 6 hrs ago wouldn't have been realised.
|
|
thanks Jan. I see you committed this r4377
much appreciate the credit in the commit too.
|