[MDEV-9097] Engine independent statistics only allows utf8 values, making query plans fail Created: 2015-11-07 Updated: 2015-11-18 Resolved: 2015-11-18
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Optimizer |
| Affects Version/s: | 10.0.16, 10.0.22, 10.0, 10.1 |
| Fix Version/s: | 10.0.23 |
| Type: | Bug | Priority: | Major |
| Reporter: | Jaime Crespo | Assignee: | Sergei Petrunia |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | eits |
| Issue Links: |
| Sprint: | 10.1.9-3 |
| Description |
|
When using use_stat_tables = PREFERABLY and then running ANALYZE TABLE on a table with an indexed column containing UTF-8 characters that MySQL's 'utf8' charset cannot store, some query plans can change, sometimes with horrible performance consequences. What is worse, EXPLAIN does not show the new query plan that results from the broken statistics: it shows the original plan, while after execution we can see from both the query latency and the Handler statistics that a very different plan was actually used. How to reproduce:
CREATE DATABASE zhwiki;

-- Note the charset binary; this allows any character encoding.
INSERT INTO `recentchanges` VALUES (51267205,'20151023001221','',0,'Reinheitsgebot',0,'��鯰科','添加语言链接:[[:specieswiki:Pangasiidae]]',1,1,0,566064,36471314,36471314,5,'wb',0,'',1,'',1254,1254,0,0,NULL,'','a:3:{s:20:\"wikibase-repo-change\";a:14: {s:2:\"id\";i:259555851;s:4:\"type\";s:20:\"wikibase-item~update\";s:4:\"time\";s:14:\"20151023001221\";s:9:\"object_id\";s:7:\"q901080\";s:7:\"user_id\";s:6:\"940976\";s:11:\"revision_id\";s:9:\"261218043\";s:11:\"entity_type\";s:4:\"item\";s:9:\"user_text\";s:14:\"Reinheitsgebot\";s:7:\"page_id\";i:851866;s:9:\"parent_id\";i:255341522;s:7:\"comment\";s:49:\"/* wbsetsitelink-add:1|specieswiki */ Pangasiidae\";s:6:\"rev_id\";i:261218043;s:3:\"bot\";i:1;s:7:\"site_id\";s:12:\"wikidatawiki\";}s:7:\"comment\";s:49:\"添加语言链接:[[:specieswiki:Pangasiidae]]\";s:12:\"comment-html\";s:123:\"添加语言链接:<a class=\"external\" href=\"//www.wikidata.org/wiki/Specieswiki:Pangasiidae\">specieswiki:Pangasiidae</a>\";}');

SET GLOBAL use_stat_tables = PREFERABLY;
-- Reconnect for the change to take effect in your current session.

MariaDB [zhwiki]> ANALYZE TABLE recentchanges;
We can check that, effectively, the tables have the wrong encoding: the CREATE TABLE `column_stats` definition declares the value columns with a utf8 charset. With 1.4 million rows, a simple query (run after FLUSH STATUS to reset the Handler counters) goes from reading 3,000 rows to reading the full table, a full index scan over the timestamp index of all 1.4 million rows. As a workaround, disable use_stat_tables. Three things went wrong here:
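The regression described above can be confirmed from the Handler counters rather than from EXPLAIN. A minimal sketch; the query, the `rc_timestamp` column, and the counter values are illustrative assumptions, not part of the report:

```sql
FLUSH STATUS;

-- Hypothetical range query over the timestamp index.
SELECT COUNT(*) FROM recentchanges
WHERE rc_timestamp >= '20151023000000';

-- With correct statistics this reads on the order of 3000 rows;
-- with truncated min/max values it degrades to a full index scan.
SHOW SESSION STATUS LIKE 'Handler_read%';

-- Workaround: stop the optimizer from reading the EITS tables.
SET GLOBAL use_stat_tables = NEVER;
```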
|
| Comments |
| Comment by Daniel Black [ 2015-11-08 ] |
|
Agree. It's an annoyance.
| Comment by Sergei Petrunia [ 2015-11-09 ] |
Agree. They should be VARBINARY, since the binary pseudo-charset is the only one that can store arbitrary byte sequences.
This is a bit more complex, as currently the tables in mysql.* are mostly MyISAM tables.
Tried with a smaller example, and I can observe the problem in EXPLAIN, too. The optimizer uses EITS data depending on the use_stat_tables / optimizer_use_condition_selectivity settings, and it does so in the same way for EXPLAIN and for regular query execution. Maybe you are hitting a problem where EXPLAIN differs from the actual execution because of ORDER BY ... LIMIT handling? There were such bugs in MySQL/MariaDB. (Do I understand correctly that the recentchanges table is not included in the Wikipedia database dumps?)
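The two settings mentioned above can be toggled per session to compare plans with and without EITS data. A sketch; the query and column name are assumptions, and the exact semantics of each optimizer_use_condition_selectivity level should be checked against the manual:

```sql
-- With EITS data considered by the optimizer:
SET SESSION use_stat_tables = 'PREFERABLY';
SET SESSION optimizer_use_condition_selectivity = 4;  -- also use histogram data
EXPLAIN SELECT * FROM recentchanges WHERE rc_timestamp >= '20151023000000';

-- Without EITS (10.0 server defaults):
SET SESSION use_stat_tables = 'NEVER';
SET SESSION optimizer_use_condition_selectivity = 1;
EXPLAIN SELECT * FROM recentchanges WHERE rc_timestamp >= '20151023000000';
```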
| Comment by Sergei Petrunia [ 2015-11-09 ] |
|
Debugged to see what happens when we fail to save min_value and/or max_value. If we fail to save both, the server assumes that both min_value and max_value are empty, which makes any predicate look highly selective. The reason is that in get_column_range_cardinality() the computed position of a range endpoint comes out as either 0 or 1, and the same happens for the other endpoint. Without a histogram we then get sel=0, and col_stats->get_avg_frequency() is returned as the selectivity for ANY meaningful range condition. That is, failure to save min_value or max_value in the table causes the optimizer to vastly over-estimate the selectivity of any range condition, which can lead to bad query plans.
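Whether min_value/max_value were actually saved can be checked directly in the stats table. A sketch, assuming the stock 10.0 mysql.column_stats layout:

```sql
-- NULL or empty min/max values here indicate the save failure described above;
-- HEX() makes truncated binary values visible regardless of client charset.
SELECT db_name, table_name, column_name,
       HEX(min_value) AS min_hex,
       HEX(max_value) AS max_hex
FROM mysql.column_stats
WHERE db_name = 'zhwiki' AND table_name = 'recentchanges';
```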
| Comment by Jaime Crespo [ 2015-11-09 ] |
|
I am hot-patching the mysql table, converting it to binary (I know this may cause problems for mysql_upgrade, but I need a workaround now). I wanted it to be utf8(mb4), because a VARCHAR with the BINARY attribute is essentially a VARBINARY, but that doesn't work for me.
I will report any issues I find.
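The hot-patch described above can be sketched as follows; the column names and width are assumptions based on the stock 10.0 mysql.column_stats definition, not the exact statement applied:

```sql
-- Convert the value columns from utf8 VARCHAR to VARBINARY so that any
-- byte sequence (including 4-byte UTF-8 characters) can be stored unmodified.
ALTER TABLE mysql.column_stats
  MODIFY min_value VARBINARY(255) DEFAULT NULL,
  MODIFY max_value VARBINARY(255) DEFAULT NULL;
```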
| Comment by Jaime Crespo [ 2015-11-09 ] |
|
BTW, yes, I can confirm that EXPLAIN showing a different query plan than the one actually executed is unrelated to the engine-independent statistics and is a separate issue.
| Comment by Sergei Petrunia [ 2015-11-09 ] |
|
Some details about how the error happened: we would pass the min/max value, stating it is in the utf8_bin charset/collation, and attempt to store it into a utf8_bin Field object. This doesn't perform any data conversion, but it can cause the data to be truncated. I guess the upgrade procedure in the patch is adequate. If something has been truncated, we can't get it back. We could re-run ANALYZE TABLE for everything, but that could be a long process. My opinion is that a mention in the changelog would be sufficient. (The upgrade doesn't make anything worse; if somebody wants to make things better, they'll need to re-run ANALYZE themselves.)
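Re-running ANALYZE TABLE for everything can be scripted from the stats tables themselves. A sketch, assuming the standard mysql.table_stats layout; each generated statement still has to be executed separately:

```sql
-- Generate one ANALYZE TABLE statement per table that has stored EITS data.
SELECT DISTINCT
  CONCAT('ANALYZE TABLE `', db_name, '`.`', table_name, '`;') AS stmt
FROM mysql.table_stats;
```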
| Comment by Sergei Petrunia [ 2015-11-09 ] |
|
Pushed the fix for
| Comment by Sergei Petrunia [ 2015-11-09 ] |
I think there won't be any problems with mysql_upgrade (the table will just stay the same). I take it the table columns are using the VARBINARY data type? The patch consists of (1) modifications to the mysql.column_stats table and (2) changes in the source code. I have debugged what happens when one applies #1 but not #2: utf8-based charsets and VARBINARY columns in user tables are ok.
Ok. The set of such queries is getting smaller over time, but some of them are still there.
| Comment by Jaime Crespo [ 2015-11-09 ] |
|
I have applied the schema change, and it seems to work well. We use only binary fields for text, for legacy reasons; all encoding is handled on the application side. I will either upgrade to a patched version or patch it myself. I am now more concerned about the full index scan in the query plan, but I need to do more testing. I will try to solve that in application code rather than with a change to the database server.
| Comment by Daniel Black [ 2015-11-12 ] |
|
I can't see any utf8 value that couldn't be converted to VARBINARY, and in any case I'm happy with the change for the gains on all the non-utf8 varchar columns I couldn't use before.
You can insert into the stats table directly, without using ANALYZE TABLE, which means you shouldn't need to patch the server manually. The charset aspects of https://mariadb.atlassian.net/secure/attachment/36708/mdev-7383-statistics-to-handle-as-my_charset_bin.patch relate to the storage done by ANALYZE TABLE rather than to anything on the retrieval side.
Eek, too long/invasive.
Agree.
bq. I know this may have problems on mysql_upgrade...
jcrespo, this shouldn't be the case: the ALTER TABLE is idempotent. jcrespo, agreed on skipping column stats for blob/text columns, per the last bit of discussion there. So, solved as per the linked fix.
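Inserting into the stats table directly, as suggested above, can be sketched like this; the table, column, and every statistic value shown are hypothetical, and the exact column list should be checked against the server's mysql.column_stats definition:

```sql
-- Upsert one row of column statistics without running ANALYZE TABLE.
-- (db_name, table_name, column_name) is assumed to be the primary key.
INSERT INTO mysql.column_stats
  (db_name, table_name, column_name, min_value, max_value,
   nulls_ratio, avg_length, avg_frequency)
VALUES
  ('zhwiki', 'recentchanges', 'rc_title', x'00', x'FF',
   0.0, 25.0, 3.5)
ON DUPLICATE KEY UPDATE
  min_value = VALUES(min_value),
  max_value = VALUES(max_value);

FLUSH TABLES;  -- assumption: statistics are re-read when the table is reopened
```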
| Comment by Jaime Crespo [ 2015-11-12 ] |
|
Yes, I will report any EXPLAIN-related issues separately, if any.
| Comment by Sergei Petrunia [ 2015-11-18 ] |
|
Yes. Closed by fix for