[MDEV-13372] Optimizer creates very bad execution plan for tokudb on random Created: 2017-07-22 Updated: 2018-03-05 Resolved: 2018-03-05 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Optimizer, Storage Engine - TokuDB |
| Affects Version/s: | 10.1.22 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Slawomir Pryczek | Assignee: | Alice Sherepa |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Fedora Core |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Description |
|
Hi Guys, im having huge issue with query optimizer it seems it makes it impossible to finish a very simple query using tokudb, with some basic joins... basically not only this query is affected but it seems it affects all queries on these 2 tables that are using >=2 JOINS
Sample Query:
Now there are 2 execution plans which are made, seems randomly... one is taking about 3 seconds to finish, the second one never finishes... Good Plan: Bad Plan: Now, the main table has about 40 million rows, good plan results AFAIK in picking about 125k rows by DATE index, and then simple join with lookup tables (using PRI KEY on IDs) so numbers are converted to string. Very simple i treat this second table as dictionary... query takes about 2-4 seconds... For bad plan not sure what it's trying to do as i read it's trying to use index of table used only for group by as covering index... basically due to this strategy it now stars to run for 2'nd hour, with "Queried about 1170790000 rows" When i remove the second join (it always works good... attached as query1-join). Any idea how can i fix that or debug, so it'll always "start" with the main table filter by date and do ref-joins? Thanks |
| Comments |
| Comment by Alice Sherepa [ 2017-09-19 ] | |||||||
|
Please attach your cnf file(s) and
| |||||||
| Comment by Slawomir Pryczek [ 2017-09-19 ] | |||||||
|
Hey Alice, sending you SHOW CREATE...
Basically I noticed that for toku sometimes for tables with millions of rows the row counter information schema is 0. Of course count If the row count is 0, it basically always produces bad execution plan. If row count is more or less about correct, then it produces incorrect execution plan randomly. For now we're switching all queries touching larger toku tables to STRAIGHT_JOIN to mitigate this, for toku... basically same tables were on aria/myisam before and it always worked correctly, so it might be some issue with tokudb statistical data which is used by optimizer to make a query plan... Thanks, | |||||||
| Comment by Phil Sweeney [ 2017-09-20 ] | |||||||
|
Numerous row counting issues were introduced to TokuDB when they switched to physical row counts a while back (last year?). Even after row counts are corrected with manual commands, they can end up at zero again due to bugs. Some info on recounting the rows here: https://www.percona.com/doc/percona-server/LATEST/tokudb/tokudb_background_analyze_table.html A number of bug fixes have resulted over a long period of time.. some even in the most recent releases of MariaDB (the most recent fixes are especially relevant if you have any INSERT ON DUPLICATE KEY). Suggest you try on newer versions. I will say we've had some index choice regressions from 10.1.24 onwards though which is preventing us from upgrading to get the row count fixes.. YMMV! | |||||||
| Comment by Slawomir Pryczek [ 2017-09-21 ] | |||||||
|
Hi Phil, thanks for the comment .. yes it has to be it, probably also there's some issue with cardinality i think (as sometimes when the row count is >0 and close to actual number, the query plan is still wrong). And we're almost always doing inserts with ON DUPLICATE. Strange thing is that after some time the query was executed ok ( run same query eg. 4 times, break it 4 times for 5'th time it worked), and then the plan was working "ok" for some time, and after some more - got "broken" again. Also a friend of mine was having same issues with his servers, basically he'd just drop the table and re-create it, then it started to work correctly for some time and then it was "broken" again. We'll probably try 10.2, and for now stick to STRAIGHT_JOIN because with query times changing from 2 ms to 2 days, it's too dangerous to rely on optimizer for us. Do you maybe know if background re-counting helps and the count drops gradually, or these bugs can make it to be incorrectly set to 0 instantly, so there's no point in doing it? | |||||||
| Comment by Phil Sweeney [ 2017-09-22 ] | |||||||
|
Yes, recounting will help, but the row count will slowly decrease until it hits zero again. Sometimes I found it didn't affect query performance until you restarted the server, then it seemed to 'pick up' the fact that the row count was zero and go crazy (I found this especially on a slave that restarts automatically for snapshotting). | |||||||
| Comment by Elena Stepanova [ 2017-09-30 ] | |||||||
|
philsweeney wrote:
philsweeney, Whoever has information about existence and status of upstream TokuDB bugs, please do share. | |||||||
| Comment by Phil Sweeney [ 2017-09-30 ] | |||||||
|
elenst Yes, the row count bugs are mostly fixed (note there's another one around hot created indexes breaking the row count, fix not yet released by Percona: https://jira.percona.com/browse/TDB-35) However, there is another regression introduced in MariaDB 10.1.24 where index selection goes wonky. I was hoping that TDB-35 would fix it, but when they ended up saying it was related to newly hot created indexes I realised I have a separate regression (not related to row counts - my row counts are fine!). I'll add some more commentary to MDEV-13783 - hoping you can contribute your skills to helping track down what is going on! | |||||||
| Comment by Elena Stepanova [ 2017-10-01 ] | |||||||
|
Has your variation of the issue (the one where row counts are fine, but the plan is still bad) been filed anywhere? | |||||||
| Comment by Phil Sweeney [ 2017-10-02 ] | |||||||
|
elenst I ended up commenting on MDEV-13347 instead. I haven't filed a separate issue because I've had trouble getting into a small enough query without 5 joins to be of use. Will let you know if I get more time to look at it. | |||||||
| Comment by Alice Sherepa [ 2018-01-18 ] | |||||||
|
Slawomir Pryczek, I can not reproduce it so far. | |||||||
| Comment by Slawomir Pryczek [ 2018-03-03 ] | |||||||
|
Hi Elena, sorry just saw the message. For now i can't remember exactly nor reproduce the issue as it was random, we have updated already to 10.2 and we're actually using STRAIGHT_JOIN to be sure everything works as expected. It was for sure related to rowcount being 0 as i remember i was able to fix some of these issues by recounting rows. Not sure about the count being exactly 0, might be that the count was just incorrect (but it's related to this "class" of bugs that makes rowcount incorrect for toku, i think). It might also be possible that after server update we did - rowcount was somehow "damaged"... For new queries will try to use normal joins and if anything bad happens will report again with some better description. It's unfortunately not easy to reproduce. The server could be running ok for couple of days then one query started to generate bad plan often (like 20-50% of the time) and everything got stuck. Also running same query on replica worked good, if that helps in anything. And AFAIR i never had this issue on anything else than toku... |