[MDEV-13372] Optimizer creates very bad execution plan for tokudb on random Created: 2017-07-22  Updated: 2018-03-05  Resolved: 2018-03-05

Status: Closed
Project: MariaDB Server
Component/s: Optimizer, Storage Engine - TokuDB
Affects Version/s: 10.1.22
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Slawomir Pryczek Assignee: Alice Sherepa
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Fedora Core


Attachments: PNG File bad-plan.png     PNG File good-plan.png     PNG File query-1join.png    
Issue Links:
Relates
relates to MDEV-13347 MariaDB is not picking up encompassin... Confirmed
relates to MDEV-13783 Index optimiser regression in 10.1.26... Open

 Description   

Hi guys, I'm having a huge issue with the query optimizer: it makes it impossible to finish a very simple query on TokuDB with some basic joins. Not only this query is affected; it seems to affect all queries on these two tables that use two or more JOINs
(I'm only joining between these two tables).

CREATE TABLE `stats_traffic_value` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `values` varchar(2048) NOT NULL,
  `_last_used` date NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `values` (`values`)
) ENGINE=TokuDB AUTO_INCREMENT=4249003 DEFAULT CHARSET=ascii;

CREATE TABLE `stats_traffic` (
  `date` date NOT NULL,
  `network_id` int(11) NOT NULL,
  `publisher_id` varchar(255) NOT NULL,
  `zone` varchar(255) NOT NULL,
  `language` char(2) NOT NULL,
  `country` char(2) NOT NULL,
  `stv_domain` int(11) NOT NULL,
  `stv_tag` int(11) NOT NULL,
  `stv_tag_secondary` int(11) NOT NULL,
  `stv_gender` int(11) NOT NULL,
  `stv_looking_for` int(11) NOT NULL,
  `device` varchar(20) NOT NULL,
  `_sandbox` varchar(255) NOT NULL,
  `impressions` int(11) NOT NULL DEFAULT '0',
  `unique` int(11) NOT NULL,
  `free_impressions` int(11) NOT NULL DEFAULT '0',
  `free_unique` int(11) NOT NULL,
  `clicks` int(11) NOT NULL,
  `clicks_total` int(11) NOT NULL,
  `conversions` int(11) NOT NULL,
  `conversion_value` decimal(10,3) NOT NULL,
  `spending` decimal(10,4) NOT NULL,
  `earnings` decimal(12,6) NOT NULL,
  `_key` binary(27) NOT NULL,
  UNIQUE KEY `_key` (`date`,`_key`) USING HASH
) ENGINE=TokuDB DEFAULT CHARSET=ascii
/*!50100 PARTITION BY HASH (TO_DAYS(date) DIV 20)
PARTITIONS 5 */;

Sample Query:

SELECT
  `tf`.`network_id` AS `network`,
  `tf`.`publisher_id`,
  `tf`.`country`,
  `tf`.`zone`,
  `tv2`.`value` AS `tag_secondary`,
  `tv`.`value` AS `tag`,
  `tf`.`device` AS `device`,
  `tf`.`_sandbox` AS `sandbox`,
  SUM(`impressions`) AS `impressions`,
  SUM(`free_impressions`) AS `free_impressions`
FROM `pstats`.`stats_traffic` `tf`
INNER JOIN `pstats`.`stats_traffic_value_lookup` `tv`
  ON `tv`.`stats_traffic_value_id` = `tf`.`stv_tag`
INNER JOIN `pstats`.`stats_traffic_value_lookup` `tv2`
  ON `tv2`.`stats_traffic_value_id` = `tf`.`stv_tag_secondary`
WHERE 1=1 AND `date` = '2017-07-21' AND `tf`.`_sandbox` = ''
  AND `tv`.`value` = 'gay' AND `country` = 'us' AND `tf`.`network_id` = 51
GROUP BY `publisher_id`, `zone`, `device`, `_sandbox`,
  `stv_gender`, `tv2`.`value`, `tv`.`value`, `stv_looking_for`

Now there are two execution plans being produced, seemingly at random: one finishes in about 3 seconds, the other never finishes...

Good Plan:
https://www.screencast.com/t/rqBCtKYIn

Bad Plan:
https://www.screencast.com/t/9Mfh9EOk

Now, the main table has about 40 million rows. As far as I can tell, the good plan picks about 125k rows via the DATE index and then does a simple join with the lookup tables (using the primary key on the IDs) so the numbers are converted to strings. Very simple; I treat this second table as a dictionary. The query takes about 2-4 seconds.

For the bad plan I'm not sure what it's trying to do; as I read it, it tries to use an index of a table that is needed only for the GROUP BY as a covering index. Due to this strategy the query is now running into its second hour, with "Queried about 1170790000 rows".

When I remove the second join, it always works well (attached as query-1join). Any idea how I can fix or debug this, so it always "starts" with the main table filtered by date and does ref joins? Thanks
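The join-order workaround the thread converges on can be sketched with STRAIGHT_JOIN, which in MariaDB/MySQL forces tables to be joined in the order they appear in the FROM clause; a hedged sketch against the schema above (same tables and columns as the sample query):

```sql
SELECT STRAIGHT_JOIN
  `tf`.`network_id` AS `network`, `tf`.`publisher_id`, `tf`.`country`, `tf`.`zone`,
  `tv2`.`value` AS `tag_secondary`, `tv`.`value` AS `tag`,
  `tf`.`device`, `tf`.`_sandbox` AS `sandbox`,
  SUM(`impressions`) AS `impressions`,
  SUM(`free_impressions`) AS `free_impressions`
-- stats_traffic is listed first, so it is read first (filtered by date),
-- and the lookup table is always ref-joined afterwards:
FROM `pstats`.`stats_traffic` `tf`
INNER JOIN `pstats`.`stats_traffic_value_lookup` `tv`
  ON `tv`.`stats_traffic_value_id` = `tf`.`stv_tag`
INNER JOIN `pstats`.`stats_traffic_value_lookup` `tv2`
  ON `tv2`.`stats_traffic_value_id` = `tf`.`stv_tag_secondary`
WHERE `date` = '2017-07-21' AND `tf`.`_sandbox` = '' AND `tv`.`value` = 'gay'
  AND `country` = 'us' AND `tf`.`network_id` = 51
GROUP BY `publisher_id`, `zone`, `device`, `_sandbox`,
  `stv_gender`, `tv2`.`value`, `tv`.`value`, `stv_looking_for`;
```

The trade-off: STRAIGHT_JOIN disables join reordering entirely, so the plan stays stable even with broken statistics, but it also cannot improve if the data distribution changes.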



 Comments   
Comment by Alice Sherepa [ 2017-09-19 ]

Please attach your cnf file(s) and

SHOW CREATE TABLE `stats_traffic_value_lookup`;

Comment by Slawomir Pryczek [ 2017-09-19 ]

Hey Alice, sending you SHOW CREATE...

CREATE TABLE `stats_traffic_value_lookup` (
  `value` varchar(255) NOT NULL,
  `stats_traffic_value_id` int(11) NOT NULL,
  `_last_used` date NOT NULL,
  PRIMARY KEY (`stats_traffic_value_id`,`value`),
  KEY `value` (`value`)
) ENGINE=TokuDB DEFAULT CHARSET=ascii

Basically I noticed that with TokuDB, tables with millions of rows sometimes show a row count of 0 in the information schema. Of course COUNT(*) gives correct results, and after running ANALYZE TABLE with the options to update the count it becomes OK again.

If the row count is 0, it basically always produces a bad execution plan. If the row count is more or less correct, it produces an incorrect execution plan randomly. For now we're switching all queries touching larger TokuDB tables to STRAIGHT_JOIN to mitigate this. The same tables were on Aria/MyISAM before and always worked correctly, so it might be some issue with the TokuDB statistics the optimizer uses to build the query plan...
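The check-and-recount cycle described above can be sketched as follows (a hedged sketch: `tokudb_analyze_mode` and its `TOKUDB_ANALYZE_RECOUNT_ROWS` value are Percona TokuDB variables and may not be available in every TokuDB build):

```sql
-- See the row count the optimizer uses; 0 here correlates with the bad plan:
SELECT table_rows
FROM information_schema.tables
WHERE table_schema = 'pstats' AND table_name = 'stats_traffic';

-- Trigger a physical recount of the rows (Percona TokuDB variable names):
SET SESSION tokudb_analyze_mode = TOKUDB_ANALYZE_RECOUNT_ROWS;
ANALYZE TABLE `pstats`.`stats_traffic`;
```

Note that `information_schema.tables.table_rows` is an estimate for most engines; for TokuDB it reflects the logical row count that the bugs discussed below can drive to zero.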

Thanks,
Slawomir.

Comment by Phil Sweeney [ 2017-09-20 ]

Numerous row counting issues were introduced to TokuDB when they switched to physical row counts a while back (last year?). Even after row counts are corrected with manual commands, they can end up at zero again due to bugs.

Some info on recounting the rows here: https://www.percona.com/doc/percona-server/LATEST/tokudb/tokudb_background_analyze_table.html

A number of bug fixes have resulted over a long period of time, some even in the most recent releases of MariaDB (the most recent fixes are especially relevant if you use any INSERT ... ON DUPLICATE KEY). I suggest you try newer versions. I will say we've had some index choice regressions from 10.1.24 onwards, though, which is preventing us from upgrading to get the row count fixes... YMMV!
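The background recounting described in the linked Percona page is driven by a few `tokudb_analyze_*` variables; a hedged sketch of enabling it (variable names taken from Percona's documentation; availability and defaults depend on the TokuDB build):

```sql
-- Run ANALYZE work in a background thread instead of blocking the session:
SET GLOBAL tokudb_analyze_in_background = ON;

-- Automatically re-analyze a table once roughly this percentage of its
-- rows has changed since the last analysis (0 disables auto analysis):
SET GLOBAL tokudb_auto_analyze = 30;
```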

Comment by Slawomir Pryczek [ 2017-09-21 ]

Hi Phil, thanks for the comment. Yes, that has to be it; there's probably also some issue with cardinality, I think (as sometimes when the row count is >0 and close to the actual number, the query plan is still wrong). And we're almost always doing inserts with ON DUPLICATE. The strange thing is that after some time the query executed OK (run the same query e.g. 4 times, break it 4 times, and on the 5th time it worked), and then the plan worked "OK" for some time, and after some more it got "broken" again.

Also, a friend of mine was having the same issues with his servers; basically he'd just drop the table and re-create it, then it started to work correctly for some time, and then it was "broken" again. We'll probably try 10.2, and for now stick to STRAIGHT_JOIN, because with query times changing from 2 ms to 2 days it's too dangerous for us to rely on the optimizer.

Do you maybe know whether background re-counting helps and the count drops gradually, or whether these bugs can set it to 0 instantly, so there's no point in doing it?

Comment by Phil Sweeney [ 2017-09-22 ]

Yes, recounting will help, but the row count will slowly decrease until it hits zero again. Sometimes I found it didn't affect query performance until you restarted the server; then it seemed to 'pick up' the fact that the row count was zero and go crazy (I found this especially on a slave that restarts automatically for snapshotting).

Comment by Elena Stepanova [ 2017-09-30 ]

philsweeney wrote:

Numerous row counting issues were introduced to TokuDB when they switched to physical row counts a while back (last year?). Even after row counts are corrected with manual commands, they can end up at zero again due to bugs.

philsweeney,
I've found your bug report TDB-2. Did you file or are you aware of any other related ones?
TDB-2 is supposed to be fixed in 5.6.36-82.1, which is included into 10.1.26. However, we are still getting similar complaints, even about 10.1.26, e.g. MDEV-13783 . But then, you said that there were numerous issues, so maybe there are other reports which are still open?

Whoever has information about existence and status of upstream TokuDB bugs, please do share.

Comment by Phil Sweeney [ 2017-09-30 ]

elenst Yes, the row count bugs are mostly fixed (note there's another one around hot created indexes breaking the row count, fix not yet released by Percona: https://jira.percona.com/browse/TDB-35)

However, there is another regression introduced in MariaDB 10.1.24 where index selection goes wonky. I was hoping that TDB-35 would fix it, but when they ended up saying it was related to newly hot created indexes I realised I have a separate regression (not related to row counts - my row counts are fine!).

I'll add some more commentary to MDEV-13783 - hoping you can contribute your skills to helping track down what is going on!

Comment by Elena Stepanova [ 2017-10-01 ]

philsweeney,

Has your variation of the issue (the one where row counts are fine, but the plan is still bad) been filed anywhere?
I've seen your comments on MDEV-13347, but not on MDEV-13783.

Comment by Phil Sweeney [ 2017-10-02 ]

elenst I ended up commenting on MDEV-13347 instead. I haven't filed a separate issue because I've had trouble reducing it to a query small enough (without 5 joins) to be of use. Will let you know if I get more time to look at it.

Comment by Alice Sherepa [ 2018-01-18 ]

Slawomir Pryczek, I cannot reproduce it so far.
In the table stats_traffic you have a column named "unique"; I would have expected it to produce an error while creating the table (I just renamed it).
I get only the good query execution plan every time. Maybe you found what provokes the bad plan and how to repeat it? Please also provide your .cnf file(s).

Comment by Slawomir Pryczek [ 2018-03-03 ]

Hi Elena, sorry, I just saw the message. I can't remember exactly nor reproduce the issue now, as it was random; we have already updated to 10.2, and we're actually using STRAIGHT_JOIN to be sure everything works as expected. It was for sure related to the row count being 0, as I remember being able to fix some of these issues by recounting rows. I'm not sure the count was exactly 0; it might be that the count was just incorrect (but it's related to this "class" of bugs that makes the row count incorrect for TokuDB, I think). It might also be possible that the row count was somehow "damaged" after the server update we did...

For new queries I will try to use normal joins, and if anything bad happens I will report again with a better description. It's unfortunately not easy to reproduce: the server could be running OK for a couple of days, then one query started generating the bad plan often (like 20-50% of the time) and everything got stuck.

Also, running the same query on a replica worked fine, if that helps. And as far as I remember, I never had this issue with anything other than TokuDB...

Generated at Thu Feb 08 08:05:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.