Hi Sergei Petrunia!
1)roberto spadim, UNIONs can be parallelized but I think it's not a high priority.
yeap, but for example:
(SELECT 1) UNION (SELECT 2) UNION (SELECT 3) LIMIT 1234
SELECT 1, SELECT 2 and SELECT 3, can run each one in one thread
the UNION can run as 'thread controller' to SELECTS 1,2,3 if thread controller know that enought data was received it could stop selects
inside SELECT 1,2,3
we could execute more than 4 threads (must check what each SELECT do)
the doubt here is, how to know what SELECT could/should create more threads
think about a big union with 10 SELECTS with MRR, we could have more than 50 threads?! just to expose some numbers that we should take care
i agree 100% with you, i'm just reporting some scenarios that we should take care =]
2)" if the first one got the total limit result we should stop all queries and discart rows from others queries" - I think, it is fairly rare that a query needs just LIMIT n rows, no matter which rows (please share your experience.. do you agree with this?).
some times... it's a aplication problem, but today what guys do is: instead of a big (select) UNION (select) union (select) limit 1234, at client side it execute the first query, if total rows<1234, execute the second, if total rows<1234 .... could be nice executing it at server side, with a option to execute the 'faster' (more optimized, lower cost) query first
for example (select big select) union (select small select) limit 10, we could execute the small select first, and big after if the mysql user report that order isn't a problem
2.1)Typically it is "ORDER BY ... LIMIT n", where the first N rows are needed, according to some criteria.
yes, i agree, the criteria normally is at mysql user, not mysql database
2.2)In that case, getting N rows from one part of the union still means you need to get N rows from the other part and then pick the N rows with the least order by criteria.
yeap, that's the 'standard' way to solve this query, but we could execute UNION in a 'optimized' order, first queries with lower cost, and after with higher cost, here an optimizer switch or something like it must be exposed to server/client/optimizer to change union order
2.3) sorry many post editions, i forgot to explain why union and execution order is important
ok some kinds of query with limit execute with a order, but some union are created with client side programmer knowledge about data/strutucre, but, when data go very big, programmer know that any union could/should be big, here the point is, what's the biggest select to union? he can only know if he have access to optimizer (explain) or to data statistics, that's the point to be optimized by server side, not client side
an example:
(select where some_index LIKE "%some thing%") UNION (select where full text match something) LIMIT 10
what is the more expensive query? i think only optimizer/server side could report, and yes i don't care about order, maybe i could rewrite as
(select where some_index LIKE "%some thing%") UNION (select where full text match something) ORDER BY NULL LIMIT 10
or something like it just to explain that this query don't need a order at "union" execution
—
i don't know if (2) could/should be done in this mdev (it's not the union optimization mdev), but it expose a "problem", how many thread could/should be created with union queries? should we execute all unions in parallel or should execute each union one by one?
Shard-query is able to run a query on multiple CPUs if the queried tables are partitioned. There is some data about how much this brings: http://www.mysqlperformanceblog.com/2014/05/01/parallel-query-mysql-shard-query/ (this is probably not the only piece of data).
Shard-query has been around for a long time, but didn't get much momentum for some reason. (An "obvious" technical explanation is that people don't want to partition their tables. But why wouldn't they?)