[MDEV-6096] Ideas about parallel query execution - Jira

Sergei Petrunia added a comment - 2014-05-14 16:40

Shard-query is able to run a query on multiple CPUs if the queried tables are partitioned. There is some data about how much this brings: http://www.mysqlperformanceblog.com/2014/05/01/parallel-query-mysql-shard-query/ (this is probably not the only piece of data).

Shard-query has been around for a long time, but didn't get much momentum for some reason. (An "obvious" technical explanation is that people don't want to partition their tables. But why wouldn't they?)

Sergei Petrunia added a comment - 2014-05-14 16:40 Shard-query is able to run a query on multiple CPUs if the queried tables are partitioned. There is some data about how much this brings: http://www.mysqlperformanceblog.com/2014/05/01/parallel-query-mysql-shard-query/ (this is probably not the only piece of data). Shard-query has been around for a long time, but didn't get much momentum for some reason. (An "obvious" technical explanation is that people don't want to partition their tables. But why wouldn't they?)

roberto spadim added a comment - 2014-07-30 06:08 - edited

spider engine can execute parallel query? the "shard feature"

roberto spadim added a comment - 2014-07-30 06:08 - edited spider engine can execute parallel query? the "shard feature"

Sergei Petrunia added a comment - 2014-08-02 00:22

Yes, it can. It has some limitations, though:

it handles only certain kinds of aggregates.
you need to set up sharding (put your data into shards, etc) before you can use spider. This requires a lot of effort, not everyone is willing to shard their database before they could run a few queries in parallel. It is much easier to get people to try a solution that doesn't require re-loading all of their data before they can try it.

On the other hand, it would be stupid to re-implement features that are already implemented in Spider.

Sergei Petrunia added a comment - 2014-08-02 00:22 Yes, it can. It has some limitations, though: it handles only certain kinds of aggregates. you need to set up sharding (put your data into shards, etc) before you can use spider. This requires a lot of effort, not everyone is willing to shard their database before they could run a few queries in parallel. It is much easier to get people to try a solution that doesn't require re-loading all of their data before they can try it. On the other hand, it would be stupid to re-implement features that are already implemented in Spider.

roberto spadim added a comment - 2014-08-02 00:47

nice, any idea how to "throttle" if query will execute with many threads or not?

roberto spadim added a comment - 2014-08-02 00:47 nice, any idea how to "throttle" if query will execute with many threads or not?

Sergei Petrunia added a comment - 2014-10-16 19:36

Parallel query execution.

Splitting the query across multiple threads requires

R1. Enlisting the worker threads
R2. Dividing the work into work units.
R3. Executing work units in parallel
R4. Collecting the results of work back together

== R1. Enlisting the worker threads ==

Need to decide on how many threads to use

Need to create or get the threads (can either create or get them from the
thread pool).

== R2. Dividing the work into work units ==

We need to split the work into work units somehow. The most basic task is to
run a table scan with N threads. We will have to split the table into
multiple chunks.

== R3. Executing work units in parallel ==

When using InnoDB, there is a problem with doing reads from multiple threads:
they must all use the same transaction. InnoDB doesn't let one use transaction in multiple threads AFAIU. For read-only queries this could be sidestepped by starting multiple different transactions which all have the same view of the database.

== R4. Collecting the results of work back together ==

Need a multiple-producer single consumer queue.
Note that table->record[0] format is space-inefficient for e.g. big VARCHARs

Sergei Petrunia added a comment - 2014-10-16 19:36 Parallel query execution. Splitting the query across multiple threads requires R1. Enlisting the worker threads R2. Dividing the work into work units. R3. Executing work units in parallel R4. Collecting the results of work back together == R1. Enlisting the worker threads == Need to decide on how many threads to use Need to create or get the threads (can either create or get them from the thread pool). == R2. Dividing the work into work units == We need to split the work into work units somehow. The most basic task is to run a table scan with N threads. We will have to split the table into multiple chunks. == R3. Executing work units in parallel == When using InnoDB, there is a problem with doing reads from multiple threads: they must all use the same transaction. InnoDB doesn't let one use transaction in multiple threads AFAIU. For read-only queries this could be sidestepped by starting multiple different transactions which all have the same view of the database. == R4. Collecting the results of work back together == Need a multiple-producer single consumer queue. Note that table->record [0] format is space-inefficient for e.g. big VARCHARs

roberto spadim added a comment - 2014-10-16 20:07 - edited

One question, how optimizer will know if a query should/shouldn't execute in parallel?
Table size? Count of tables in query? Server workload? User sql flag (sql_parallel sql_no_parallel)? Optimizer switch? Table engine(partition)? Use of agregate functions in query?

roberto spadim added a comment - 2014-10-16 20:07 - edited One question, how optimizer will know if a query should/shouldn't execute in parallel? Table size? Count of tables in query? Server workload? User sql flag (sql_parallel sql_no_parallel)? Optimizer switch? Table engine(partition)? Use of agregate functions in query?

Sergei Petrunia added a comment - 2014-10-16 23:44

I think, parallel execution should only be attempted when the join is sufficiently big. That is, we run the optimizer, then look at the query plan, and if it is expected to enumerate sufficiently many rows, we try parallel execution.

Sergei Petrunia added a comment - 2014-10-16 23:44 I think, parallel execution should only be attempted when the join is sufficiently big. That is, we run the optimizer, then look at the query plan, and if it is expected to enumerate sufficiently many rows, we try parallel execution.

VAROQUI Stephane added a comment - 2014-10-16 23:59

A temporary Spider table can be created partitioned to the same local table . if you do a range create x partition by range where x is number of thread. The query is run by replacing the original table by the spider table et voila

VAROQUI Stephane added a comment - 2014-10-16 23:59 A temporary Spider table can be created partitioned to the same local table . if you do a range create x partition by range where x is number of thread. The query is run by replacing the original table by the spider table et voila

VAROQUI Stephane added a comment - 2014-10-17 00:01 - edited

If you wan't this to work we first need merge of the // partition scan plugin from Spiral + Additional patches, but using the same idea we can define each partitions on a list of slaves to get // queries to a cluster of nodes

SESSION variable or SQL HINT like SELECT SQL_LOCAL_ANALYTIC if run on slaves SQL_SLAVES_ANALYTIC

For this to work we need partition range condition push down added to spider, that a count from T1 will not return x time records in T1

VAROQUI Stephane added a comment - 2014-10-17 00:01 - edited If you wan't this to work we first need merge of the // partition scan plugin from Spiral + Additional patches, but using the same idea we can define each partitions on a list of slaves to get // queries to a cluster of nodes SESSION variable or SQL HINT like SELECT SQL_LOCAL_ANALYTIC if run on slaves SQL_SLAVES_ANALYTIC For this to work we need partition range condition push down added to spider, that a count from T1 will not return x time records in T1

roberto spadim added a comment - 2014-10-17 09:22 - edited

humm, maybe with UNION selects too?
i have openned a feature, about running UNION query with LIMIT and without order by, maybe should be interesting think about union parallel query too even without many rows, but we can execute each query since they don't affected each other (only at UNION removing duplicates)?

in other words: union/union all, many rows, partitions?, sql hints/optimizer variable
should be interesting something to "create table xyz select..." and "insert into xyz select..." ? i'm commenting here just to don't forget or leave the idea in some usefull place

roberto spadim added a comment - 2014-10-17 09:22 - edited humm, maybe with UNION selects too? i have openned a feature, about running UNION query with LIMIT and without order by, maybe should be interesting think about union parallel query too even without many rows, but we can execute each query since they don't affected each other (only at UNION removing duplicates)? in other words: union/union all, many rows, partitions?, sql hints/optimizer variable should be interesting something to "create table xyz select..." and "insert into xyz select..." ? i'm commenting here just to don't forget or leave the idea in some usefull place

Federico Razzoli added a comment - 2014-10-17 15:44

Wouldn't this be useful for MRR queries, when many rows are expected to be read?

Federico Razzoli added a comment - 2014-10-17 15:44 Wouldn't this be useful for MRR queries, when many rows are expected to be read?

roberto spadim added a comment - 2014-10-18 07:23

Here a example of a query that should (and shouldn't?) be executed in parallel:
MDEV-6696

how to know if we should execute in parallel or not (what the condition) when we have a union+limit, with possible fast queries + possible biggers/slow queries? should we divide the union in many threads and start all queries? if the first one got the total limit result we should stop all queries and discart rows from others queries? in this case if we execute fast queries AND slow queries at same time, maybe the fast query could be slower (cause we have more iops at disks, that's something that we could better tune), it's something a bit complex to select if we should/shouldn't execute in parallel, isn't?

in any case, a parallel execution (even with a optimizer switch) is a nice feature to have

roberto spadim added a comment - 2014-10-18 07:23 Here a example of a query that should (and shouldn't?) be executed in parallel: MDEV-6696 how to know if we should execute in parallel or not (what the condition) when we have a union+limit, with possible fast queries + possible biggers/slow queries? should we divide the union in many threads and start all queries? if the first one got the total limit result we should stop all queries and discart rows from others queries? in this case if we execute fast queries AND slow queries at same time, maybe the fast query could be slower (cause we have more iops at disks, that's something that we could better tune), it's something a bit complex to select if we should/shouldn't execute in parallel, isn't? in any case, a parallel execution (even with a optimizer switch) is a nice feature to have

Sergei Petrunia added a comment - 2014-10-22 12:41

stephanevaroqui: I am concerned about putting the requirement that the data is partitioned. If one needs to partition their data before they can run a parallel scan, that puts a huge barrier to entry. What about real-world databases? Is it often the case that big tables are already partitioned in a way that allows to do parallel scans?
f_razzoli, queries that use MRR (ie. BKA) are big queries, so they are candidates for being run in parallel. I am not yet sure how this should be done, though.

rspadim, UNIONs can be parallelized but I think it's not a high priority.
" if the first one got the total limit result we should stop all queries and discart rows from others queries" - I think, it is fairly rare that a query needs just LIMIT n rows, no matter which rows (please share your experience.. do you agree with this?). Typically it is "ORDER BY ... LIMIT n", where the first N rows are needed, according to some criteria. In that case, getting N rows from one part of the union still means you need to get N rows from the other part and then pick the N rows with the least order by criteria.

Sergei Petrunia added a comment - 2014-10-22 12:41 stephanevaroqui : I am concerned about putting the requirement that the data is partitioned. If one needs to partition their data before they can run a parallel scan, that puts a huge barrier to entry. What about real-world databases? Is it often the case that big tables are already partitioned in a way that allows to do parallel scans? f_razzoli , queries that use MRR (ie. BKA) are big queries, so they are candidates for being run in parallel. I am not yet sure how this should be done, though. rspadim , UNIONs can be parallelized but I think it's not a high priority. " if the first one got the total limit result we should stop all queries and discart rows from others queries" - I think, it is fairly rare that a query needs just LIMIT n rows, no matter which rows (please share your experience.. do you agree with this?). Typically it is "ORDER BY ... LIMIT n", where the first N rows are needed, according to some criteria. In that case, getting N rows from one part of the union still means you need to get N rows from the other part and then pick the N rows with the least order by criteria.

Sergei Petrunia added a comment - 2014-10-22 12:45 - edited

Need to look at Spider SE.

Results of discussion with igor: code-wise, before getting something to run, we need to learn:

Implement Item::clone for sufficient number of Items so that we can clone the WHERE clause
Make a feature to "co-open tables": we need TABLE* and handler* objects that are usable from secondary threads.
Implement (or reuse) something that allows to transfer results from the secondary threads back to the coordinator thread.

Sergei Petrunia added a comment - 2014-10-22 12:45 - edited Need to look at Spider SE. Results of discussion with igor : code-wise, before getting something to run, we need to learn: Implement Item::clone for sufficient number of Items so that we can clone the WHERE clause Make a feature to "co-open tables": we need TABLE* and handler* objects that are usable from secondary threads. Implement (or reuse) something that allows to transfer results from the secondary threads back to the coordinator thread.

roberto spadim added a comment - 2014-10-22 19:28 - edited

Hi Sergei Petrunia!
1)roberto spadim, UNIONs can be parallelized but I think it's not a high priority.

yeap, but for example:
(SELECT 1) UNION (SELECT 2) UNION (SELECT 3) LIMIT 1234
SELECT 1, SELECT 2 and SELECT 3, can run each one in one thread
the UNION can run as 'thread controller' to SELECTS 1,2,3 if thread controller know that enought data was received it could stop selects
inside SELECT 1,2,3
we could execute more than 4 threads (must check what each SELECT do)
the doubt here is, how to know what SELECT could/should create more threads
think about a big union with 10 SELECTS with MRR, we could have more than 50 threads?! just to expose some numbers that we should take care
i agree 100% with you, i'm just reporting some scenarios that we should take care =]

2)" if the first one got the total limit result we should stop all queries and discart rows from others queries" - I think, it is fairly rare that a query needs just LIMIT n rows, no matter which rows (please share your experience.. do you agree with this?).

some times... it's a aplication problem, but today what guys do is: instead of a big (select) UNION (select) union (select) limit 1234, at client side it execute the first query, if total rows<1234, execute the second, if total rows<1234 .... could be nice executing it at server side, with a option to execute the 'faster' (more optimized, lower cost) query first
for example (select big select) union (select small select) limit 10, we could execute the small select first, and big after if the mysql user report that order isn't a problem

2.1)Typically it is "ORDER BY ... LIMIT n", where the first N rows are needed, according to some criteria.

yes, i agree, the criteria normally is at mysql user, not mysql database

2.2)In that case, getting N rows from one part of the union still means you need to get N rows from the other part and then pick the N rows with the least order by criteria.

yeap, that's the 'standard' way to solve this query, but we could execute UNION in a 'optimized' order, first queries with lower cost, and after with higher cost, here an optimizer switch or something like it must be exposed to server/client/optimizer to change union order

2.3) sorry many post editions, i forgot to explain why union and execution order is important
ok some kinds of query with limit execute with a order, but some union are created with client side programmer knowledge about data/strutucre, but, when data go very big, programmer know that any union could/should be big, here the point is, what's the biggest select to union? he can only know if he have access to optimizer (explain) or to data statistics, that's the point to be optimized by server side, not client side

an example:

(select where some_index LIKE "%some thing%") UNION (select where full text match something) LIMIT 10

what is the more expensive query? i think only optimizer/server side could report, and yes i don't care about order, maybe i could rewrite as

(select where some_index LIKE "%some thing%") UNION (select where full text match something) ORDER BY NULL LIMIT 10

or something like it just to explain that this query don't need a order at "union" execution

—
i don't know if (2) could/should be done in this mdev (it's not the union optimization mdev), but it expose a "problem", how many thread could/should be created with union queries? should we execute all unions in parallel or should execute each union one by one?

roberto spadim added a comment - 2014-10-22 19:28 - edited Hi Sergei Petrunia! 1)roberto spadim, UNIONs can be parallelized but I think it's not a high priority. yeap, but for example: (SELECT 1) UNION (SELECT 2) UNION (SELECT 3) LIMIT 1234 SELECT 1, SELECT 2 and SELECT 3, can run each one in one thread the UNION can run as 'thread controller' to SELECTS 1,2,3 if thread controller know that enought data was received it could stop selects inside SELECT 1,2,3 we could execute more than 4 threads (must check what each SELECT do) the doubt here is, how to know what SELECT could/should create more threads think about a big union with 10 SELECTS with MRR, we could have more than 50 threads?! just to expose some numbers that we should take care i agree 100% with you, i'm just reporting some scenarios that we should take care =] 2)" if the first one got the total limit result we should stop all queries and discart rows from others queries" - I think, it is fairly rare that a query needs just LIMIT n rows, no matter which rows (please share your experience.. do you agree with this?). some times... it's a aplication problem, but today what guys do is: instead of a big (select) UNION (select) union (select) limit 1234, at client side it execute the first query, if total rows<1234, execute the second, if total rows<1234 .... could be nice executing it at server side, with a option to execute the 'faster' (more optimized, lower cost) query first for example (select big select) union (select small select) limit 10, we could execute the small select first, and big after if the mysql user report that order isn't a problem 2.1)Typically it is "ORDER BY ... LIMIT n", where the first N rows are needed, according to some criteria. yes, i agree, the criteria normally is at mysql user, not mysql database 2.2)In that case, getting N rows from one part of the union still means you need to get N rows from the other part and then pick the N rows with the least order by criteria. yeap, that's the 'standard' way to solve this query, but we could execute UNION in a 'optimized' order, first queries with lower cost, and after with higher cost, here an optimizer switch or something like it must be exposed to server/client/optimizer to change union order 2.3) sorry many post editions, i forgot to explain why union and execution order is important ok some kinds of query with limit execute with a order, but some union are created with client side programmer knowledge about data/strutucre, but, when data go very big, programmer know that any union could/should be big, here the point is, what's the biggest select to union? he can only know if he have access to optimizer (explain) or to data statistics, that's the point to be optimized by server side, not client side an example: (select where some_index LIKE "%some thing%") UNION (select where full text match something) LIMIT 10 what is the more expensive query? i think only optimizer/server side could report, and yes i don't care about order, maybe i could rewrite as (select where some_index LIKE "%some thing%") UNION (select where full text match something) ORDER BY NULL LIMIT 10 or something like it just to explain that this query don't need a order at "union" execution — i don't know if (2) could/should be done in this mdev (it's not the union optimization mdev), but it expose a "problem", how many thread could/should be created with union queries? should we execute all unions in parallel or should execute each union one by one?

VAROQUI Stephane added a comment - 2014-10-22 20:25 - edited

In my experience 99,999% of workload are using order by limit and many have issues with such queries, most client stop using RDBMS for processing that , they are now using external indexers sphinx, solr, elastic search not for full text but for this. This is the hart of every back office presenting the data in a single page based on users input regarding filter and sorting.

Partitioning is used yes, for most big tables, because BTree stop performing when the size of a table stop being in memory. I would say that many workload does not use covering index , and so all secondary index reference the clustered table, making such tables hot most of the time.

There is more reasons to use parallel plan , Most storage engine does not have an I/O materialization layer for read , every read from secondary index to a clustered table will be queued in a single execution thread , so on a SAN or a networked FS , 5 ms larency will be added to every page missed , the total latency of such queries can be divide by the concurrency as long have we don't ask more than the total IO read capacity . Prefetching is only used when scanning index or clustered index but not when using both at the same table , or joins to secondary index without MRR. so we have added MRR but InnoDB does exactly the same as before because having this 2 level index scan , from secondary to primary index

Now spider partitioning can be created on the flight with a temporary table. to create a virtual partitioning based on the QP . What need to be added is partition condition push down to every node in the reduce thread. Like this every thread can work on different range of the same table by heritage of the range conditions

VAROQUI Stephane added a comment - 2014-10-22 20:25 - edited In my experience 99,999% of workload are using order by limit and many have issues with such queries, most client stop using RDBMS for processing that , they are now using external indexers sphinx, solr, elastic search not for full text but for this. This is the hart of every back office presenting the data in a single page based on users input regarding filter and sorting. Partitioning is used yes, for most big tables, because BTree stop performing when the size of a table stop being in memory. I would say that many workload does not use covering index , and so all secondary index reference the clustered table, making such tables hot most of the time. There is more reasons to use parallel plan , Most storage engine does not have an I/O materialization layer for read , every read from secondary index to a clustered table will be queued in a single execution thread , so on a SAN or a networked FS , 5 ms larency will be added to every page missed , the total latency of such queries can be divide by the concurrency as long have we don't ask more than the total IO read capacity . Prefetching is only used when scanning index or clustered index but not when using both at the same table , or joins to secondary index without MRR. so we have added MRR but InnoDB does exactly the same as before because having this 2 level index scan , from secondary to primary index Now spider partitioning can be created on the flight with a temporary table. to create a virtual partitioning based on the QP . What need to be added is partition condition push down to every node in the reduce thread. Like this every thread can work on different range of the same table by heritage of the range conditions

roberto spadim added a comment - 2014-10-22 22:10 - edited

1) i don't like the idea "multi thread only work if you partition your data", or "create a table to use multi thread" too, maybe implicit optimizations could be done, and at documentation we could explain others methods
we could use parallel query execution with the order by part (why not?)

2) i think this MDEV will solve problems like:

to "get" data (a single SELECT):
2.1) to execute this query "xyz", we need 99999 i/o, we can execute 99iops with 1 thread, 99 with 2 threads and 99 with 9999 threads
2.2) to execute this query "xyzq", we need 99999 i/o, we can execute 9iops with 1 thread, 99 with 2 threads and 999 with 9999 threads
2.3) what's the best iops/thread and how many threads we will need and what order should it be executed to get data from tables?
from 1: (99io/1 thread=99 iops/thread, 99/2=49.5 iops/thread ... 99/9999=0.009 iops/thread) => use 1 thread
from 2: (9/1 = 9 iops/thread, 99/2 = 49.5 iops/thread ... 999/9999 = 0.09 iops/thread) => use 2 threads

to "order by", "group by", "limit", "union" (after fetching data, or how to execute many selects)
2.4) this data with 1 thread we can order by with 10seconds, with 2 threads we can execute with 5 seconds, with 1000 threads we can execute with 1 second, with 100000 threads we can execute with 0.99 second
again something like seconds/thread or a limit of % of gain from 1 to 2 to 3 threads, to select the best number of threads (something like calculating the sin/cos function with a % of error, from 1 to 2 we reduce 50% time, from 2 to 3 we reduce 1% time, let's use 2 threads 1% isn't a nice reduce)

I don't know how to easily explain/extrapolate to user/optimizer these statistics, and limit the number of threads that could be created in each part of query (fetch, order, group, limit, union, etc...)

3)about the workload at order by/limit: i agree, some read/write contention when executing a order by is a killer with some apps, but i don't know if we can remove this or not, my today work around is reduce lock/contention with commit before ordering, and ordering after with others engines or others tools

i really like the idea of "please database order by the fast way you can, i don't care about ordering result be the best one just order it with a low response time"
example with big data (or with the union with fast and slow queires):
a big order by, share the data with 4 threads, each thread execute 25% of order by, if we got the result without a finall order by we have something like "abd, cde, def, efg" (not 'totally' ordered, but with a usefull order at least), with the final order by we have "abcdddeeeffg"
i could accept the first one result if this reduce for example 1 minute or some time that is critical to others threads/resource/user use

roberto spadim added a comment - 2014-10-22 22:10 - edited 1) i don't like the idea "multi thread only work if you partition your data", or "create a table to use multi thread" too, maybe implicit optimizations could be done, and at documentation we could explain others methods we could use parallel query execution with the order by part (why not?) 2) i think this MDEV will solve problems like: to "get" data (a single SELECT): 2.1) to execute this query "xyz", we need 99999 i/o, we can execute 99iops with 1 thread, 99 with 2 threads and 99 with 9999 threads 2.2) to execute this query "xyzq", we need 99999 i/o, we can execute 9iops with 1 thread, 99 with 2 threads and 999 with 9999 threads 2.3) what's the best iops/thread and how many threads we will need and what order should it be executed to get data from tables? from 1: (99io/1 thread=99 iops/thread, 99/2=49.5 iops/thread ... 99/9999=0.009 iops/thread) => use 1 thread from 2: (9/1 = 9 iops/thread, 99/2 = 49.5 iops/thread ... 999/9999 = 0.09 iops/thread) => use 2 threads to "order by", "group by", "limit", "union" (after fetching data, or how to execute many selects) 2.4) this data with 1 thread we can order by with 10seconds, with 2 threads we can execute with 5 seconds, with 1000 threads we can execute with 1 second, with 100000 threads we can execute with 0.99 second again something like seconds/thread or a limit of % of gain from 1 to 2 to 3 threads, to select the best number of threads (something like calculating the sin/cos function with a % of error, from 1 to 2 we reduce 50% time, from 2 to 3 we reduce 1% time, let's use 2 threads 1% isn't a nice reduce) I don't know how to easily explain/extrapolate to user/optimizer these statistics, and limit the number of threads that could be created in each part of query (fetch, order, group, limit, union, etc...) 3)about the workload at order by/limit: i agree, some read/write contention when executing a order by is a killer with some apps, but i don't know if we can remove this or not, my today work around is reduce lock/contention with commit before ordering, and ordering after with others engines or others tools i really like the idea of "please database order by the fast way you can, i don't care about ordering result be the best one just order it with a low response time" example with big data (or with the union with fast and slow queires): a big order by, share the data with 4 threads, each thread execute 25% of order by, if we got the result without a finall order by we have something like "abd, cde, def, efg" (not 'totally' ordered, but with a usefull order at least), with the final order by we have "abcdddeeeffg" i could accept the first one result if this reduce for example 1 minute or some time that is critical to others threads/resource/user use

VAROQUI Stephane added a comment - 2014-10-23 01:06

From what i learn order by is not such a costly operation, it's a creation of an index on the results, cost many CPU cycles without much latency between cycles and innodb can do this this in background threads by merging insert buffer. If you don't have the memory you are dead anyway and parallel query will not help. Also the order by limit can be optimized by LRU keeping the best order from all streams, so attention should be put with fetching data in parallel, and the granularity of such fetches, if you take for granted that a single core can fetch 600K to 1M records/sec from memory, if i have enough memory this is still 20 minutes for SUM(C1) on a billion records table. With a 64 cores machine and enough memory i would like to get this result in few seconds and if i have 20 slaves i would like answer in less than a second, instead of trying to make all type of operation multi threaded , it's more easy to partition a big baby table.
What is the good number of partitions? Number of partitions will define the size of a chunk or a job. This can be defined by a coordinator thread = mapper in map reduce. The job size should limit the number of roundtrip to enable prefetch and mrr to work correctly but also to limit the amount of CPU required by coordinator for reducing the worker results, but also small enough to enable coordinator to schedule a dynamic pool of threads drive by monitoring the number of jobs/s. like what we do today when producing benchmark. We increase concurrency until it does not help. Off loading capability when riching the best performance can be done by -1 +1 -1 +1 # of threads. On a busy concurrent server this would allow to allocate no more than 1.5 threads on all type of range queries.

The optimizer should be adapted for range queries, historically optimizer will compute the plan that produce less cost without knowing about Disk/Memory io distributions, for most OLAP queries this does not work and the best plan is alway going from the big table to small tables . Materialization of small tables should be done to apply filters to organize hash or mrr join as small as possible.
The range optimizer can try to guess such situation by looking the number of records in the range * avg record size * memory lost factor (2 for innodb) and compare to the available memory for the storage engine. This can be enough to invalidate a plan that would jump from secondary index to such big table. Let's say that the optimizer found a plan of 1K read + 10K ref starting form small table to big table but that we a 50% page missed ratio deduce from previous computation this will produce 1000 RND I/O that would take 10s on spinning disk . in 10 seconds i can prefetch 8 Millions rows minus the cost of 8 million eq_ref in memory, but with 64 threads i can prefecth 64*8M record with only 640 jumps on disks . that is drastically changing some optimizer rules on spinning disks !

VAROQUI Stephane added a comment - 2014-10-23 01:06 From what i learn order by is not such a costly operation, it's a creation of an index on the results, cost many CPU cycles without much latency between cycles and innodb can do this this in background threads by merging insert buffer. If you don't have the memory you are dead anyway and parallel query will not help. Also the order by limit can be optimized by LRU keeping the best order from all streams, so attention should be put with fetching data in parallel, and the granularity of such fetches, if you take for granted that a single core can fetch 600K to 1M records/sec from memory, if i have enough memory this is still 20 minutes for SUM(C1) on a billion records table. With a 64 cores machine and enough memory i would like to get this result in few seconds and if i have 20 slaves i would like answer in less than a second, instead of trying to make all type of operation multi threaded , it's more easy to partition a big baby table. What is the good number of partitions? Number of partitions will define the size of a chunk or a job. This can be defined by a coordinator thread = mapper in map reduce. The job size should limit the number of roundtrip to enable prefetch and mrr to work correctly but also to limit the amount of CPU required by coordinator for reducing the worker results, but also small enough to enable coordinator to schedule a dynamic pool of threads drive by monitoring the number of jobs/s. like what we do today when producing benchmark. We increase concurrency until it does not help. Off loading capability when riching the best performance can be done by -1 +1 -1 +1 # of threads. On a busy concurrent server this would allow to allocate no more than 1.5 threads on all type of range queries. The optimizer should be adapted for range queries, historically optimizer will compute the plan that produce less cost without knowing about Disk/Memory io distributions, for most OLAP queries this does not work and the best plan is alway going from the big table to small tables . Materialization of small tables should be done to apply filters to organize hash or mrr join as small as possible. The range optimizer can try to guess such situation by looking the number of records in the range * avg record size * memory lost factor (2 for innodb) and compare to the available memory for the storage engine. This can be enough to invalidate a plan that would jump from secondary index to such big table. Let's say that the optimizer found a plan of 1K read + 10K ref starting form small table to big table but that we a 50% page missed ratio deduce from previous computation this will produce 1000 RND I/O that would take 10s on spinning disk . in 10 seconds i can prefetch 8 Millions rows minus the cost of 8 million eq_ref in memory, but with 64 threads i can prefecth 64*8M record with only 640 jumps on disks . that is drastically changing some optimizer rules on spinning disks !

roberto spadim added a comment - 2014-10-23 19:54

i agree and see that considering i/o scheduler doing a good job, the multi thread create more iops than single thread, and considering a shard/cluster table (engine) create more "workers" than single thread, no doubt about it

my doubt still as, how many threads should be used in each part of query? how to optimize this number? should we execute every part as multithread or we should consider some part as single thread?

there's some guys at arm world talking about 65k cores in only one machine, at intell x86-64 we talk about 128 cores, i don't know if we should consider only the size of data, but the cpu power is something that will have some "complex" scenarios in future like many cores with small "computer power" (arm) or less cores with a big "computer power" (xeon x86-64)
i don't have idea how could optimize the number of threads yet, we have number of cores, "computer power" of each core (arm/x86/etc), size of data, storage iops, storage read(write) rate, and memory (cache/buffer/access rate/cpu cache/etc)

roberto spadim added a comment - 2014-10-23 19:54 i agree and see that considering i/o scheduler doing a good job, the multi thread create more iops than single thread, and considering a shard/cluster table (engine) create more "workers" than single thread, no doubt about it my doubt still as, how many threads should be used in each part of query? how to optimize this number? should we execute every part as multithread or we should consider some part as single thread? there's some guys at arm world talking about 65k cores in only one machine, at intell x86-64 we talk about 128 cores, i don't know if we should consider only the size of data, but the cpu power is something that will have some "complex" scenarios in future like many cores with small "computer power" (arm) or less cores with a big "computer power" (xeon x86-64) i don't have idea how could optimize the number of threads yet, we have number of cores, "computer power" of each core (arm/x86/etc), size of data, storage iops, storage read(write) rate, and memory (cache/buffer/access rate/cpu cache/etc)

Federico Razzoli added a comment - 2015-06-06 15:38

I totally agree with Varouqi about the need for hash joins - this is probably the most important limitation that make Maria not viable for big data warehouses.

Any realistic chances to see this in 10.2?

Federico Razzoli added a comment - 2015-06-06 15:38 I totally agree with Varouqi about the need for hash joins - this is probably the most important limitation that make Maria not viable for big data warehouses. Any realistic chances to see this in 10.2?

Larry Adams added a comment - 2022-05-05 19:06 - edited

It would be nice if we could get the ShardQuery concept embedded in MariaDB. For example, if I need to run a GROUP BY across 20 partitions or 20 tables, it would be nice if that was at least 20 parallel queries in using a Map Reduce algorithm. I keep telling myself that I'm just going to have to write this myself. I looked a Shawn's ShardQuery, and though it's great, it places too many dependencies onto any system deployment. MariaDB should just be doing what Shawn was doing in his ShardQuery. Could it be that there are patent implications preventing it's broader adoption in MariaDB?

Column Store, though interesting, likely is not the right solution in all cases.

Larry Adams added a comment - 2022-05-05 19:06 - edited It would be nice if we could get the ShardQuery concept embedded in MariaDB. For example, if I need to run a GROUP BY across 20 partitions or 20 tables, it would be nice if that was at least 20 parallel queries in using a Map Reduce algorithm. I keep telling myself that I'm just going to have to write this myself. I looked a Shawn's ShardQuery, and though it's great, it places too many dependencies onto any system deployment. MariaDB should just be doing what Shawn was doing in his ShardQuery. Could it be that there are patent implications preventing it's broader adoption in MariaDB? Column Store, though interesting, likely is not the right solution in all cases.

VAROQUI Stephane added a comment - 2022-05-06 06:04

Hi Larry MariaDB offer already Spider for this just create a server object for each partition or modulo the number of core to use did you try it ?

VAROQUI Stephane added a comment - 2022-05-06 06:04 Hi Larry MariaDB offer already Spider for this just create a server object for each partition or modulo the number of core to use did you try it ?

VAROQUI Stephane added a comment - 2022-05-06 06:46

If look closely, that's one direction core dev of MariaDB follow while putting a majority of ressources at staying the best for OLTP( NVme ,PMem, CPU architecture contention, for InnoDB and thread pool ) at the same time they have enable query fragment push down to storage engine and a global threading service so that storage engine does not have to worry about background thread allocation. This enable storage engines like ColomnStore or Xpand to be plugged into MariaDB and no more being a Fork, like it was forced in the past. So not yet perfect but it looks to be moving in one architecture direction and Larry dream of coding something could be done by taking the map and reduce step of an SQL fragment. There are some particular issues that have been fixed regarding XA that make me think that reentrant transactions may arrive one day and this is required for consistency of concurrent execution , may be Sergei or Monty could share their view on the roadmap on this topic but i guess so many important other features have been needed like JSON support, CTE, external foreign key and statististics , S3 storage , non Blocking DDL, replication lag free DDL that we can't blame anyone for moving slow into that direction.

VAROQUI Stephane added a comment - 2022-05-06 06:46 If look closely, that's one direction core dev of MariaDB follow while putting a majority of ressources at staying the best for OLTP( NVme ,PMem, CPU architecture contention, for InnoDB and thread pool ) at the same time they have enable query fragment push down to storage engine and a global threading service so that storage engine does not have to worry about background thread allocation. This enable storage engines like ColomnStore or Xpand to be plugged into MariaDB and no more being a Fork, like it was forced in the past. So not yet perfect but it looks to be moving in one architecture direction and Larry dream of coding something could be done by taking the map and reduce step of an SQL fragment. There are some particular issues that have been fixed regarding XA that make me think that reentrant transactions may arrive one day and this is required for consistency of concurrent execution , may be Sergei or Monty could share their view on the roadmap on this topic but i guess so many important other features have been needed like JSON support, CTE, external foreign key and statististics , S3 storage , non Blocking DDL, replication lag free DDL that we can't blame anyone for moving slow into that direction.

Larry Adams added a comment - 2022-05-06 16:39

Okay, coming up to speed on Spider. Got a few more questions.

1) I see Map Reduce by GROUP BY, ORDER & LIMIT, and aggregation functions which is great.
2) What about getting away from UNION though? If I have a bunch of tables that are created based upon a TIMESTAMP range, do I simply create temporary spider table with virtual partitions with the requested non-partitioned but identical schema tables that match the time range, run the query and then drop the temporary table?
3) Is the Map Reduce Multi-level? in other words, will each shard run an intermediate GROUP BY as well as process the WHERE clause within the shard to limit the amount of data for the final reduce phase?
4) Can the Spider Node also be a Data Node? (for testing anyway).

Thanks!!

Larry Adams added a comment - 2022-05-06 16:39 Okay, coming up to speed on Spider. Got a few more questions. 1) I see Map Reduce by GROUP BY, ORDER & LIMIT, and aggregation functions which is great. 2) What about getting away from UNION though? If I have a bunch of tables that are created based upon a TIMESTAMP range, do I simply create temporary spider table with virtual partitions with the requested non-partitioned but identical schema tables that match the time range, run the query and then drop the temporary table? 3) Is the Map Reduce Multi-level? in other words, will each shard run an intermediate GROUP BY as well as process the WHERE clause within the shard to limit the amount of data for the final reduce phase? 4) Can the Spider Node also be a Data Node? (for testing anyway). Thanks!!

Larry Adams added a comment - 2022-05-06 17:18

Oh, I see a problem, I'm just trying to query these tables, not insert into them. If the tables already exist, and I just want a virtual tables that queries from them in a Read Only Fashion, am I going to run into a problem? Let me apologize in advance for my ignorance. This is why I was speaking in terms of a temporary table. It's more like a shard enabled view.

Larry Adams added a comment - 2022-05-06 17:18 Oh, I see a problem, I'm just trying to query these tables, not insert into them. If the tables already exist, and I just want a virtual tables that queries from them in a Read Only Fashion, am I going to run into a problem? Let me apologize in advance for my ignorance. This is why I was speaking in terms of a temporary table. It's more like a shard enabled view.

Larry Adams added a comment - 2022-05-06 18:53

Okay, testing now. So, let's say I have to create a table with 50 partitions, and I want to use 10 connections per server to 5 servers. How do you do that? The way it's working for me, is that each query is handled serially, so my speedup with 50 partitions would only be about 5x even though I've got 120 threads per server and NVMe everywhere.

How do I get Spider to hit each server with 10 concurrent threads? Oh, and I faked out a partition creation by using "partition by hash" to get it to work, but ended up crashing the database server when I tried to "alter table add partition". The underlying tables were not hashed. 10.8.x. Sorry for polluting this thread. I'll blow things away afterwards.

Larry Adams added a comment - 2022-05-06 18:53 Okay, testing now. So, let's say I have to create a table with 50 partitions, and I want to use 10 connections per server to 5 servers. How do you do that? The way it's working for me, is that each query is handled serially, so my speedup with 50 partitions would only be about 5x even though I've got 120 threads per server and NVMe everywhere. How do I get Spider to hit each server with 10 concurrent threads? Oh, and I faked out a partition creation by using "partition by hash" to get it to work, but ended up crashing the database server when I tried to "alter table add partition". The underlying tables were not hashed. 10.8.x. Sorry for polluting this thread. I'll blow things away afterwards.

Rick James added a comment - 2022-05-07 02:28

I suggest the following plan.

1. Put something simple in. This lets us say that "parallel execution" does exist. (Marketing!)
2. Wait a decade, then make it more useful. (It took 2 decades for ALTER to get inplace/instant/etc., except for ENUMs. It took that long before UNION ALL found out how to bypass the temp table in some situations.)

Sorry for being cynical.

The "simple" implementation might limit itself to

One thread per partition (after pruning). And have a tunable "max_parallel_threads". We should consider a lot of actions: Select, Update, Delete, Alter.
"Index merge union" could be multi-threaded until it needs to do the merge.
Ditto for UNION (ALL or DISTINCT) – However this would need an optimizer flag to choose to avoid the temp table versus using parallelism but forcing the temp table.

I would be happy to add another reason for using Partitioning to my very short list. (Oops, I am being cynical again.)

Rick James added a comment - 2022-05-07 02:28 I suggest the following plan. 1. Put something simple in. This lets us say that "parallel execution" does exist. (Marketing!) 2. Wait a decade, then make it more useful. (It took 2 decades for ALTER to get inplace/instant/etc., except for ENUMs. It took that long before UNION ALL found out how to bypass the temp table in some situations.) Sorry for being cynical. The "simple" implementation might limit itself to One thread per partition (after pruning). And have a tunable "max_parallel_threads". We should consider a lot of actions: Select, Update, Delete, Alter. "Index merge union" could be multi-threaded until it needs to do the merge. Ditto for UNION (ALL or DISTINCT) – However this would need an optimizer flag to choose to avoid the temp table versus using parallelism but forcing the temp table. I would be happy to add another reason for using Partitioning to my very short list. (Oops, I am being cynical again.)

Federico Razzoli added a comment - 2022-05-07 10:41

Agree with Rick, but I'm a bit afraid of performance regressions. Please support a way to avoid parallelisation for a specific query or globally.

Federico Razzoli added a comment - 2022-05-07 10:41 Agree with Rick, but I'm a bit afraid of performance regressions. Please support a way to avoid parallelisation for a specific query or globally.

Larry Adams added a comment - 2022-05-07 11:38

Okay, I setup two shard servers and created a faux partitioned table with even more faux hash and a few millions of record of real data, and then ran the most simplest if group by queries with an aggregate function, and then proceeded to observe zero parallelism. Everything ran serially, the query ran like a simple union hitting one table at a time. My use case is biased towards select. These are read only tables. So, disappointed.

I agree with Rick, add a table to system option to define the level of parallelism and it should work as I described, give me a nearly 50x speedup for reads to a read only set of tables. As far as I am concerned, make it a read option only, and give us a flag to mark the table as read only if that helps. If this technique is to work, we also need a dummy partitioning algo, yes and I know that is sacreligious. The workaround would be to add a useless sequence column to every table, but now I'm quickly entering the hacker zone. It's a write vs buy decision at this point as I refuse to give up on mariadb. If Shawn can do it using gearman and PHP, I can too. I simply don't want to go that route.

Larry Adams added a comment - 2022-05-07 11:38 Okay, I setup two shard servers and created a faux partitioned table with even more faux hash and a few millions of record of real data, and then ran the most simplest if group by queries with an aggregate function, and then proceeded to observe zero parallelism. Everything ran serially, the query ran like a simple union hitting one table at a time. My use case is biased towards select. These are read only tables. So, disappointed. I agree with Rick, add a table to system option to define the level of parallelism and it should work as I described, give me a nearly 50x speedup for reads to a read only set of tables. As far as I am concerned, make it a read option only, and give us a flag to mark the table as read only if that helps. If this technique is to work, we also need a dummy partitioning algo, yes and I know that is sacreligious. The workaround would be to add a useless sequence column to every table, but now I'm quickly entering the hacker zone. It's a write vs buy decision at this point as I refuse to give up on mariadb. If Shawn can do it using gearman and PHP, I can too. I simply don't want to go that route.

VAROQUI Stephane added a comment - 2022-05-08 12:19

For spider to work wit concurrency you need to tell him if when servers on partition point to localhost and about to query in concurrency please refer to documentation or test cases for this

VAROQUI Stephane added a comment - 2022-05-08 12:19 For spider to work wit concurrency you need to tell him if when servers on partition point to localhost and about to query in concurrency please refer to documentation or test cases for this

Rick James added a comment - 2022-05-08 15:44

I picture Spider as being mainly used to access separate, potentially remote, servers. I have trouble imagining that Spider provides any efficiency when using it to run multiple threads on a single server. I expected this Mdev thread to discuss how to coordinate and optimize the multi-threading of a single SQL statement running on a single server.

Am I wrong about Spider? Does it have special hooks to realize that the targets happen to be the same server, and take advantage of such?

Rick James added a comment - 2022-05-08 15:44 I picture Spider as being mainly used to access separate, potentially remote, servers. I have trouble imagining that Spider provides any efficiency when using it to run multiple threads on a single server. I expected this Mdev thread to discuss how to coordinate and optimize the multi-threading of a single SQL statement running on a single server. Am I wrong about Spider? Does it have special hooks to realize that the targets happen to be the same server, and take advantage of such?

VAROQUI Stephane added a comment - 2022-05-09 06:55

Yes i used it many years ago and without using udf for this

for concurrency it happen at connection level so you need to declare different servers that point to same backend , if the backend is the same as the spider you also need to add spider_same_server_link.
https://mariadb.com/kb/en/spider-server-system-variables/#spider_bgs_mode

VAROQUI Stephane added a comment - 2022-05-09 06:55 Yes i used it many years ago and without using udf for this for concurrency it happen at connection level so you need to declare different servers that point to same backend , if the backend is the same as the spider you also need to add spider_same_server_link. https://mariadb.com/kb/en/spider-server-system-variables/#spider_bgs_mode

VAROQUI Stephane added a comment - 2022-05-09 07:07

I'm sharing some parameters that are of. interest for that case
loose_spider_reset_sql_alloc=1 # because multiple part scan consume more memory

Those are in my favorites but i have no more clue if they play a role on this
loose_spider_connect_mutex = 0
loose_spider_conn_recycle_mode=0
loose_spider_conn_recycle_strict = 0
loose_spider_local_lock_table=0
oose_spider_semi_table_lock = 0
loose_spider_support_xa=0
loose_spider_direct_dup_insert = 1
loose_spider_remote_sql_log_off=1
loose_spider_casual_read=0
loose_spider_bka_mode = 0
loose_spider_quick_mode=3
loose_spider_quick_page_size=1000
loose_spider_sync_trx_isolation=1
loose_spider_sync_autocommit=1

Intrested to get any feadback if they do .

VAROQUI Stephane added a comment - 2022-05-09 07:07 I'm sharing some parameters that are of. interest for that case loose_spider_reset_sql_alloc=1 # because multiple part scan consume more memory Those are in my favorites but i have no more clue if they play a role on this loose_spider_connect_mutex = 0 loose_spider_conn_recycle_mode=0 loose_spider_conn_recycle_strict = 0 loose_spider_local_lock_table=0 oose_spider_semi_table_lock = 0 loose_spider_support_xa=0 loose_spider_direct_dup_insert = 1 loose_spider_remote_sql_log_off=1 loose_spider_casual_read=0 loose_spider_bka_mode = 0 loose_spider_quick_mode=3 loose_spider_quick_page_size=1000 loose_spider_sync_trx_isolation=1 loose_spider_sync_autocommit=1 Intrested to get any feadback if they do .

VAROQUI Stephane added a comment - 2022-05-09 07:14

When Kentoku was onboarded he was supposed to work on SQL rewrite plugin that would have enable easy SQL syntax integration of such concurrent execution

VAROQUI Stephane added a comment - 2022-05-09 07:14 When Kentoku was onboarded he was supposed to work on SQL rewrite plugin that would have enable easy SQL syntax integration of such concurrent execution

VAROQUI Stephane added a comment - 2022-05-09 07:22

The first integration of this feature was possible on queries using a single table, but a following work was made that if other tables of in the query are federated to the same backend as the partition then it possible to used parallel partition scan as well.

VAROQUI Stephane added a comment - 2022-05-09 07:22 The first integration of this feature was possible on queries using a single table, but a following work was made that if other tables of in the query are federated to the same backend as the partition then it possible to used parallel partition scan as well.

Larry Adams added a comment - 2022-05-10 19:13 - edited

Stephane,

I've been going through the documentation, I don't think there is a real good write-up on how to run Spider queries in a true parallel map reduce fashion (or force it to). I'll do more investigation in my home lab. Thus far, I have distributed a years worth of tables (365) to 4 servers that are all separate from the server that will solve as the Spider server. So, I've got a good setup. When I did my first test in a customer environment, as mentioned, it was pure serial one server and one partition at a time (disappointing). But that environment used the spider server as a data server as well, so not sure if that had any impact.

What I'm really trying to do I will attach as a screen grab.

My plan will be to:

1) Setup 20 servers to 4 back-ends
2) Create a script to UNION 40 tables into a partition table (programmatically)
3) Run some queries while watching the process list for parallelism.
4) Tweak settings and repeat 3.

Larry

Larry Adams added a comment - 2022-05-10 19:13 - edited Stephane, I've been going through the documentation, I don't think there is a real good write-up on how to run Spider queries in a true parallel map reduce fashion (or force it to). I'll do more investigation in my home lab. Thus far, I have distributed a years worth of tables (365) to 4 servers that are all separate from the server that will solve as the Spider server. So, I've got a good setup. When I did my first test in a customer environment, as mentioned, it was pure serial one server and one partition at a time (disappointing). But that environment used the spider server as a data server as well, so not sure if that had any impact. What I'm really trying to do I will attach as a screen grab. My plan will be to: 1) Setup 20 servers to 4 back-ends 2) Create a script to UNION 40 tables into a partition table (programmatically) 3) Run some queries while watching the process list for parallelism. 4) Tweak settings and repeat 3. Larry

Larry Adams added a comment - 2022-05-11 14:56

Stephane,

With all those settings, zero performance difference. Each partition is handled serially. There is this document where Shiba-san was claiming that he had a patch that added parallelization and he provided no instructions in the write-up as to how to leverage it. I'm not sure what ever became of that though.

https://blog.mariadb.org/wp-content/uploads/2014/05/Spider_in_MariaDB_20140403.pdf

Larry

Larry Adams added a comment - 2022-05-11 14:56 Stephane, With all those settings, zero performance difference. Each partition is handled serially. There is this document where Shiba-san was claiming that he had a patch that added parallelization and he provided no instructions in the write-up as to how to leverage it. I'm not sure what ever became of that though. https://blog.mariadb.org/wp-content/uploads/2014/05/Spider_in_MariaDB_20140403.pdf Larry

MariaDB Server

Ideas about parallel query execution

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration