[MCOL-5588] Make GROUP BY and joins parallel on cluster Created: 2023-10-10  Updated: 2023-12-21

Status: Open
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: None
Fix Version/s: 23.10

Type: New Feature Priority: Major
Reporter: Sergey Zefirov Assignee: Sergey Zefirov
Resolution: Unresolved Votes: 0
Labels: rm_perf

Sprint: 2023-11, 2023-12

 Description   

It is possible to use SUMMA-like algorithms for joins and GROUP BY functionality.

https://cseweb.ucsd.edu/classes/sp11/cse262-a/Lectures/262-pres1-hal.pdf - a neat description of what SUMMA does. Basically, each node holds part of source matrix, computes part of resulting matrix, accumulates results that belong to node and distributes parts of another source matrix and partial computation results to other nodes.

As we do not have neat mesh definition, we can use ring or ring with skips. We also do not need to perform partial result broadcast (except for ROLLUPs).

Main node can sent different parts of data into different worker nodes for these nodes to compute what data belongs to them and what data need to be sent further. Worker node can also compute hashes of rows so that otherr nodes not need to.

So, this is main overview of the algorithm. I will expand on actual details in comments.


Generated at Thu Feb 08 02:58:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.