[MCOL-5250] Disk-based DISTINCT Created: 2022-10-05  Updated: 2023-12-22

Status: Stalled
Project: MariaDB ColumnStore
Component/s: ExeMgr, PrimProc
Affects Version/s: 22.08.1
Fix Version/s: 23.10

Type: New Feature Priority: Major
Reporter: Roman Assignee: Roman
Resolution: Unresolved Votes: 0
Labels: rm_big_data

Issue Links:
PartOf
includes MCOL-5187 OOM happening when querying large dat... Stalled
includes MCOL-5541 Disk-based distinct :Create a separat... Stalled
Epic Link: Tech debt
Sprint: 2021-17, 2022-22, 2022-23, 2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10, 2023-11

 Description   

As of 22.08.1, MCS does DISTINCT processing in TupleAnnexStep. This step uses a hash map for the purpose. This solution is simple, but it:

  • lacks scalability
  • can't leverage the disk-based capabilities of the RowStorage class used by GROUP BY
  • uses a hash map that ResourceManager, which accounts for RAM consumption, doesn't count
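
For illustration only, here is a minimal sketch of hash-based DISTINCT in which every hash map entry is charged against a memory budget. It is not the TupleAnnexStep code: MemoryBudget and distinctInMemory are hypothetical names, rows are simplified to strings, and the per-entry size is a rough estimate. The point where reserve() fails is where a disk-based implementation would have to spill instead of giving up.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a memory accountant.
// This is NOT the ColumnStore ResourceManager API.
class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t limitBytes) : limit_(limitBytes) {}

    // Try to reserve `bytes`; returns false if the limit would be exceeded.
    bool reserve(std::size_t bytes) {
        assert(used_ <= limit_);  // invariant, keeps the subtraction safe
        if (bytes > limit_ - used_) return false;
        used_ += bytes;
        return true;
    }

    std::size_t used() const { return used_; }

private:
    std::size_t limit_;
    std::size_t used_ = 0;
};

// Hash-based DISTINCT over string rows (hypothetical helper).
// Appends first occurrences to `out` in input order and returns false as
// soon as the budget is exhausted.
bool distinctInMemory(const std::vector<std::string>& rows,
                      MemoryBudget& budget,
                      std::vector<std::string>& out) {
    std::unordered_set<std::string> seen;
    for (const auto& row : rows) {
        if (seen.count(row) != 0) continue;  // duplicate, nothing to store
        // Rough per-entry estimate: payload plus the string object itself.
        if (!budget.reserve(row.size() + sizeof(std::string))) return false;
        seen.insert(row);
        out.push_back(row);
    }
    return true;
}
```

With a generous budget the function returns true and `out` holds one copy of each row; with a budget smaller than a single entry it returns false on the first row.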

This issue is about a new DISTINCT implementation (presumably based on RowStorage) that:

  • can do external DISTINCT, spilling to disk if necessary
  • has its RAM consumption counted by ResourceManager
  • scales (this might be tricky since DISTINCT processing overlaps with ORDER BY)
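
One possible shape of external DISTINCT is sketched below, purely as an assumption about the general technique and not as the RowStorage design. Rows are hash-partitioned into temporary spill files, then each partition is deduplicated on its own, so only one partition's hash set is in memory at a time. The names externalDistinct, SpillFile and openSpillFile are hypothetical, rows are simplified to strings, and the sketch always spills instead of spilling only when memory runs out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// RAII handle: closing a std::tmpfile() stream also deletes the file.
struct FileCloser {
    void operator()(std::FILE* f) const { std::fclose(f); }
};
using SpillFile = std::unique_ptr<std::FILE, FileCloser>;

SpillFile openSpillFile() {
    SpillFile f(std::tmpfile());
    if (!f) throw std::runtime_error("cannot create spill file");
    return f;
}

// Hypothetical helper, not a ColumnStore API.
// Equal rows hash to the same partition, so per-partition deduplication
// is sufficient. Output order is NOT the input order.
std::vector<std::string> externalDistinct(const std::vector<std::string>& rows,
                                          std::size_t partitions) {
    assert(partitions > 0 && "need at least one partition");

    // Phase 1: spill each row, length-prefixed, to its hash partition.
    std::vector<SpillFile> files;
    for (std::size_t i = 0; i < partitions; ++i) files.push_back(openSpillFile());

    std::hash<std::string> hasher;
    for (const auto& row : rows) {
        std::FILE* f = files[hasher(row) % partitions].get();
        const std::size_t len = row.size();
        if (std::fwrite(&len, sizeof len, 1, f) != 1 ||
            std::fwrite(row.data(), 1, len, f) != len)
            throw std::runtime_error("spill write failed");
    }

    // Phase 2: read partitions back one at a time and deduplicate each.
    std::vector<std::string> out;
    for (auto& file : files) {
        std::rewind(file.get());
        std::unordered_set<std::string> seen;  // holds one partition only
        std::size_t len = 0;
        while (std::fread(&len, sizeof len, 1, file.get()) == 1) {
            std::string row(len, '\0');
            if (len != 0 && std::fread(&row[0], 1, len, file.get()) != len)
                throw std::runtime_error("spill read failed");
            if (seen.insert(row).second) out.push_back(std::move(row));
        }
        file.reset();  // close and delete this partition before the next
    }
    return out;
}
```

Because the result comes out grouped by partition rather than in input or sorted order, a real implementation would still need to reconcile this with ORDER BY, which is the overlap the last bullet points at.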

Generated at Thu Feb 08 02:56:28 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.