Major Regression: Selects with aggregates 2x slower in 5.x than in 1.2 (due to collation support) (MCOL-4691)

[MCOL-4717] Conduct experiments measuring the overall impact of using MURMUR inside collation comparators
Created: 2021-05-11  Updated: 2023-10-27  Resolved: 2023-10-27

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: None
Fix Version/s: 23.10

Type: Sub-Task
Priority: Critical
Reporter: Gregory Dorman (Inactive)
Assignee: Leonid Fedorov
Resolution: Won't Do
Votes: 0
Labels: None

Sprint: 2021-7, 2021-8, 2021-9, 2021-10, 2021-11, 2021-12

 Description   

The goal is to evaluate a proposal to correct the performance problem.

The method:

  • Create a test version of the latin1_nopad_bin collation function that uses MURMUR.
  • Measure the difference in performance (the expectation is 2x vs. the current 6x).
  • Compare the behavior of aggregate queries that use the updated collation against:
    a) original CS method - present in 1.2
    b) current - as in 5.5.2

The last comparison should be run on both the flights (30 million rows) and Quinnstreet (1 billion rows) data sets.
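
For illustration only, here is a minimal Python sketch of what a MURMUR-based hash for a NOPAD binary collation could look like, together with a tiny hash-bucketed aggregate. It is not ColumnStore code. It assumes "MURMUR" means the MurmurHash3 x86_32 variant (the variant actually used by the engine may differ), and the names murmur3_32, nopad_bin_hash, and group_count are hypothetical. Because the collation is binary, the raw bytes serve as the hash input; because it is NOPAD, trailing spaces are significant and are not trimmed.

```python
from collections import defaultdict

MASK32 = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32 over raw bytes (assumed variant, see lead-in)."""
    c1, c2 = 0xCC9E2D51, 0x1B873593
    h = seed & MASK32
    n = len(data)
    full = n - (n % 4)
    # Body: mix each complete 4-byte little-endian block into the state.
    for i in range(0, full, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32
    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[full:]
    if tail:
        k = int.from_bytes(tail, "little")
        k = (k * c1) & MASK32
        k = _rotl32(k, 15)
        k = (k * c2) & MASK32
        h ^= k
    # Finalization: fold in the length and avalanche the bits.
    h ^= n & MASK32
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def nopad_bin_hash(value: bytes) -> int:
    """Hypothetical hash for a NOPAD binary collation: hash the bytes as-is."""
    return murmur3_32(value)


def group_count(values, key_hash):
    """COUNT(*) ... GROUP BY sketch: bucket by hash, resolve by byte equality."""
    buckets = defaultdict(list)  # hash value -> list of [key, count] entries
    for v in values:
        bucket = buckets[key_hash(v)]
        for entry in bucket:
            if entry[0] == v:  # binary NOPAD equality is plain byte equality
                entry[1] += 1
                break
        else:
            bucket.append([v, 1])
    return {key: count for bucket in buckets.values() for key, count in bucket}
```

Calling group_count([b"AA", b"AA ", b"aa", b"AA"], nopad_bin_hash) keeps b"AA", b"AA " and b"aa" as three separate groups, which is the behavior a NOPAD binary collation implies. Timings taken from a sketch like this say nothing about the engine; the actual comparison has to be made inside PrimProc on the data sets named above.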

Once we have the facts, we will decide what to do.



 Comments   
Comment by Roman [ 2023-01-29 ]

I am curious what exactly we are talking about: collation-aware comparators or hashers?
If we are talking about comparators, MM3 is out of scope. If we are talking about hashers, it is a different story. There are multiple SQL operators that use hashing, e.g. JOIN, GROUP BY, DISTINCT. JFYI, GB and JOIN use MM3 to hash the weight arrays that represent char/varchar/text values while preserving the order relation. Here is where the hashing takes place. JOIN still uses MDB hashing internally. DISTINCT also uses MDB hashing, but I am aware that it will be replaced with the GB code.
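
To make the comparator/hasher distinction concrete, here is a small Python sketch under loudly stated assumptions: weights() is a toy case-insensitive collation invented for this example (one weight byte per character, ASCII letters folded to upper case), and zlib.crc32 is a stand-in for MM3 chosen only to keep the sketch short. None of these names come from the ColumnStore source. The point it shows is that both the comparator and the hasher work on the same weight array, so any two strings the comparator treats as equal also hash to the same value.

```python
import zlib


def weights(s: str) -> bytes:
    """Toy collation weights (hypothetical): fold a-z to A-Z, keep other bytes."""
    data = s.encode("latin-1")
    return bytes(b - 32 if 0x61 <= b <= 0x7A else b for b in data)


def collation_compare(a: str, b: str) -> int:
    """Comparator: orders strings by their weight arrays; no hashing involved."""
    wa, wb = weights(a), weights(b)
    return (wa > wb) - (wa < wb)


def collation_hash(s: str, hash_bytes=zlib.crc32) -> int:
    """Hasher: hashes the weight array (crc32 here is a stand-in for MM3)."""
    return hash_bytes(weights(s))
```

With these definitions, collation_compare("Flight", "FLIGHT") returns 0 and the two strings share one collation_hash value, so a hash-based GROUP BY, JOIN, or DISTINCT would place them in the same bucket. The comparator never calls the hash function, which is why a faster hash can only help the hashing operators.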
