[MDEV-32870] Improve the performance of COUNT DISTINCT when run on sorted / indexed columns Created: 2023-11-24  Updated: 2023-12-11

Status: Open
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: None

Type: Task Priority: Major
Reporter: Daniel Black Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-30660 COUNT DISTINCT seems unnecessarily sl... In Progress

 Description   

Computing COUNT(DISTINCT <col1>, <col2>, <col3>) is traditionally done by storing the values in a UNIQUE object. The UNIQUE object uses a red-black tree, which has logarithmic complexity for insert and lookup.

This means we achieve a theoretical complexity of O(rows * log(rows)).

There is a way to speed this up if we know that <col1>, <col2>, <col3> are already sorted.

In that case we only need to detect when the tuple (col1, col2, col3) changes and increment the count. This would lead to a complexity of O(rows), similar to that of COUNT(*).
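The difference between the two approaches can be sketched outside the server (a minimal Python sketch, not MariaDB code): the sorted variant compares each tuple only with its predecessor, while the baseline materializes every tuple in a set, much as the UNIQUE object does in its tree.

```python
def count_distinct_sorted(rows):
    """rows: iterable of tuples, pre-sorted on all tuple positions.

    O(rows): a new distinct value can only start at a tuple boundary,
    so counting the boundaries counts the distinct values."""
    count = 0
    prev = object()          # sentinel that compares unequal to any tuple
    for row in rows:
        if row != prev:      # tuple changed: a new distinct value starts
            count += 1
            prev = row
    return count

def count_distinct_unsorted(rows):
    """Baseline: materialize every tuple, as the tree-backed UNIQUE object does."""
    return len(set(rows))

# Mirrors the (c, d, e) columns of the test data below.
rows = sorted((i % 11, i % 15, i % 22) for i in range(1, 100001))
assert count_distinct_sorted(rows) == count_distinct_unsorted(rows)
```

Both functions agree on sorted input; only the sorted variant avoids building a per-row lookup structure.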

Test case to show the performance penalty:

analyze format=JSON 
select count(DISTINCT c,d,e),b from tt where a = 1 group by b\G
*************************** 1. row ***************************
ANALYZE: {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 518.9,
    "table": {
      "table_name": "tt",
      "access_type": "ref",
      "possible_keys": ["tt_a_IDX"],
      "key": "tt_a_IDX",
      "key_length": "4",
      "used_key_parts": ["a"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 248763,
      "r_rows": 500000,
      "r_total_time_ms": 99.395,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "tt.a <=> 1",
      "using_index": true
    }
  }
}
1 row in set (0.523 sec)

And here we can see the performance of the count(*) case:

analyze format=JSON
select count(*),b from tt where a = 1 group by b\G
*************************** 1. row ***************************
ANALYZE: {
  "query_block": {
    "select_id": 1,
    "r_loops": 1,
    "r_total_time_ms": 114.54,
    "table": {
      "table_name": "tt",
      "access_type": "ref",
      "possible_keys": ["tt_a_IDX"],
      "key": "tt_a_IDX",
      "key_length": "4",
      "used_key_parts": ["a"],
      "ref": ["const"],
      "r_loops": 1,
      "rows": 248763,
      "r_rows": 500000,
      "r_total_time_ms": 97.696,
      "filtered": 100,
      "r_filtered": 100,
      "attached_condition": "tt.a <=> 1",
      "using_index": true
    }
  }
}
1 row in set (0.115 sec)

Data:

CREATE TABLE tt (a int, b varchar(32), c varchar(64), d varchar(64), e varchar(64), f int);
CREATE INDEX tt_a_IDX USING BTREE ON tt (a,b,c,d,e);
insert into tt select 1, mod(seq, 5), mod(seq, 11), mod(seq, 15), mod(seq, 22), seq from seq_1_to_1000000;



 Comments   
Comment by Sergei Petrunia [ 2023-11-24 ]

Is the intent to
A. take advantage of an existing index producing a compatible ordering, or
B. also modify the sorting criteria (and maybe even require sorting) to get the data into some compatible ordering?

Option A is easy, as we know it's always cheaper to do that...
Option B is harder, as sorting with more sorting criteria (or even sorting where it wasn't previously required) may add some overhead...

Comment by Daniel Black [ 2023-11-24 ]

I was only intending A.

Is this solvable with minor changes to the existing mechanism?

select DISTINCT b,c,d,e from tt where a = 1

And the following, for deterministic functions F and F2?

select DISTINCT F(b,c),F2(c),d,e,1 from tt where a = 1

Comment by Daniel Black [ 2023-12-11 ]

Also solvable with A:

SELECT b,c,d,e,sum(f) FROM tt where a between 20 and 40 group by b,c,d,e
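The GROUP BY case reduces to the same one-pass idea (again a hypothetical Python sketch, assuming the rows arrive sorted on the GROUP BY columns b,c,d,e, e.g. read in order via the (a,b,c,d,e) index): each group is contiguous, so its aggregate can be emitted at the first row of the next group, with no hashing or extra sorting step.

```python
def grouped_sums(rows):
    """rows: iterable of (b, c, d, e, f), pre-sorted on (b, c, d, e).

    Emits [((b, c, d, e), sum_of_f), ...] in one pass."""
    result = []
    current_key, total = None, 0
    for *key, f in rows:
        key = tuple(key)
        if key != current_key:       # group boundary: flush the finished group
            if current_key is not None:
                result.append((current_key, total))
            current_key, total = key, 0
        total += f
    if current_key is not None:      # flush the last group
        result.append((current_key, total))
    return result

# Mirrors the (b, c, d, e, f) columns of the test data above.
rows = sorted((i % 5, i % 11, i % 15, i % 22, i) for i in range(1, 1001))
groups = grouped_sums(rows)
```

A hash-based group-by would produce the same sums; the sorted variant additionally yields the groups already in order.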

source problem

Generated at Thu Feb 08 10:34:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.