[MDEV-32870] Improve the performance of COUNT DISTINCT when ran on sorted / indexed columns - Jira

XML

Word

Printable

Details

Type: Task
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Optimizer
Labels:
None

Description

Computing COUNT(DISTINCT <col1>, <col2>, <col3>) traditionally is done via storing the values in a UNIQUE object. The UNIQUE object uses a Red-Black tree with logarithmic complexity for insert and lookup.

This means that we achieve a theoretical complexity of O(rows * log(rows))
There is a way to speed this up, if we know that <col1>, <col2>, <col3> are already sorted.

In that case we only need to check when the tuple (col1, col2, col3) changes and increment the count value. This would lead to a complexity of O(rows), similar to the complexity of COUNT(*)

Test case to show the performance penalty:

analyze format=JSON

select count(DISTINCT c,d,e),b from tt where a = 1 group by b\G

*************************** 1. row ***************************

ANALYZE: {

  "query_block": {

    "select_id": 1,

    "r_loops": 1,

    "r_total_time_ms": 518.9,

    "table": {

      "table_name": "tt",

      "access_type": "ref",

      "possible_keys": ["tt_a_IDX"],

      "key": "tt_a_IDX",

      "key_length": "4",

      "used_key_parts": ["a"],

      "ref": ["const"],

      "r_loops": 1,

      "rows": 248763,

      "r_rows": 500000,

      "r_total_time_ms": 99.395,

      "filtered": 100,

      "r_filtered": 100,

      "attached_condition": "tt.a <=> 1",

      "using_index": true

1 row in set (0.523 sec)

And here we can see the performance of count(*) case:

analyze format=JSON

select count(*),b from tt where a = 1 group by b\G

*************************** 1. row ***************************

ANALYZE: {

  "query_block": {

    "select_id": 1,

    "r_loops": 1,

    "r_total_time_ms": 114.54,

    "table": {

      "table_name": "tt",

      "access_type": "ref",

      "possible_keys": ["tt_a_IDX"],

      "key": "tt_a_IDX",

      "key_length": "4",

      "used_key_parts": ["a"],

      "ref": ["const"],

      "r_loops": 1,

      "rows": 248763,

      "r_rows": 500000,

      "r_total_time_ms": 97.696,

      "filtered": 100,

      "r_filtered": 100,

      "attached_condition": "tt.a <=> 1",

      "using_index": true

1 row in set (0.115 sec)

Data:

CREATE TABLE tt (a int, b varchar(32), c varchar(64), d varchar(64), e varchar(64), f int)

CREATE INDEX tt_a_IDX USING BTREE ON tt (a,b,c,d,e);

insert into tt select 1, mod(seq, 5), mod(seq, 11), mod(seq, 15), mod(seq, 22), seq from seq_1_to_1000000;

Attachments

Issue Links

relates to

MDEV-30660 Aggregation functions fail to leverage uniqueness property

Closed

links to

SO original problem

Activity

People

Assignee:: Unassigned

Reporter:: Daniel Black

Votes:: 0 Vote for this issue

Watchers:: 2 Start watching this issue

Dates

Created:: 2023-11-24 07:18

Updated:: 2023-12-11 11:37

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.