Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32870

Improve the performance of COUNT DISTINCT when ran on sorted / indexed columns

    XMLWordPrintable

Details

    • Task
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • None
    • Optimizer
    • None

    Description

      Computing COUNT(DISTINCT <col1>, <col2>, <col3>) traditionally is done via storing the values in a UNIQUE object. The UNIQUE object uses a Red-Black tree with logarithmic complexity for insert and lookup.

      This means that we achieve a theoretical complexity of O(rows * log(rows))
      There is a way to speed this up, if we know that <col1>, <col2>, <col3> are already sorted.

      In that case we only need to check when the tuple (col1, col2, col3) changes and increment the count value. This would lead to a complexity of O(rows), similar to the complexity of COUNT(*)

      Test case to show the performance penalty:

      analyze format=JSON 
      select count(DISTINCT c,d,e),b from tt where a = 1 group by b\G
      *************************** 1. row ***************************
      ANALYZE: {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 518.9,
          "table": {
            "table_name": "tt",
            "access_type": "ref",
            "possible_keys": ["tt_a_IDX"],
            "key": "tt_a_IDX",
            "key_length": "4",
            "used_key_parts": ["a"],
            "ref": ["const"],
            "r_loops": 1,
            "rows": 248763,
            "r_rows": 500000,
            "r_total_time_ms": 99.395,
            "filtered": 100,
            "r_filtered": 100,
            "attached_condition": "tt.a <=> 1",
            "using_index": true
          }
        }
      }
      1 row in set (0.523 sec)
      

      And here we can see the performance of count(*) case:

      analyze format=JSON
      select count(*),b from tt where a = 1 group by b\G
      *************************** 1. row ***************************
      ANALYZE: {
        "query_block": {
          "select_id": 1,
          "r_loops": 1,
          "r_total_time_ms": 114.54,
          "table": {
            "table_name": "tt",
            "access_type": "ref",
            "possible_keys": ["tt_a_IDX"],
            "key": "tt_a_IDX",
            "key_length": "4",
            "used_key_parts": ["a"],
            "ref": ["const"],
            "r_loops": 1,
            "rows": 248763,
            "r_rows": 500000,
            "r_total_time_ms": 97.696,
            "filtered": 100,
            "r_filtered": 100,
            "attached_condition": "tt.a <=> 1",
            "using_index": true
          }
        }
      }
      1 row in set (0.115 sec)
      

      Data:

      CREATE TABLE tt (a int, b varchar(32), c varchar(64), d varchar(64), e varchar(64), f int) 
      CREATE INDEX tt_a_IDX USING BTREE ON tt (a,b,c,d,e);
      insert into tt select 1, mod(seq, 5), mod(seq, 11), mod(seq, 15), mod(seq, 22), seq from seq_1_to_1000000;
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.