[MDEV-32870] Improve the performance of COUNT DISTINCT when ran on sorted / indexed columns Created: 2023-11-24 Updated: 2023-12-11 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Optimizer |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major |
| Reporter: | Daniel Black | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Issue Links: |
|
||||||||
| Description |
|
Computing COUNT(DISTINCT <col1>, <col2>, <col3>) traditionally is done via storing the values in a UNIQUE object. The UNIQUE object uses a Red-Black tree with logarithmic complexity for insert and lookup. This means that we achieve a theoretical complexity of O(rows * log(rows)) In that case we only need to check when the tuple (col1, col2, col3) changes and increment the count value. This would lead to a complexity of O(rows), similar to the complexity of COUNT(*) Test case to show the performance penalty:
And here we can see the performance of count(*) case:
Data:
|
| Comments |
| Comment by Sergei Petrunia [ 2023-11-24 ] | ||
|
is the intent to Option A is easy as we know it's always cheaper to do that... | ||
| Comment by Daniel Black [ 2023-11-24 ] | ||
|
I was only intending A. Is this solvable in with minor changes to the existing mechanism?
And the following, for deterministic functions F and F2?
| ||
| Comment by Daniel Black [ 2023-12-11 ] | ||
|
Also solvable with A
|