[MDEV-30660] Aggregation functions fail to leverage uniqueness property - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6.11
Fix Version/s: 10.5.25, 10.6.18, 10.11.8, 11.0.6, 11.2.4, 11.1.5, 11.4.2
Component/s: Optimizer
Labels:
None
Environment:
CentOS 7 baremetal

Description

Repro-

alias mdb='/path/to/mariadb-client-binary --defaults-file=/path/to/my.cnf'

mdb -e 'CREATE DATABASE countdistinct;'

mdb countdistinct << 'EOF'

CREATE TABLE autoinc (

  c0 INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,

  c1 int(11) DEFAULT NULL,

  c2 int(11) DEFAULT NULL,

  c3 int(11) DEFAULT NULL,

  KEY `k_c1` (`c1`),

  KEY `k_c2` (`c2`,`c1`),

  KEY `k_c3` (`c3`,`c2`,`c1`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci;

EOF

# Load 1M rows of data

for ((i=0;i<1000;i++)); do

  for ((j=0;j<100;j++)); do

    curval=$(( (i * 100) + j ))

    mdb countdistinct -e 'INSERT INTO autoinc (c1) VALUES ('"${curval}"');' &

  done

done

# Prime buffers/caches

time mdb countdistinct -Nse 'SELECT c0 FROM autoinc;' >/dev/null

time mdb countdistinct -Nse 'SELECT c1 FROM autoinc;' >/dev/null

time mdb countdistinct -Nse 'SELECT c2 FROM autoinc;' >/dev/null

time mdb countdistinct -Nse 'SELECT c3 FROM autoinc;' >/dev/null

# Test various ways of obtaining COUNT data

time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'

time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'

time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'

time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'

time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'

The killer here is the GROUP BY ... HAVING ... query performing much faster than COUNT(DISTINCT... as the two are logically pretty similar in what they're doing.

Example uses an AUTO_INCREMENT PK to give a "best-case" scenario for this- performance is worse if the PK is unsorted.

Here are some sample test results-

time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'

real    0m0.039s

user    0m0.019s

sys     0m0.006s

time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'

real    0m0.037s

user    0m0.019s

sys     0m0.004s

time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'

real    0m0.053s

user    0m0.018s

sys     0m0.004s

time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'

real    0m0.042s

user    0m0.019s

sys     0m0.003s

time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'

real    0m0.022s

user    0m0.016s

sys     0m0.005s

Note that this slowness seems exclusive to PKs. Running on the roughly identical c1 column yields only a marginal slowdown for COUNT(DISTINCT(c1)) despite the contents of c0 and c1 being almost identical-

time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'

LECT COUNT(DISTINCT (c1)) FROM autoinc;'

real    0m0.037s

user    0m0.021s

sys     0m0.003s

time mdb countdistinct -Nse 'SELECT COUNT(c1) FROM autoinc;'

real    0m0.036s

user    0m0.019s

sys     0m0.003s

time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c1)) FROM autoinc;'

real    0m0.040s

user    0m0.018s

sys     0m0.003s

time mdb countdistinct -Nse 'SELECT c1, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'

real    0m0.040s

user    0m0.017s

sys     0m0.004s

time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'

real    0m0.022s

user    0m0.018s

sys     0m0.003s

And to just confirm the content is comparable as expected-

mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 ASC LIMIT 5;'

+----+------+

| c0 | c1   |

+----+------+

|  1 |    0 |

|  2 |    1 |

|  3 |    2 |

|  4 |    3 |

|  5 |    4 |

+----+------+

mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 DESC LIMIT 5;'

+--------+-------+

| c0     | c1    |

+--------+-------+

| 100000 | 99999 |

|  99999 | 99998 |

|  99998 | 99997 |

|  99997 | 99996 |

|  99996 | 99995 |

+--------+-------+

The good news is if the non-PK results hold up, that handles most realistic cases for COUNT(DISTINCT.... But it does seem like we should figure out why our PK implementation is such an outlier in case it is stealthily affecting other operations.

Attachments

Issue Links

duplicates

MDEV-10922 optimization on count distinct on primary key

Confirmed

relates to

MDEV-32870 Improve the performance of COUNT DISTINCT when ran on sorted / indexed columns

Open

MDEV-33911 DISTINCT removal in aggregate func do not affect grouping strategy

Open

MDEV-7836 ANALYZE FORMAT=JSON should provide info about GROUP BY

Stalled

mentioned in: Page Failed to load; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...

(50 mentioned in)

Activity

Ascending order - Click to sort in descending order

View 12 older comments

Sergei Petrunia added a comment - 2024-02-06 10:31

Review input provided in the https://github.com/MariaDB/server/pull/3012.

Sergei Petrunia added a comment - 2024-02-06 10:31 Review input provided in the https://github.com/MariaDB/server/pull/3012 .

Oleg Smirnov added a comment - 2024-02-11 04:14 - edited

The fix has been pushed to 10.5.

Oleg Smirnov added a comment - 2024-02-11 04:14 - edited The fix has been pushed to 10.5.

Sergei Petrunia added a comment - 2024-03-19 09:39 - edited

Notes for the changelog:

Generally, computing aggregate function with DISTINCT argument: aggregate_func(DISTINCT col1, col2, ...) requires producing a de-duplicated set of its arguments, which can be CPU-intensive
When we select from one table the argument list includes the table's PRIMARY (or UNIQUE) key:

SELECT aggregate_func(DISTINCT t1.primary_key, ...) FROM t1;

then the arguments are guaranteed not to have duplicates. This MDEV detects such cases and allows the optimizer to skip de-duplication.

Sergei Petrunia added a comment - 2024-03-19 09:39 - edited Notes for the changelog: Generally, computing aggregate function with DISTINCT argument: aggregate_func(DISTINCT col1, col2, ...) requires producing a de-duplicated set of its arguments, which can be CPU-intensive When we select from one table the argument list includes the table's PRIMARY (or UNIQUE) key: SELECT aggregate_func( DISTINCT t1.primary_key, ...) FROM t1; then the arguments are guaranteed not to have duplicates. This MDEV detects such cases and allows the optimizer to skip de-duplication.

Sergei Petrunia added a comment - 2024-04-15 08:24

Documentation:
https://mariadb.com/kb/en/query-optimizations-distinct-removal-in-aggregate-functions/

Sergei Petrunia added a comment - 2024-04-15 08:24 Documentation: https://mariadb.com/kb/en/query-optimizations-distinct-removal-in-aggregate-functions/

Sergei Petrunia added a comment - 2024-04-15 09:30

The optimization is beneficial, but we're not getting all the benefits we could: the choice of grouping strategy doesn't seem to be affected. Filed this as MDEV-33911.

Sergei Petrunia added a comment - 2024-04-15 09:30 The optimization is beneficial, but we're not getting all the benefits we could: the choice of grouping strategy doesn't seem to be affected. Filed this as MDEV-33911 .

People

Assignee:: Oleg Smirnov

Reporter:: Rob Schwyzer (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 2023-02-15 21:45

Updated:: 2024-12-20 05:00

Resolved:: 2024-02-11 04:14

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server