[MDEV-30660] Aggregation functions fail to leverage uniqueness property - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.6.11
Fix Version/s: 10.5.25, 10.11.8, 10.6.18, 11.0.6, 11.1.5, 11.2.4, 11.4.2
Component/s: Optimizer
Labels:
None
Environment:
CentOS 7 baremetal

Description

Repro-

alias mdb='/path/to/mariadb-client-binary --defaults-file=/path/to/my.cnf'

mdb -e 'CREATE DATABASE countdistinct;'

mdb countdistinct << 'EOF'

CREATE TABLE autoinc (

  c0 INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,

  c1 int(11) DEFAULT NULL,

  c2 int(11) DEFAULT NULL,

  c3 int(11) DEFAULT NULL,

  KEY `k_c1` (`c1`),

  KEY `k_c2` (`c2`,`c1`),

  KEY `k_c3` (`c3`,`c2`,`c1`)

) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci;

EOF

# Load 1M rows of data

for ((i=0;i<1000;i++)); do

  for ((j=0;j<100;j++)); do

    curval=$(( (i * 100) + j ))

    mdb countdistinct -e 'INSERT INTO autoinc (c1) VALUES ('"${curval}"');' &

  done

done

# Prime buffers/caches

time mdb countdistinct -Nse 'SELECT c0 FROM autoinc;' >/dev/null

time mdb countdistinct -Nse 'SELECT c1 FROM autoinc;' >/dev/null

time mdb countdistinct -Nse 'SELECT c2 FROM autoinc;' >/dev/null

time mdb countdistinct -Nse 'SELECT c3 FROM autoinc;' >/dev/null

# Test various ways of obtaining COUNT data

time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'

time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'

time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'

time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'

time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'

The killer here is the GROUP BY ... HAVING ... query performing much faster than COUNT(DISTINCT... as the two are logically pretty similar in what they're doing.

Example uses an AUTO_INCREMENT PK to give a "best-case" scenario for this- performance is worse if the PK is unsorted.

Here are some sample test results-

time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'

real    0m0.039s

user    0m0.019s

sys     0m0.006s

time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'

real    0m0.037s

user    0m0.019s

sys     0m0.004s

time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'

real    0m0.053s

user    0m0.018s

sys     0m0.004s

time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'

real    0m0.042s

user    0m0.019s

sys     0m0.003s

time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'

real    0m0.022s

user    0m0.016s

sys     0m0.005s

Note that this slowness seems exclusive to PKs. Running on the roughly identical c1 column yields only a marginal slowdown for COUNT(DISTINCT(c1)) despite the contents of c0 and c1 being almost identical-

time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'

LECT COUNT(DISTINCT (c1)) FROM autoinc;'

real    0m0.037s

user    0m0.021s

sys     0m0.003s

time mdb countdistinct -Nse 'SELECT COUNT(c1) FROM autoinc;'

real    0m0.036s

user    0m0.019s

sys     0m0.003s

time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c1)) FROM autoinc;'

real    0m0.040s

user    0m0.018s

sys     0m0.003s

time mdb countdistinct -Nse 'SELECT c1, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'

real    0m0.040s

user    0m0.017s

sys     0m0.004s

time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'

real    0m0.022s

user    0m0.018s

sys     0m0.003s

And to just confirm the content is comparable as expected-

mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 ASC LIMIT 5;'

+----+------+

| c0 | c1   |

+----+------+

|  1 |    0 |

|  2 |    1 |

|  3 |    2 |

|  4 |    3 |

|  5 |    4 |

+----+------+

mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 DESC LIMIT 5;'

+--------+-------+

| c0     | c1    |

+--------+-------+

| 100000 | 99999 |

|  99999 | 99998 |

|  99998 | 99997 |

|  99997 | 99996 |

|  99996 | 99995 |

+--------+-------+

The good news is if the non-PK results hold up, that handles most realistic cases for COUNT(DISTINCT.... But it does seem like we should figure out why our PK implementation is such an outlier in case it is stealthily affecting other operations.

Attachments

Issue Links

duplicates

MDEV-10922 optimization on count distinct on primary key

Confirmed

relates to

MDEV-32870 Improve the performance of COUNT DISTINCT when ran on sorted / indexed columns

Open

MDEV-33911 DISTINCT removal in aggregate func do not affect grouping strategy

Open

MDEV-7836 ANALYZE FORMAT=JSON should provide info about GROUP BY

Stalled

mentioned in: Page Failed to load; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...; Page Loading...

(50 mentioned in)

Activity

People

Assignee:: Oleg Smirnov

Reporter:: Rob Schwyzer (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 2023-02-15 21:45

Updated:: 2024-12-20 05:00

Resolved:: 2024-02-11 04:14

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server