Details
Description
Repro-
alias mdb='/path/to/mariadb-client-binary --defaults-file=/path/to/my.cnf'
|
mdb -e 'CREATE DATABASE countdistinct;'
|
|
mdb countdistinct << 'EOF'
|
CREATE TABLE autoinc (
|
c0 INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
|
c1 int(11) DEFAULT NULL,
|
c2 int(11) DEFAULT NULL,
|
c3 int(11) DEFAULT NULL,
|
KEY `k_c1` (`c1`),
|
KEY `k_c2` (`c2`,`c1`),
|
KEY `k_c3` (`c3`,`c2`,`c1`)
|
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci;
|
EOF
|
|
# Load 1M rows of data
|
for ((i=0;i<1000;i++)); do
|
for ((j=0;j<100;j++)); do
|
curval=$(( (i * 100) + j ))
|
mdb countdistinct -e 'INSERT INTO autoinc (c1) VALUES ('"${curval}"');' &
|
done
|
done
|
|
# Prime buffers/caches
|
time mdb countdistinct -Nse 'SELECT c0 FROM autoinc;' >/dev/null
|
time mdb countdistinct -Nse 'SELECT c1 FROM autoinc;' >/dev/null
|
time mdb countdistinct -Nse 'SELECT c2 FROM autoinc;' >/dev/null
|
time mdb countdistinct -Nse 'SELECT c3 FROM autoinc;' >/dev/null
|
|
# Test various ways of obtaining COUNT data
|
time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
|
time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'
|
time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'
|
time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
|
time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
|
The killer here is the GROUP BY ... HAVING ... query performing much faster than COUNT(DISTINCT... as the two are logically pretty similar in what they're doing.
Example uses an AUTO_INCREMENT PK to give a "best-case" scenario for this- performance is worse if the PK is unsorted.
Here are some sample test results-
time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
|
100000
|
|
real 0m0.039s
|
user 0m0.019s
|
sys 0m0.006s
|
time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'
|
100000
|
|
real 0m0.037s
|
user 0m0.019s
|
sys 0m0.004s
|
time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'
|
100000
|
|
real 0m0.053s
|
user 0m0.018s
|
sys 0m0.004s
|
time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
|
|
real 0m0.042s
|
user 0m0.019s
|
sys 0m0.003s
|
time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
|
100131
|
|
real 0m0.022s
|
user 0m0.016s
|
sys 0m0.005s
|
Note that this slowness seems exclusive to PKs. Running on the roughly identical c1 column yields only a marginal slowdown for COUNT(DISTINCT(c1)) despite the contents of c0 and c1 being almost identical-
time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
|
LECT COUNT(DISTINCT (c1)) FROM autoinc;'
|
100000
|
|
real 0m0.037s
|
user 0m0.021s
|
sys 0m0.003s
|
time mdb countdistinct -Nse 'SELECT COUNT(c1) FROM autoinc;'
|
100000
|
|
real 0m0.036s
|
user 0m0.019s
|
sys 0m0.003s
|
time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c1)) FROM autoinc;'
|
100000
|
|
real 0m0.040s
|
user 0m0.018s
|
sys 0m0.003s
|
time mdb countdistinct -Nse 'SELECT c1, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
|
|
real 0m0.040s
|
user 0m0.017s
|
sys 0m0.004s
|
time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
|
100131
|
|
real 0m0.022s
|
user 0m0.018s
|
sys 0m0.003s
|
And to just confirm the content is comparable as expected-
mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 ASC LIMIT 5;'
|
+----+------+
|
| c0 | c1 |
|
+----+------+
|
| 1 | 0 |
|
| 2 | 1 |
|
| 3 | 2 |
|
| 4 | 3 |
|
| 5 | 4 |
|
+----+------+
|
mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 DESC LIMIT 5;'
|
+--------+-------+
|
| c0 | c1 |
|
+--------+-------+
|
| 100000 | 99999 |
|
| 99999 | 99998 |
|
| 99998 | 99997 |
|
| 99997 | 99996 |
|
| 99996 | 99995 |
|
+--------+-------+
|
The good news is if the non-PK results hold up, that handles most realistic cases for COUNT(DISTINCT.... But it does seem like we should figure out why our PK implementation is such an outlier in case it is stealthily affecting other operations.
Attachments
Issue Links
- duplicates
-
MDEV-10922 optimization on count distinct on primary key
- Confirmed
- relates to
-
MDEV-32870 Improve the performance of COUNT DISTINCT when ran on sorted / indexed columns
- Open
-
MDEV-33911 DISTINCT removal in aggregate func do not affect grouping strategy
- Open
- mentioned in
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...
-
Page Loading...