Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30660

Aggregation functions fail to leverage uniqueness property

    XMLWordPrintable

Details

    Description

      Repro-

      alias mdb='/path/to/mariadb-client-binary --defaults-file=/path/to/my.cnf'
      mdb -e 'CREATE DATABASE countdistinct;'
       
      mdb countdistinct << 'EOF'
      CREATE TABLE autoinc (
        c0 INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        c1 int(11) DEFAULT NULL,
        c2 int(11) DEFAULT NULL,
        c3 int(11) DEFAULT NULL,
        KEY `k_c1` (`c1`),
        KEY `k_c2` (`c2`,`c1`),
        KEY `k_c3` (`c3`,`c2`,`c1`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci;
      EOF
       
      # Load 1M rows of data
      for ((i=0;i<1000;i++)); do
        for ((j=0;j<100;j++)); do
          curval=$(( (i * 100) + j ))
          mdb countdistinct -e 'INSERT INTO autoinc (c1) VALUES ('"${curval}"');' &
        done
      done
       
      # Prime buffers/caches
      time mdb countdistinct -Nse 'SELECT c0 FROM autoinc;' >/dev/null
      time mdb countdistinct -Nse 'SELECT c1 FROM autoinc;' >/dev/null
      time mdb countdistinct -Nse 'SELECT c2 FROM autoinc;' >/dev/null
      time mdb countdistinct -Nse 'SELECT c3 FROM autoinc;' >/dev/null
       
      # Test various ways of obtaining COUNT data
      time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
      time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'
      time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'
      time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
      time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
      

      The killer here is the GROUP BY ... HAVING ... query performing much faster than COUNT(DISTINCT... as the two are logically pretty similar in what they're doing.

      Example uses an AUTO_INCREMENT PK to give a "best-case" scenario for this- performance is worse if the PK is unsorted.

      Here are some sample test results-

      time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
      100000
       
      real    0m0.039s
      user    0m0.019s
      sys     0m0.006s
      time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'
      100000
       
      real    0m0.037s
      user    0m0.019s
      sys     0m0.004s
      time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'
      100000
       
      real    0m0.053s
      user    0m0.018s
      sys     0m0.004s
      time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
       
      real    0m0.042s
      user    0m0.019s
      sys     0m0.003s
      time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
      100131
       
      real    0m0.022s
      user    0m0.016s
      sys     0m0.005s

      Note that this slowness seems exclusive to PKs. Running on the roughly identical c1 column yields only a marginal slowdown for COUNT(DISTINCT(c1)) despite the contents of c0 and c1 being almost identical-

      time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
      LECT COUNT(DISTINCT (c1)) FROM autoinc;'
      100000
       
      real    0m0.037s
      user    0m0.021s
      sys     0m0.003s
      time mdb countdistinct -Nse 'SELECT COUNT(c1) FROM autoinc;'
      100000
       
      real    0m0.036s
      user    0m0.019s
      sys     0m0.003s
      time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c1)) FROM autoinc;'
      100000
       
      real    0m0.040s
      user    0m0.018s
      sys     0m0.003s
      time mdb countdistinct -Nse 'SELECT c1, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
       
      real    0m0.040s
      user    0m0.017s
      sys     0m0.004s
      time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
      100131
       
      real    0m0.022s
      user    0m0.018s
      sys     0m0.003s

      And to just confirm the content is comparable as expected-

      mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 ASC LIMIT 5;'
      +----+------+
      | c0 | c1   |
      +----+------+
      |  1 |    0 |
      |  2 |    1 |
      |  3 |    2 |
      |  4 |    3 |
      |  5 |    4 |
      +----+------+
      mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 DESC LIMIT 5;'
      +--------+-------+
      | c0     | c1    |
      +--------+-------+
      | 100000 | 99999 |
      |  99999 | 99998 |
      |  99998 | 99997 |
      |  99997 | 99996 |
      |  99996 | 99995 |
      +--------+-------+

      The good news is if the non-PK results hold up, that handles most realistic cases for COUNT(DISTINCT.... But it does seem like we should figure out why our PK implementation is such an outlier in case it is stealthily affecting other operations.

      Attachments

        Issue Links

          Activity

            People

              oleg.smirnov Oleg Smirnov
              rob.schwyzer@mariadb.com Rob Schwyzer
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.