Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30660

Aggregation functions fail to leverage uniqueness property

Details

    Description

      Repro-

      alias mdb='/path/to/mariadb-client-binary --defaults-file=/path/to/my.cnf'
      mdb -e 'CREATE DATABASE countdistinct;'
       
      mdb countdistinct << 'EOF'
      CREATE TABLE autoinc (
        c0 INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        c1 int(11) DEFAULT NULL,
        c2 int(11) DEFAULT NULL,
        c3 int(11) DEFAULT NULL,
        KEY `k_c1` (`c1`),
        KEY `k_c2` (`c2`,`c1`),
        KEY `k_c3` (`c3`,`c2`,`c1`)
      ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_general_ci;
      EOF
       
      # Load 1M rows of data
      for ((i=0;i<1000;i++)); do
        for ((j=0;j<100;j++)); do
          curval=$(( (i * 100) + j ))
          mdb countdistinct -e 'INSERT INTO autoinc (c1) VALUES ('"${curval}"');' &
        done
      done
       
      # Prime buffers/caches
      time mdb countdistinct -Nse 'SELECT c0 FROM autoinc;' >/dev/null
      time mdb countdistinct -Nse 'SELECT c1 FROM autoinc;' >/dev/null
      time mdb countdistinct -Nse 'SELECT c2 FROM autoinc;' >/dev/null
      time mdb countdistinct -Nse 'SELECT c3 FROM autoinc;' >/dev/null
       
      # Test various ways of obtaining COUNT data
      time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
      time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'
      time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'
      time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
      time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
      

      The killer here is the GROUP BY ... HAVING ... query performing much faster than COUNT(DISTINCT... as the two are logically pretty similar in what they're doing.

      Example uses an AUTO_INCREMENT PK to give a "best-case" scenario for this- performance is worse if the PK is unsorted.

      Here are some sample test results-

      time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
      100000
       
      real    0m0.039s
      user    0m0.019s
      sys     0m0.006s
      time mdb countdistinct -Nse 'SELECT COUNT(c0) FROM autoinc;'
      100000
       
      real    0m0.037s
      user    0m0.019s
      sys     0m0.004s
      time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c0)) FROM autoinc;'
      100000
       
      real    0m0.053s
      user    0m0.018s
      sys     0m0.004s
      time mdb countdistinct -Nse 'SELECT c0, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
       
      real    0m0.042s
      user    0m0.019s
      sys     0m0.003s
      time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
      100131
       
      real    0m0.022s
      user    0m0.016s
      sys     0m0.005s

      Note that this slowness seems exclusive to PKs. Running on the roughly identical c1 column yields only a marginal slowdown for COUNT(DISTINCT(c1)) despite the contents of c0 and c1 being almost identical-

      time mdb countdistinct -Nse 'SELECT COUNT(*) FROM autoinc;'
      LECT COUNT(DISTINCT (c1)) FROM autoinc;'
      100000
       
      real    0m0.037s
      user    0m0.021s
      sys     0m0.003s
      time mdb countdistinct -Nse 'SELECT COUNT(c1) FROM autoinc;'
      100000
       
      real    0m0.036s
      user    0m0.019s
      sys     0m0.003s
      time mdb countdistinct -Nse 'SELECT COUNT(DISTINCT (c1)) FROM autoinc;'
      100000
       
      real    0m0.040s
      user    0m0.018s
      sys     0m0.003s
      time mdb countdistinct -Nse 'SELECT c1, COUNT(*) FROM autoinc GROUP BY 1 HAVING COUNT(*) <> 1;'
       
      real    0m0.040s
      user    0m0.017s
      sys     0m0.004s
      time mdb information_schema -Nse 'SELECT TABLE_ROWS FROM TABLES WHERE TABLE_SCHEMA="countdistinct" AND TABLE_NAME="autoinc";'
      100131
       
      real    0m0.022s
      user    0m0.018s
      sys     0m0.003s

      And to just confirm the content is comparable as expected-

      mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 ASC LIMIT 5;'
      +----+------+
      | c0 | c1   |
      +----+------+
      |  1 |    0 |
      |  2 |    1 |
      |  3 |    2 |
      |  4 |    3 |
      |  5 |    4 |
      +----+------+
      mdb countdistinct -e 'SELECT c0, c1 FROM autoinc ORDER BY c0 DESC LIMIT 5;'
      +--------+-------+
      | c0     | c1    |
      +--------+-------+
      | 100000 | 99999 |
      |  99999 | 99998 |
      |  99998 | 99997 |
      |  99997 | 99996 |
      |  99996 | 99995 |
      +--------+-------+

      The good news is if the non-PK results hold up, that handles most realistic cases for COUNT(DISTINCT.... But it does seem like we should figure out why our PK implementation is such an outlier in case it is stealthily affecting other operations.

      Attachments

        Issue Links

          Activity

            Review input provided in the https://github.com/MariaDB/server/pull/3012.

            psergei Sergei Petrunia added a comment - Review input provided in the https://github.com/MariaDB/server/pull/3012 .
            oleg.smirnov Oleg Smirnov added a comment - - edited

            The fix has been pushed to 10.5.

            oleg.smirnov Oleg Smirnov added a comment - - edited The fix has been pushed to 10.5.
            psergei Sergei Petrunia added a comment - - edited

            Notes for the changelog:

            Generally, computing aggregate function with DISTINCT argument: aggregate_func(DISTINCT col1, col2, ...) requires producing a de-duplicated set of its arguments, which can be CPU-intensive
            When we select from one table the argument list includes the table's PRIMARY (or UNIQUE) key:

            SELECT aggregate_func(DISTINCT t1.primary_key, ...) FROM t1;
            

            then the arguments are guaranteed not to have duplicates. This MDEV detects such cases and allows the optimizer to skip de-duplication.

            psergei Sergei Petrunia added a comment - - edited Notes for the changelog: Generally, computing aggregate function with DISTINCT argument: aggregate_func(DISTINCT col1, col2, ...) requires producing a de-duplicated set of its arguments, which can be CPU-intensive When we select from one table the argument list includes the table's PRIMARY (or UNIQUE) key: SELECT aggregate_func( DISTINCT t1.primary_key, ...) FROM t1; then the arguments are guaranteed not to have duplicates. This MDEV detects such cases and allows the optimizer to skip de-duplication.
            psergei Sergei Petrunia added a comment - Documentation: https://mariadb.com/kb/en/query-optimizations-distinct-removal-in-aggregate-functions/

            The optimization is beneficial, but we're not getting all the benefits we could: the choice of grouping strategy doesn't seem to be affected. Filed this as MDEV-33911.

            psergei Sergei Petrunia added a comment - The optimization is beneficial, but we're not getting all the benefits we could: the choice of grouping strategy doesn't seem to be affected. Filed this as MDEV-33911 .

            People

              oleg.smirnov Oleg Smirnov
              rob.schwyzer@mariadb.com Rob Schwyzer (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.