Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-30766

Issue with NOT IN (subquery that violates "FULL GROUP BY")

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Not a Bug
    • 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL), 10.8(EOL), 10.9(EOL), 10.10(EOL), 10.11, 11.0(EOL)
    • N/A
    • Optimizer
    • None

    Description

      With the following database:

      CREATE TABLE PLAYBACK (
        SESSION_ID INTEGER primary key,
        CUSTOMER_ID INTEGER,
        START_TIME INTEGER,
        END_TIME INTEGER
      );
      INSERT INTO PLAYBACK VALUES (-1, 0, 1, 1);
      INSERT INTO PLAYBACK VALUES (0, 0, 1, 1);
      CREATE TABLE ADS (
        AD_ID INTEGER primary key,
        CUSTOMER_ID INTEGER,
        TIMESTAMP INTEGER
      );
      INSERT INTO ADS VALUES (0, 0, 1);
      INSERT INTO ADS VALUES (1, 0, 1);
      

      After running Q1

      -- Q1
      SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID;
      

      the output is a table with one column and one row, and the only value is 0:

      +-----------------------+
      | ANY_VALUE(SESSION_ID) |
      +-----------------------+
      |                     0 |
      +-----------------------+
      

      And we have Q2:

      -- Q2
      SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID);
      

      Q2 is basically a query of `SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN Q1`

      Therefore, we would expect it outputs one row with SESSION_ID of -1 . However, the output is an empty set. We can further investigate this by running Q3:

      -- Q3
      SELECT SESSION_ID, (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS Q1, SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS 'SESSION_ID NOT IN Q1' FROM PLAYBACK;
      

      The output is:

      +------------+------+----------------------+
      | SESSION_ID | Q1   | SESSION_ID NOT IN Q1 |
      +------------+------+----------------------+
      |         -1 |    0 |                    0 |
      |          0 |    0 |                    0 |
      +------------+------+----------------------+
      

      We can see the result doesn't make sense, because -1 NOT IN (0) shouldn't be false.

      Attachments

        Issue Links

          Activity

            alice Alice Sherepa added a comment - - edited

            MariaDB does not have ANY_VALUE() function. Could you please add SHOW CREATE FUNCTION as it is in your database?

            alice Alice Sherepa added a comment - - edited MariaDB does not have ANY_VALUE() function. Could you please add SHOW CREATE FUNCTION as it is in your database?
            pinhan Pinhan Zhao added a comment - - edited

            My apologies – the ANY_VALUE functions shouldn't have been in the queries. I've included the revised complete example below:

            CREATE TABLE PLAYBACK (
              SESSION_ID INTEGER primary key,
              CUSTOMER_ID INTEGER,
              START_TIME INTEGER,
              END_TIME INTEGER
            );
            INSERT INTO PLAYBACK VALUES (-1, 0, 1, 1);
            INSERT INTO PLAYBACK VALUES (0, 0, 1, 1);
            CREATE TABLE ADS (
              AD_ID INTEGER primary key,
              CUSTOMER_ID INTEGER,
              TIMESTAMP INTEGER
            );
            INSERT INTO ADS VALUES (0, 0, 1);
            INSERT INTO ADS VALUES (1, 0, 1);
             
            -- Q1
            SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID;
             
            -- Q2
            SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN (SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID);
             
            -- Q3
            SELECT SESSION_ID, (SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS Q1, SESSION_ID NOT IN (SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS 'SESSION_ID NOT IN Q1' FROM PLAYBACK;
            

            And without the ANY_VALUE functions we're then able to reproduce this issue.

            Thanks.

            pinhan Pinhan Zhao added a comment - - edited My apologies – the ANY_VALUE functions shouldn't have been in the queries. I've included the revised complete example below: CREATE TABLE PLAYBACK ( SESSION_ID INTEGER primary key , CUSTOMER_ID INTEGER , START_TIME INTEGER , END_TIME INTEGER ); INSERT INTO PLAYBACK VALUES (-1, 0, 1, 1); INSERT INTO PLAYBACK VALUES (0, 0, 1, 1); CREATE TABLE ADS ( AD_ID INTEGER primary key , CUSTOMER_ID INTEGER , TIMESTAMP INTEGER ); INSERT INTO ADS VALUES (0, 0, 1); INSERT INTO ADS VALUES (1, 0, 1);   -- Q1 SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID;   -- Q2 SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN ( SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID);   -- Q3 SELECT SESSION_ID, ( SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS Q1, SESSION_ID NOT IN ( SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS 'SESSION_ID NOT IN Q1' FROM PLAYBACK; And without the ANY_VALUE functions we're then able to reproduce this issue. Thanks.
            serg Sergei Golubchik added a comment - - edited

            I tend to qualify it as a not a bug. For a query to have well-defined result, SESSION_ID must have a functional dependency on CUSTOMER_ID. In other words, if two rows have the same value of CUSTOMER_ID they must have the same value of SESSION_ID. The way the query is written now it's basically invalid and most RDBMS would throw an error and would not execute it.

            MariaDB assumes that the query is valid and have well-defined result, and it optimizes it under this assumption. If you run EXPLAIN EXTENDED you'll see what exactly the optimizer is doing — it omits GROUP BY in a subquery.

            One possible way of making this work would be implementing ANY_VALUE() function in MariaDB. This function would make the query valid again by basically telling the optimizer "the column SESSION_ID might have different values in the same group". And then optimizer would know not to perform optimizations that rely on all values being the same (like omitting GROUP BY).

            ANY_VALUE() will be implemented in MDEV-10426

            serg Sergei Golubchik added a comment - - edited I tend to qualify it as a not a bug. For a query to have well-defined result, SESSION_ID must have a functional dependency on CUSTOMER_ID . In other words, if two rows have the same value of CUSTOMER_ID they must have the same value of SESSION_ID . The way the query is written now it's basically invalid and most RDBMS would throw an error and would not execute it. MariaDB assumes that the query is valid and have well-defined result, and it optimizes it under this assumption. If you run EXPLAIN EXTENDED you'll see what exactly the optimizer is doing — it omits GROUP BY in a subquery. One possible way of making this work would be implementing ANY_VALUE() function in MariaDB. This function would make the query valid again by basically telling the optimizer "the column SESSION_ID might have different values in the same group". And then optimizer would know not to perform optimizations that rely on all values being the same (like omitting GROUP BY ). ANY_VALUE() will be implemented in MDEV-10426

            People

              serg Sergei Golubchik
              pinhan Pinhan Zhao
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.