[MDEV-30766] Issue with NOT IN (subquery that violates "FULL GROUP BY") Created: 2023-03-01  Updated: 2023-03-03  Resolved: 2023-03-03

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Pinhan Zhao Assignee: Sergei Golubchik
Resolution: Not a Bug Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-30779 Assertion `field_max_length <= m_retu... Confirmed
relates to MDEV-10426 ANY_VALUE function as a workaround fo... Open

 Description   

With the following database:

CREATE TABLE PLAYBACK (
  SESSION_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  START_TIME INTEGER,
  END_TIME INTEGER
);
INSERT INTO PLAYBACK VALUES (-1, 0, 1, 1);
INSERT INTO PLAYBACK VALUES (0, 0, 1, 1);
CREATE TABLE ADS (
  AD_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  TIMESTAMP INTEGER
);
INSERT INTO ADS VALUES (0, 0, 1);
INSERT INTO ADS VALUES (1, 0, 1);

After running Q1

-- Q1
SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID;

the output is a table with one column and one row, and the only value is 0:

+-----------------------+
| ANY_VALUE(SESSION_ID) |
+-----------------------+
|                     0 |
+-----------------------+

And we have Q2:

-- Q2
SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID);

Q2 is basically a query of `SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN Q1`

Therefore, we would expect it outputs one row with SESSION_ID of -1 . However, the output is an empty set. We can further investigate this by running Q3:

-- Q3
SELECT SESSION_ID, (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS Q1, SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS 'SESSION_ID NOT IN Q1' FROM PLAYBACK;

The output is:

+------------+------+----------------------+
| SESSION_ID | Q1   | SESSION_ID NOT IN Q1 |
+------------+------+----------------------+
|         -1 |    0 |                    0 |
|          0 |    0 |                    0 |
+------------+------+----------------------+

We can see the result doesn't make sense, because -1 NOT IN (0) shouldn't be false.



 Comments   
Comment by Alice Sherepa [ 2023-03-02 ]

MariaDB does not have ANY_VALUE() function. Could you please add SHOW CREATE FUNCTION as it is in your database?

Comment by Pinhan Zhao [ 2023-03-03 ]

My apologies – the ANY_VALUE functions shouldn't have been in the queries. I've included the revised complete example below:

CREATE TABLE PLAYBACK (
  SESSION_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  START_TIME INTEGER,
  END_TIME INTEGER
);
INSERT INTO PLAYBACK VALUES (-1, 0, 1, 1);
INSERT INTO PLAYBACK VALUES (0, 0, 1, 1);
CREATE TABLE ADS (
  AD_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  TIMESTAMP INTEGER
);
INSERT INTO ADS VALUES (0, 0, 1);
INSERT INTO ADS VALUES (1, 0, 1);
 
-- Q1
SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID;
 
-- Q2
SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN (SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID);
 
-- Q3
SELECT SESSION_ID, (SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS Q1, SESSION_ID NOT IN (SELECT SESSION_ID FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS 'SESSION_ID NOT IN Q1' FROM PLAYBACK;

And without the ANY_VALUE functions we're then able to reproduce this issue.

Thanks.

Comment by Sergei Golubchik [ 2023-03-03 ]

I tend to qualify it as a not a bug. For a query to have well-defined result, SESSION_ID must have a functional dependency on CUSTOMER_ID. In other words, if two rows have the same value of CUSTOMER_ID they must have the same value of SESSION_ID. The way the query is written now it's basically invalid and most RDBMS would throw an error and would not execute it.

MariaDB assumes that the query is valid and have well-defined result, and it optimizes it under this assumption. If you run EXPLAIN EXTENDED you'll see what exactly the optimizer is doing — it omits GROUP BY in a subquery.

One possible way of making this work would be implementing ANY_VALUE() function in MariaDB. This function would make the query valid again by basically telling the optimizer "the column SESSION_ID might have different values in the same group". And then optimizer would know not to perform optimizations that rely on all values being the same (like omitting GROUP BY).

ANY_VALUE() will be implemented in MDEV-10426

Generated at Thu Feb 08 10:18:43 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.