[MCOL-3492] Do DISTINCT in it's proper order -- after Window Functions and before UNION Created: 2019-09-09  Updated: 2020-08-25  Due: 2019-09-25  Resolved: 2019-12-11

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr
Affects Version/s: 1.2.5
Fix Version/s: 1.2.6

Type: Bug Priority: Major
Reporter: David Hall (Inactive) Assignee: Daniel Lee (Inactive)
Resolution: Fixed Votes: 0
Labels: None

Sprint: 2019-06

 Description   

Currently, DISTINCT is implemented by creating a GROUP BY with all the non-aggregate columns. This causes DISTINCT to be executed before Window Functions, and it includes the partition and order by columns from the window function. The standard is do the uniqueness after Window Functions, just before UNION (if any). The results are similar, unless Window Functions are involved, in which case, the DISTINCT being executed before Window Functions deletes rows that should be included in the Window Function processing, resulting in the wrong answer.

This page shows the expected order of SQL processing: https://www.eversql.com/sql-order-of-operations-sql-query-order-of-execution/



 Comments   
Comment by David Hall (Inactive) [ 2019-09-12 ]

TupleAnnexStep already has this capability. By disabling the DISTINCT as aggregate feature in the presence of Window Functions, we get the proper answer.

Comment by David Hall (Inactive) [ 2019-09-12 ]

regressiontest PR #139

Comment by Daniel Lee (Inactive) [ 2019-12-11 ]

build tested: 1.2.6-1

engine commit:
d4173ef

I don't know if this is a internal logic change, or a change that corrects results output. I tried the following test case from the developer on 1.2.5-1, for both ColumnStore and InnoDB tables (1GB D BT3 database) and got identical results (See below). I then tried the same test case on 1.2.6-1 and also got the same results.

QA needs some input from development on the identified issue and implemented fix.

Thanks.

MariaDB [tpch1]> select count from lineitem where l_orderkey > 5996000;
----------

count

----------

4012

----------
1 row in set (0.155 sec)

MariaDB [tpch1]> select distinct l_orderkey as ord, sum(l_quantity) over (partition by l_orderkey) as qty_sum from lineitem where l_orderkey > 5996000;
----------------+

ord qty_sum

----------------+
.
.
.

5999972 80.00
5999973 50.00
5999974 70.00
5999975 57.00
6000000 33.00

----------------+
1000 rows in set (0.263 sec)

Removed distinct

MariaDB [tpch1]> select l_orderkey as ord, sum(l_quantity) over (partition by l_orderkey) as qty_sum from lineitem where l_orderkey > 5996000;
----------------+

ord qty_sum

----------------+
.
.
.

5999975 57.00
5999975 57.00
5999975 57.00
6000000 33.00
6000000 33.00

----------------+
4012 rows in set (0.177 sec)

Comment by David Hall (Inactive) [ 2019-12-11 ]

This test case tests the modified code. All looks well.

Generated at Thu Feb 08 02:43:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.