[MCOL-5430] Window Functions in projection with GROUP BY delivers wrong results - Jira

XML

Word

Printable

Details

Type: Bug
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Affects Version/s: 22.08.7
Fix Version/s: 23.10
Component/s: ExeMgr, PrimProc
Labels:
- rm_invalid_data

Sprint:
2023-4, 2023-5, 2023-6, 2023-7, 2023-8, 2023-10, 2023-11

Description

Consider the example:

CREATE TABLE empsalary (

    depname varchar(100),

    empno bigint,

    salary int,

    enroll_date date

)engine=columnstore;

insert into empsalary values ('develop'  , 10, 5200, '2007-08-01');

insert into empsalary values ('sales'    ,  1, 5000, '2006-10-01');

insert into empsalary values ('personnel',  5, 3500, '2007-12-10');

insert into empsalary values ('sales'    ,  4, 4800, '2007-08-08');

insert into empsalary values ('personnel',  2, 3900, '2006-12-23');

insert into empsalary values ('develop'  ,  7, 4200, '2008-01-01');

insert into empsalary values ('develop'  ,  9, 4500, '2008-01-01');

insert into empsalary values ('sales'    ,  3, 4800, '2007-08-01');

insert into empsalary values ('develop'  ,  8, 6000, '2006-10-01');

insert into empsalary values ('develop'  , 11, 5200, '2007-08-15');

insert into empsalary values ('develop'  , 12, null, '2008-08-09');

Here is the SELECT that differs b/w MDB and MCS. Here is MCS version.

MariaDB [test]> select avg(salary),depname, moda(salary) over(partition by depname order by enroll_date) from empsalary group by depname;

+-------------+-----------+--------------------------------------------------------------+

| avg(salary) | depname   | moda(salary) over(partition by depname order by enroll_date) |

+-------------+-----------+--------------------------------------------------------------+

|   6000.0000 | develop   | 6000                                                         |

|   5200.0000 | develop   | 5200                                                         |

|   5200.0000 | develop   | 5200                                                         |

|   4200.0000 | develop   | 5200                                                         |

|   4500.0000 | develop   | 5200                                                         |

|        NULL | develop   | 5200                                                         |

|   3900.0000 | personnel | 3900                                                         |

|   3500.0000 | personnel | 3500                                                         |

|   5000.0000 | sales     | 5000                                                         |

|   4800.0000 | sales     | 4800                                                         |

|   4800.0000 | sales     | 4800                                                         |

+-------------+-----------+--------------------------------------------------------------+

11 rows in set, 1 warning (0.080 sec)

Here is MDB.

MariaDB [test]> select avg(salary),depname, moda(salary) over(partition by depname order by enroll_date) from emp group by depname;

+-------------+-----------+--------------------------------------------------------------+

| avg(salary) | depname   | moda(salary) over(partition by depname order by enroll_date) |

+-------------+-----------+--------------------------------------------------------------+

|   5020.0000 | develop   | 5200                                                         |

|   3700.0000 | personnel | 3500                                                         |

|   4866.6667 | sales     | 5000                                                         |

+-------------+-----------+--------------------------------------------------------------+

3 rows in set (0.002 sec)

The origins of this incorrect behavior is the fact that RowStorage::hash gets an incorrect number of key columns to hash on thus MCS treats every row it receives from scanning in BPP::execute as a unique one.
The calculation of key columns number is buried deep in CSEP to JobList translation, namely in addProjectStepsToBps that is called inside combineJobStepsByTable. In simple words TAS uses the set of columns returned the previous delivery/projection step as TAS input. It removes all aggregate columns from the set and those that left are TAS key columns participate in hashing.
In case when WF is projection combineJobStepsByTable run by TAS sees aux columns added by WF so TAS hashing miscalculates every row hash making all rows potentially unique and this alters TAS result set.

Attachments

Issue Links

relates to

MCOL-5187 OOM happening when querying large datasets and using distinct

Closed

Activity

People

Assignee:: Roman

Reporter:: Roman

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 2023-02-20 11:39

Updated:: 2023-12-22 13:55

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.