[MCOL-5522] PM join exceeded the join match count -invalid result Created: 2023-06-29  Updated: 2024-01-10  Resolved: 2023-08-07

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 23.02.3
Fix Version/s: 23.10.0

Type: Bug Priority: Major
Reporter: Allen Herrera Assignee: Denis Khalikov
Resolution: Fixed Votes: 0
Labels: rm_invalid_data

Attachments: PNG File CS0466747_compare_columnstore_rows_to_innodb_rows.png     PNG File RightvsWrongResult.png     Zip Archive generate_csv_to_reproduce_bad_rowcounts.zip    
Issue Links:
Duplicate
duplicates MCOL-5543 Aggregated count on left join result ... Closed
Sprint: 2023-7, 2023-8
Assigned for Review: Roman Roman

 Description   

Reproduction

SET columnstore_s3_region='us-west-2';
SET columnstore_s3_key='xxxxxxx';
SET columnstore_s3_secret='xxxxxxxxf';
 
DATABASE="test"
mariadb -e "create database if not exists $DATABASE; use $DATABASE;"
mariadb $DATABASE -e "CREATE TABLE IF NOT EXISTS lu_proc (  PROC_CD  varchar(25) NOT NULL,  PROC_DESC  varchar(200) NOT NULL,  PROC_GRP_CD  varchar(2) NOT NULL,  SERV_CTG_CD  varchar(2) DEFAULT NULL) ENGINE=Columnstore DEFAULT CHARSET=latin1;";
mariadb $DATABASE -e "CREATE TABLE IF NOT EXISTS mdcdm_bill_line (  SERV_YR  smallint(6) NOT NULL,  PROC_CD_DER  varchar(25) DEFAULT NULL,  PROC_GRP_CD_DER  varchar(2) DEFAULT NULL,  SERV_CTG_CD  varchar(2) DEFAULT NULL) ENGINE=Columnstore DEFAULT CHARSET=latin1;";
CALL columnstore_info.load_from_s3('xxxxxxxx','lu_proc.csv','test', 'lu_proc', ',', '"', '^');
CALL columnstore_info.load_from_s3('xxxxxxxx','mdcdm_bill_line-quarter.csv','test', 'mdcdm_bill_line', ',', '"', '^');
 
# 1.4mill and 256 mill
mariadb $DATABASE -q -e "select count(*) from lu_proc; select count(*) from mdcdm_bill_line;" 
 
# correct result
mariadb $DATABASE -q -e "select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' group by SERV_YR order by SERV_YR"
 
# wrong result
mariadb $DATABASE -q -e "select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' AND SERV_YR >= 2011 group by SERV_YR order by SERV_YR"

See private notes for credentials and bucket for test data.
Also test with the full 800 mill dataset

Right Result from above

 
+---------+-----------------------+
| SERV_YR | count(*)              |
+---------+-----------------------+
|       0 |                   905 |
|     123 |                     1 |
|     932 |                     2 |
|    1901 |                     7 |
|    1907 |                     1 |
|    1920 |                     1 |
|    1969 |                     2 |
|    1981 |                     2 |
|    1987 |                     6 |
|    1988 |                     3 |
|    1989 |                     2 |
|    1990 |                     9 |
|    1992 |                     2 |
|    1993 |                     4 |
|    1994 |                     3 |
|    1995 |                    10 |
|    1996 |                    14 |
|    1997 |                    20 |
|    1998 |                    25 |
|    1999 |                    33 |
|    2000 |                    16 |
|    2001 |                    59 |
|    2002 |                   100 |
|    2003 |                   118 |
|    2004 |                   488 |
|    2005 |                  2168 |
|    2006 |                  3684 |
|    2007 |                  6258 |
|    2008 |                 18524 |
|    2009 |                 62637 |
|    2010 |               1253499 |
|    2011 |               1684772 |
|    2012 |               1595403 |
|    2013 |               1542136 |
|    2014 |               1416396 |
|    2015 |               1181466 |
|    2016 |                958816 |
|    2017 |                911336 |
|    2018 |                930263 |
|    2019 |                932707 |
|    2020 |                761918 |
|    2021 |                749684 |
|    2022 |                530476 |
|    2023 |                   362 |
+---------+-----------------------+

Wrong Result

+---------+-----------------------+
| SERV_YR | count(*)              |
+---------+-----------------------+
|    2011 |               8759604 |
|    2012 |               8478452 |
|    2013 |               8242161 |
|    2014 |               7697444 |
|    2015 |               6473961 |
|    2016 |               5255764 |
|    2017 |               4948912 |
|    2018 |               5063151 |
|    2019 |               5055187 |
|    2020 |               4003684 |
|    2021 |               3929401 |
|    2022 |               2770433 |
|    2023 |                  1237 |
+---------+-----------------------+



 Comments   
Comment by alexey vorovich (Inactive) [ 2023-06-29 ]

denis0x0D please take a look at that.
dleeyh could u scan our mtr tests to look for a similiar test

Comment by Daniel Lee (Inactive) [ 2023-07-26 ]

Build tested: develop branch

engine: 65cde8c8945398c9a2f9b605ba09bd2515c77680
server: 62d6100a913699fec9ff48284a76bfe6226e70bc
buildNo: 8262

Reproduced the reported issue in release 23.02.04 and verified the fix. The new test result matches that of the InnoDB engine.

It is a great news that the performance of these two queries on ColumnStore are 760 and 459 times faster over InnoDB.

I was unable to set the mentioned session variable.

MariaDB [test]> set columnstore_max_pm_join_result_count=1;
ERROR 1193 (HY000): Unknown system variable 'columnstore_max_pm_join_result_count'

ColumnStore result - release 23.02.4

MariaDB [test]> select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' group by SERV_YR order by SERV_YR;
+---------+----------+
| SERV_YR | count(*) |
+---------+----------+
|    2005 |    12120 |
|    2006 |     9170 |
|    2007 |    10699 |
|    2008 |    11519 |
|    2009 |    12543 |
|    2010 |    10609 |
|    2011 |  1790704 |
|    2012 |  1782953 |
|    2013 |  1771222 |
|    2014 |    23555 |
|    2015 |    23149 |
|    2016 |    25115 |
|    2017 |    23822 |
|    2018 |    26874 |
|    2019 |    28462 |
|    2020 |    19259 |
|    2021 |    28140 |
|    2022 |    24688 |
|    2023 |     9770 |
+---------+----------+
19 rows in set (0.442 sec)
 
MariaDB [test]> 
MariaDB [test]> select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' AND SERV_YR >= 2011 group by SERV_YR order by SERV_YR;
+---------+----------+
| SERV_YR | count(*) |
+---------+----------+
|    2011 | 18283970 |
|    2012 | 18198957 |
|    2013 | 18081108 |
|    2014 |   240324 |
|    2015 |   236094 |
|    2016 |   256952 |
|    2017 |   242896 |
|    2018 |   274923 |
|    2019 |   290595 |
|    2020 |   197131 |
|    2021 |   287291 |
|    2022 |   253189 |
|    2023 |    99570 |
+---------+----------+
13 rows in set (4.711 sec)

ColumnStore result - with fix

MariaDB [test]> select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' group by SERV_YR order by SERV_YR;
+---------+----------+
| SERV_YR | count(*) |
+---------+----------+
|    2005 |    12120 |
|    2006 |     9170 |
|    2007 |    10699 |
|    2008 |    11519 |
|    2009 |    12543 |
|    2010 |    10609 |
|    2011 |  1790704 |
|    2012 |  1782953 |
|    2013 |  1771222 |
|    2014 |    23555 |
|    2015 |    23149 |
|    2016 |    25115 |
|    2017 |    23822 |
|    2018 |    26874 |
|    2019 |    28462 |
|    2020 |    19259 |
|    2021 |    28140 |
|    2022 |    24688 |
|    2023 |     9770 |
+---------+----------+
19 rows in set (0.427 sec)
 
MariaDB [test]> select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' AND SERV_YR >= 2011 group by SERV_YR order by SERV_YR;
+---------+----------+
| SERV_YR | count(*) |
+---------+----------+
|    2011 |  1790704 |
|    2012 |  1782953 |
|    2013 |  1771222 |
|    2014 |    23555 |
|    2015 |    23149 |
|    2016 |    25115 |
|    2017 |    23822 |
|    2018 |    26874 |
|    2019 |    28462 |
|    2020 |    19259 |
|    2021 |    28140 |
|    2022 |    24688 |
|    2023 |     9770 |
+---------+----------+
13 rows in set (0.680 sec)

InnoDB results

MariaDB [mytest]> select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' group by SERV_YR order by SERV_YR;
 
+---------+----------+
| SERV_YR | count(*) |
+---------+----------+
|    2005 |    12120 |
|    2006 |     9170 |
|    2007 |    10699 |
|    2008 |    11519 |
|    2009 |    12543 |
|    2010 |    10609 |
|    2011 |  1790704 |
|    2012 |  1782953 |
|    2013 |  1771222 |
|    2014 |    23555 |
|    2015 |    23149 |
|    2016 |    25115 |
|    2017 |    23822 |
|    2018 |    26874 |
|    2019 |    28462 |
|    2020 |    19259 |
|    2021 |    28140 |
|    2022 |    24688 |
|    2023 |     9770 |
+---------+----------+
19 rows in set (5 min 26.324 sec)
 
MariaDB [mytest]> 
MariaDB [mytest]> select SERV_YR, count(*) from (SELECT A.SERV_YR, A.SERV_CTG_CD FROM mdcdm_bill_line A LEFT JOIN lu_proc B ON A.PROC_CD_DER = B.PROC_CD AND A.PROC_GRP_CD_DER = B.PROC_GRP_CD) j WHERE SERV_CTG_CD = '11' AND SERV_YR >= 2011 group by SERV_YR order by SERV_YR;
+---------+----------+
| SERV_YR | count(*) |
+---------+----------+
|    2011 |  1790704 |
|    2012 |  1782953 |
|    2013 |  1771222 |
|    2014 |    23555 |
|    2015 |    23149 |
|    2016 |    25115 |
|    2017 |    23822 |
|    2018 |    26874 |
|    2019 |    28462 |
|    2020 |    19259 |
|    2021 |    28140 |
|    2022 |    24688 |
|    2023 |     9770 |
+---------+----------+
13 rows in set (5 min 12.662 sec)

Comment by alexey vorovich (Inactive) [ 2023-07-26 ]

dleeyh denis0x0D
One of the PR that added variable support is not merged yet

Comment by Daniel Lee (Inactive) [ 2023-08-07 ]

Build verified: latest develop branch

engine: f2affca162232063481a2f6189db3ab08c28cfaa
server: 62d6100a913699fec9ff48284a76bfe6226e70bc
buildNo: 8342

Generated at Thu Feb 08 02:58:30 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.