[MCOL-4800] IN clause wrong results -add error msg Created: 2021-07-06  Updated: 2023-11-17  Resolved: 2022-08-10

Status: Closed
Project: MariaDB ColumnStore
Component/s: None
Affects Version/s: 5.5.2, 5.6.5, 6.2.1
Fix Version/s: 22.08.1

Type: Bug Priority: Major
Reporter: Todd Stoffel (Inactive) Assignee: David Hall (Inactive)
Resolution: Fixed Votes: 2
Labels: None

Attachments: File hourly_sample.tsv.gz    
Sprint: 2021-17
Assigned for Review: Roman Roman
Assigned for Testing: Daniel Lee Daniel Lee (Inactive)

 Description   

When specifying 100,000 values ​​in SELECT WHERE IN(1,2,10,), only 34,464 rows are returned. If the table does not specify Engine = Columnstore, it returns 100,000 rows correctly.

Is there a limit to the number of IN() in ColumnStore?γ€€

(MAX 65,535?) MariaDB Version is 10.5.11.



 Comments   
Comment by Eugênio Pacceli Reis da Fonseca [ 2021-08-26 ]

I can testify to this IN clause unreliable behavior. Many of our queries using col IN (val1, val2, ...) were presenting incorrect results when compared to the equivalent query using the col=val1 OR col=val2 OR col=val3 expression. The later comes up with the correct results.

We have solved this kind of issue in our ColumnStore client server app by replacing every x IN (a,b,c...) by x=a OR x=b OR x=c .... in our query builder. The IN clause in ColumnStore is not reliable as of now!

Our AND x IN (vals...) clause - implemented with (x=vals OR ...) - builder, TypeScript:

    andInIfNotEmpty(columnName: string, arr: Array<any>): string {
        const safeColName = escape(columnName);
        if (arr != null && arr.length > 0) {
            const treated = arr.map(element => {
                if (typeof element == 'string') {
                    return `'${escape(element)}'`; // String elements escaped and between ''
                } else if (element != null) {
                    return element; // Others stay raw
                }
                return -1; // Invalid element
            });
 
            let orEquals = "";
 
            for (const e of treated) {
                orEquals += `${safeColName}=${e} OR `;
            }
 
            orEquals = orEquals.substring(0, orEquals.length - 4); // removes last ' OR '
 
            return `AND (${orEquals}) `;
        }
        return '';
    }

Comment by Roman [ 2021-08-27 ]

eugenio.pacceli Greetings! Internally MCS turns col IN (val1, val2..., valX) into a filter expression that equals to that you use in a workaround so I presume the incorrect filtering happens b/c of the truncation of the filter expression sent from MDB side into ExeMgr. I wonder if you tried binary search to get the threshold after which MCS delivers wrong results?

Comment by alexey vorovich (Inactive) [ 2022-06-29 ]

There are 2 issues here
-support of long list(unlimited lists ??/) . Agree with David on the correct approach with tmp table. Also see for comparison http://www.dba-oracle.com/t_maximum_number_of_sql_in_list_values.htm
-do we produce an error msg? allen.herrera if not, lets open a ticket for that

Comment by David Hall (Inactive) [ 2022-07-26 ]

@dlee Please comment on this. I think we can close it but I'm not sure.

Comment by alexey vorovich (Inactive) [ 2022-07-27 ]

David.Hall , is tis the right ticket ? There is no PR here

Comment by David Hall (Inactive) [ 2022-07-29 ]

I tracked down the problem to the count of the number of items being stored in a 16 bit unsigned. This maxes out at 65535 count. The code doesn't break (segv or error of some sort) because the filters themselves are being stored in a bytestream which can handle arbitrary sized streams (within memory constraints). But the count rolls over so 65536 becomes 0, and all those filters are ignored.

Thinking this could be an easy fix, I tried to change it to a 32 bit int, but soon discovered that, in some code paths (meaning for some queries), this number is tied to the number of RIDs in a block, which is stored in a 16 bit unsigned. Rather than chase the rabbit down this possibly very deep hole, I backed out and added an edit to error out if the number of items is > 65535.

In truth, we shouldn't need to support any more than this. Really, any count > than 1000 is inefficient. The query should be re-worked to use a temp table to hold the values and a join or subselect used rather than "IN" (or "NOT IN", or "BETWEEN", etc.)

Comment by Daniel Lee (Inactive) [ 2022-08-10 ]

Build verified: 22.08-1 (#5243)

Verified reported test scenario:

[root@s0 data]# mariadb mytest < sample.sql|wc -l
ERROR 1815 (HY000) at line 1: Internal error: MCS-1001: Function 'in with argument count > 65535' isn't supported.
0

Generated at Thu Feb 08 02:53:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.