[MCOL-4800] IN clause wrong results -add error msg Created: 2021-07-06 Updated: 2023-11-17 Resolved: 2022-08-10 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 5.5.2, 5.6.5, 6.2.1 |
| Fix Version/s: | 22.08.1 |
| Type: | Bug | Priority: | Major |
| Reporter: | Todd Stoffel (Inactive) | Assignee: | David Hall (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | None | ||
| Attachments: |
|
| Sprint: | 2021-17 |
| Assigned for Review: | |
| Assigned for Testing: | |
| Description |
|
When specifying 100,000 values in SELECT ... WHERE ... IN (1, 2, 10, ...), only 34,464 rows are returned. If the table does not specify ENGINE=ColumnStore, it correctly returns 100,000 rows. Is there a limit on the number of values IN() accepts in ColumnStore? (Max 65,535?) The MariaDB version is 10.5.11. |
| Comments |
| Comment by Eugênio Pacceli Reis da Fonseca [ 2021-08-26 ] | ||||||||||||||||||||||||
|
I can testify to this unreliable IN clause behavior. Many of our queries using col IN (val1, val2, ...) were producing incorrect results when compared to the equivalent query using the col = val1 OR col = val2 OR col = val3 expression. The latter comes up with the correct results. We have worked around this issue in our ColumnStore client/server app by replacing every x IN (a, b, c, ...) with x = a OR x = b OR x = c ... in our query builder. The IN clause in ColumnStore is not reliable as of now! Our AND x IN (vals...) clause builder - implemented with (x = vals OR ...) - in TypeScript:
| ||||||||||||||||||||||||
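The original TypeScript snippet was not preserved in this export. A minimal sketch of the workaround described above, assuming values are already escaped or numeric (function and parameter names are illustrative, not from the ticket):

```typescript
// Hypothetical sketch: expand `col IN (v1, v2, ...)` into
// `(col = v1 OR col = v2 OR ...)` to sidestep the ColumnStore IN bug.
// Names are illustrative; assumes string values are pre-escaped.
function inClauseAsOrChain(column: string, values: (string | number)[]): string {
  if (values.length === 0) return "FALSE"; // an empty IN list matches nothing
  const terms = values.map(v =>
    typeof v === "number" ? `${column} = ${v}` : `${column} = '${v}'`
  );
  return `(${terms.join(" OR ")})`;
}

// e.g. inClauseAsOrChain("x", [1, 2, 3]) yields "(x = 1 OR x = 2 OR x = 3)"
```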
| Comment by Roman [ 2021-08-27 ] | ||||||||||||||||||||||||
|
eugenio.pacceli Greetings! Internally, MCS turns col IN (val1, val2, ..., valX) into a filter expression equivalent to the one you use in your workaround, so I presume the incorrect filtering happens because the filter expression sent from the MDB side to ExeMgr is truncated. I wonder if you have tried a binary search to find the threshold after which MCS delivers wrong results? | ||||||||||||||||||||||||
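The binary search Roman suggests could be sketched as follows. This is a hypothetical probe, not code from the ticket: `countRows(n)` stands in for running the query with an IN list of n values and returning the row count.

```typescript
// Hypothetical threshold finder: `countRows(n)` is assumed to run the
// query with an IN list of n distinct matching values and return the
// number of rows actually returned.
function findThreshold(
  countRows: (n: number) => number,
  lo: number, // a size known to return correct results
  hi: number  // a size known to return wrong results
): number {
  while (hi - lo > 1) {
    const mid = Math.floor((lo + hi) / 2);
    if (countRows(mid) === mid) lo = mid; // still correct at mid
    else hi = mid;                        // already wrong at mid
  }
  return lo; // largest IN-list size that still returns correct results
}
```

Against a probe that truncates the count modulo 2^16 (the behavior later identified in this ticket), this search converges on 65,535.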
| Comment by alexey vorovich (Inactive) [ 2022-06-29 ] | ||||||||||||||||||||||||
|
There are 2 issues here | ||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2022-07-26 ] | ||||||||||||||||||||||||
|
@dlee Please comment on this. I think we can close it but I'm not sure. | ||||||||||||||||||||||||
| Comment by alexey vorovich (Inactive) [ 2022-07-27 ] | ||||||||||||||||||||||||
|
David.Hall, is this the right ticket? There is no PR here. | ||||||||||||||||||||||||
| Comment by David Hall (Inactive) [ 2022-07-29 ] | ||||||||||||||||||||||||
|
I tracked down the problem: the count of the number of items is stored in a 16-bit unsigned integer, which maxes out at 65,535. The code doesn't break (no segv or error of any sort) because the filters themselves are stored in a bytestream, which can handle streams of arbitrary size (within memory constraints). But the count rolls over, so 65,536 becomes 0, and all of those filters are ignored. Thinking this could be an easy fix, I tried changing it to a 32-bit int, but soon discovered that in some code paths (meaning for some queries) this number is tied to the number of RIDs in a block, which is also stored in a 16-bit unsigned. Rather than chase the rabbit down this possibly very deep hole, I backed out and added a check that errors out if the number of items is > 65,535. In truth, we shouldn't need to support any more than this; really, any count greater than 1,000 is inefficient. Such a query should be reworked to hold the values in a temp table and use a join or subselect rather than "IN" (or "NOT IN", "BETWEEN", etc.). | ||||||||||||||||||||||||
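The rollover David describes can be demonstrated with a 16-bit unsigned store; this is a minimal sketch, not the actual MCS code:

```typescript
// Minimal demo of the bug's arithmetic: storing a count in a field that
// behaves like a 16-bit unsigned integer truncates it modulo 2^16.
function storeCount(n: number): number {
  const c = new Uint16Array(1);
  c[0] = n; // JS ToUint16 conversion: value is taken modulo 65,536
  return c[0];
}
```

Note that this also explains the exact figure in the original report: 100,000 mod 65,536 = 34,464, the number of rows the reporter saw returned.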
| Comment by Daniel Lee (Inactive) [ 2022-08-10 ] | ||||||||||||||||||||||||
|
Build verified: 22.08-1 (#5243). Verified the reported test scenario: [root@s0 data]# mariadb mytest < sample.sql | wc -l |