[MCOL-3395] regression: dictionary de-duplication cache bleeding between columns Created: 2019-06-27 Updated: 2019-10-28 Resolved: 2019-07-05 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | PrimProc |
| Affects Version/s: | 1.2.4 |
| Fix Version/s: | 1.2.5 |
| Type: | Bug | Priority: | Critical |
| Reporter: | David Hall (Inactive) | Assignee: | Daniel Lee (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Issue Links: |
|
| Sprint: | 2019-06 |
| Description |
|
Consider the following table (from misc/bug3669):
create table stringtest (c1 char(10), c2 varchar(10), c3 varchar(6)) engine=columnstore;
In the first, we're using a char field < 7 wide. In the third, we're using a varchar field < 7 wide. In ColumnStore 1.1, we get:
|
| Comments |
| Comment by David Hall (Inactive) [ 2019-07-01 ] |
|
A new development: This works:
This does not work:
|
| Comment by David Hall (Inactive) [ 2019-07-01 ] |
|
Tested cpimport and it breaks:
st1.tbl:
[root@srvhall04 queries]# cpimport tpch1 st1 /home/calpont/st1.tbl
Using table OID 37916 as the default JOB ID
[root@srvhall04 queries]#
MariaDB [tpch1]> select * from st1;
------
------
MariaDB [tpch1]> select * from st1 where c2='abc'; |
| Comment by Andrew Hutchings (Inactive) [ 2019-07-02 ] |
|
The regression was introduced between 1.2.3 and 1.2.4 and seems to affect only compressed tables. |
| Comment by Andrew Hutchings (Inactive) [ 2019-07-02 ] |
|
Regression caused by |
| Comment by Andrew Hutchings (Inactive) [ 2019-07-03 ] |
|
PR in engine and regression suite. The cause is the new dictionary de-duplication code in 1.2.4: the cache persists between columns in a single insert, so when different columns contain the same data, the token for the wrong column is returned. This patch clears the cache on soft close as well as hard close. For QA: a test has been added to the regression suite, and you can use the test in the description (the c2 WHERE condition should return a result). |
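As a rough illustration of the soft/hard close change, here is a minimal Python sketch. It is a toy model, not the engine's C++ code: `ToyDctnry`, its methods, the file names, and the `clear_on_soft_close` flag are all hypothetical, and the meaning of "soft" close (the object is kept and reused for the next column's file) versus "hard" close (final teardown) is an assumption made for this example.

```python
class ToyDctnry:
    """Toy stand-in for a reusable dictionary writer (hypothetical API)."""

    def __init__(self, clear_on_soft_close):
        self.clear_on_soft_close = clear_on_soft_close
        self.cache = {}        # de-duplication cache: value -> token
        self.written = {}      # file name -> values actually written there
        self.open_file = None

    def open(self, file_name):
        self.open_file = file_name
        self.written.setdefault(file_name, [])

    def close(self, hard=False):
        # Assumption: a hard close already dropped the cache before the
        # patch; the patch makes a soft close drop it as well.
        if hard or self.clear_on_soft_close:
            self.cache.clear()
        self.open_file = None

    def insert(self, value):
        # On a cache miss, write the value and remember its token.
        if value not in self.cache:
            values = self.written[self.open_file]
            values.append(value)
            self.cache[value] = (self.open_file, len(values) - 1)
        return self.cache[value]


def token_file_for_second_column(clear_on_soft_close):
    """Insert 'abc' into two columns and report which file c2's token names."""
    d = ToyDctnry(clear_on_soft_close)
    d.open("c1.dict")
    d.insert("abc")
    d.close()              # soft close: the object is reused for c2
    d.open("c2.dict")
    token = d.insert("abc")
    d.close(hard=True)
    return token[0]
```

With `clear_on_soft_close=False` (modelling 1.2.4) the token handed out for c2 still names c1's file; with `True` (modelling the patch) it names c2's own file.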
| Comment by Andrew Hutchings (Inactive) [ 2019-07-03 ] |
|
Full explanation sent to David Hall: When a dictionary write happens, the Dctnry class is used to check/insert into the de-duplication cache, write the dictionary data if required, and return the token. The problem arises when the Dctnry class is reused: it just opens a new file and re-loads the cache. With the new de-duplication code in 1.2.4, the cache is no longer cleared on load. Let's take your insert query as an example:
First, the Dctnry class is used to insert 'abc' and 'cde' into c1, add them to the de-duplication cache, and return tokens. Then 'cde' and 'abc' are inserted into c2. The de-duplication cache isn't cleared on file open, so the Dctnry class gets cache hits and returns the tokens for c1's dictionary file. This means that on a basic select the data is there, because c2's token column points to the LBIDs/offsets of c1's dictionary. But when you try to scan c2's dictionary with that WHERE condition to get tokens, the file is empty. |
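The symptom described above (a plain select returns data, but the WHERE on c2 returns nothing) can be reproduced in a small Python model. This is only a sketch under simplifying assumptions: tokens are modelled as `(column, offset)` pairs rather than LBIDs/offsets, each dictionary file is a plain list, and the function names are invented for illustration. The two rows use the values from the explanation ('abc', 'cde' into c1; 'cde', 'abc' into c2).

```python
def insert_rows(rows, clear_cache_between_columns):
    """Write rows column by column through one shared de-duplication cache."""
    dict_files = {}   # column -> list of strings (that column's dictionary)
    token_cols = {}   # column -> list of (dictionary column, offset) tokens
    cache = {}        # value -> token, shared across columns (the bug)
    for col in rows[0].keys():
        if clear_cache_between_columns:
            cache.clear()          # models the patched behaviour
        dict_files[col] = []
        token_cols[col] = []
        for row in rows:
            value = row[col]
            if value not in cache:
                dict_files[col].append(value)
                cache[value] = (col, len(dict_files[col]) - 1)
            token_cols[col].append(cache[value])
    return dict_files, token_cols


def select_all(dict_files, token_cols, col):
    """Plain select: follow each token to wherever it points."""
    return [dict_files[src][off] for src, off in token_cols[col]]


def select_where(dict_files, token_cols, col, wanted):
    """Filtered select: scan col's OWN dictionary for matching tokens,
    then return the row indexes whose token is among them."""
    matching = {(col, off)
                for off, value in enumerate(dict_files[col])
                if value == wanted}
    return [i for i, tok in enumerate(token_cols[col]) if tok in matching]


rows = [{"c1": "abc", "c2": "cde"}, {"c1": "cde", "c2": "abc"}]
buggy = insert_rows(rows, clear_cache_between_columns=False)
fixed = insert_rows(rows, clear_cache_between_columns=True)
```

In the buggy run, c2's dictionary stays empty because every value was a cache hit from c1, so `select_where(*buggy, "c2", "abc")` finds no rows even though `select_all(*buggy, "c2")` still returns both values. In the fixed run, the same filter finds the second row.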
| Comment by Daniel Lee (Inactive) [ 2019-07-05 ] |
|
Build verified: 1.2.5-1 nightly. Reproduced the issue in 1.2.4-1. |