[MCOL-1559] Regression on working_tpch1/misc/bug3669 string pad compare not correct Created: 2018-07-13  Updated: 2020-04-01  Resolved: 2020-02-12

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 1.0.0
Fix Version/s: 1.2.6, 1.4.3

Type: Bug Priority: Minor
Reporter: David Hall (Inactive) Assignee: David Hall (Inactive)
Resolution: Done Votes: 0
Labels: None

Issue Links:
Blocks
is blocked by MCOL-3395 regression: dictionary de-duplication... Closed
Duplicate
is duplicated by MCOL-1673 Different trailing behaviour between ... Closed
is duplicated by MCOL-1831 multibyte space (zenkaku) breaks stri... Closed
PartOf
is part of MCOL-2161 Fix remaining errors in working_tpch1 Closed
Relates
relates to MCOL-3778 Replace glibc with google's re2 for r... Closed
Sprint: 2019-04, 2019-06

 Description   

Probably because of a change made in MCOL-1403.

Using a padded string (i.e., "abc ") in a where clause should compare favorably with a column that has no padding. It no longer works that way.



 Comments   
Comment by David Hall (Inactive) [ 2019-03-11 ]

Regression test cases:
Compare failed - working_tpch1/misc/bug3669.sql
Compare failed - working_tpch1/misc/MCOL-1246.sql

Comment by David Hall (Inactive) [ 2019-04-10 ]

Since I used MCOL-1559 for the code branch, it's easier to reopen this and close 1673 as the duplicate.

Comment by David Hall (Inactive) [ 2019-04-10 ]

Your comment in MCOL-1673 said maybe undo the change for MCOL-1403 with a caveat for LIKE. Unfortunately, we still have the issue that MariaDB has different behavior for CHAR and VARCHAR. Why this is so is beyond my paygrade. Also, trimming the incoming constant doesn't account for when the column value has trailing spaces. So I came up with another, albeit slightly less efficient, solution that gets (I hope) all the bases.

Comment by Daniel Lee (Inactive) [ 2019-04-19 ]

Build verified: 1.2.4-1 nightly
[dlee@master centos7]$ cat gitversionInfo.txt
server commit:
137b9a8
engine commit:
b3a7559

Reproduced the issue in 1.1.5-1

Verified the fix in 1.2.4-1

MariaDB [mytest]> create table stringtest (c1 char(10), c2 varchar(10), c3 varchar(6))engine=columnstore;
Query OK, 0 rows affected (0.363 sec)

MariaDB [mytest]> insert into stringtest values ('abc','cde','abc'), ('cde','abc','cde');
Query OK, 2 rows affected (0.345 sec)
Records: 2 Duplicates: 0 Warnings: 0

MariaDB [mytest]> select * from stringtest where c1='abc' or c2='abc';
--------------

c1 c2 c3

--------------

abc cde abc
cde abc cde

--------------
2 rows in set (0.115 sec)

MariaDB [mytest]> select * from stringtest where c1= 'abc ';
--------------

c1 c2 c3

--------------

abc cde abc

--------------
1 row in set (0.020 sec)

MariaDB [mytest]> select * from stringtest where c2='abc ';
--------------

c1 c2 c3

--------------

cde abc cde

--------------
1 row in set (0.018 sec)

MariaDB [mytest]> select * from stringtest where c1='abc ' or c2='abc ';
--------------

c1 c2 c3

--------------

abc cde abc
cde abc cde

--------------
2 rows in set (0.020 sec)

MariaDB [mytest]> select * from stringtest where c1 = 'abc' or c3='cde ';
--------------

c1 c2 c3

--------------

abc cde abc
cde abc cde

--------------
2 rows in set (0.026 sec)

MariaDB [mytest]> select concat('abc ', c2), concat('cde ', c3) from stringtest;
--------------------------------------+

concat('abc ', c2) concat('cde ', c3)

--------------------------------------+

abc cde cde abc
abc abc cde cde

--------------------------------------+
2 rows in set (0.012 sec)

Comment by David Hall (Inactive) [ 2019-05-22 ]

The fix causes a serious performance regression. We need to find a way to do this without string copy.

Comment by David Hall (Inactive) [ 2019-07-05 ]

Added a new fix.

QA: This needs to be checked for performance, as the last fix hurt in this area.

Comment by David Hall (Inactive) [ 2019-07-23 ]

QA:
The compare of strings with trailing blanks is confusing. Given a specific query, you ask, "Should this compare or not?" The answer is, it depends.

If comparing to a constant (the most common of filters), these are the rules:

1. If the column is a VARCHAR or a CHAR, a compare for an operator (=, <, >, etc) should be true if there are trailing spaces in the column data or in the compare data or both.

2. For CHAR and VARCHAR, if the column value has no trailing space, then it should compare false for LIKE if the constant has a trailing space and true if the constant has no trailing space.

3. For CHAR, if the column value has a trailing space, then it should compare false for LIKE if the constant has a trailing space and would otherwise compare (go figure).

4. For VARCHAR, if the column value has a trailing space, then it should compare true for LIKE if the constant has a trailing space and would otherwise compare.

5. For CHAR, if the column value has a trailing space, then it should compare true for LIKE if the constant has no trailing space and otherwise compares.

6. For VARCHAR, if the column value has a trailing space, then it should compare false for LIKE if the constant has no trailing space and otherwise compares.

Truth table for LIKE

Type Column Compare value Compare result
CHAR N N Y
CHAR N Y N
CHAR Y N Y
CHAR Y Y N
VARCHAR N N Y
VARCHAR N Y N
VARCHAR Y N N
VARCHAR Y Y Y
Comment by David Hall (Inactive) [ 2019-07-24 ]

Regression test queries/working_tpch1_compareLogOnly/misc/MCOL-1559.sql has been added.

Comment by David Hall (Inactive) [ 2019-07-24 ]

I changed the code to always use boost::regex_match() rather than #ifdefing to regexec(). I'm not sure why the difference was made. Perhaps because of performance (regex == faster?).

Removed a memcpy of the string into a stack allocated array. This seems wasteful, and is surely dangerous if the string was long and blew up the stack. And the code was not portable, using a gcc extension for variable length arrays (VLA). Future versions of gcc will not allow VLA with this syntax. We could do the same thing with alloca, but the copy and potential stack problems say we probably shouldn't.

// Ugh. The strings returned by getStrVal have null padding out to the col width. boost::regex
// considers these nulls significant, but they're not in the pattern, so we need to strip
// them off...

This comment for this bit of code makes no sense. Perhaps in the past there was no version of regexec or regex_match that took a null-terminated string, so it was needed? Basically, the string was copied and then put into a new std::string (making two physical copies), ostensibly with only one NULL.

But we can pass in a NULL terminated string to regexec or to boost::regex_match, so what the what? std::string.c_str() works just fine.

For 1.4, this should all be changed to use std::regex_match (C++11).

Comment by David Hall (Inactive) [ 2019-07-24 ]

Don't forget about performance testing here. Do we want Daniel to do his thing before we merge?

Comment by Daniel Lee (Inactive) [ 2019-12-10 ]

Build tested: 1.2.6-1

engine commit:
d4173ef

Reproduced the issue in bug3669.sql and verified the issue in 1.2.6-1

MariaDB [mytest]> select * from stringtest where c1='abc' or c2='abc';
--------------

c1 c2 c3

--------------

abc cde abc
cde abc cde

--------------
2 rows in set (0.100 sec)

MariaDB [mytest]> select * from stringtest where c1= 'abc ';
--------------

c1 c2 c3

--------------

abc cde abc

--------------
1 row in set (0.026 sec)

MariaDB [mytest]> select * from stringtest where c2='abc ';
--------------

c1 c2 c3

--------------

cde abc cde

--------------
1 row in set (0.023 sec)

MariaDB [mytest]> select * from stringtest where c1='abc ' or c2='abc ';
--------------

c1 c2 c3

--------------

abc cde abc
cde abc cde

--------------
2 rows in set (0.027 sec)

MariaDB [mytest]> select * from stringtest where c1 = 'abc' or c3='cde ';
--------------

c1 c2 c3

--------------

abc cde abc
cde abc cde

--------------
2 rows in set (0.028 sec)

MariaDB [mytest]> select concat('abc ', c2), concat('cde ', c3) from stringtest;
--------------------------------------+

concat('abc ', c2) concat('cde ', c3)

--------------------------------------+

abc cde cde abc
abc abc cde cde

--------------------------------------+
2 rows in set (0.017 sec)

Comment by Daniel Lee (Inactive) [ 2019-12-10 ]

Did a performance comparison for concat() on integer, date, char-1byte, char-dictionary for 1gb database lineitem table. On average, the fix is about 8% slower.

Comment by Daniel Lee (Inactive) [ 2019-12-10 ]

Opened ticket MCOL-3661 for further performance investigation. Closing this ticket now.

Comment by Gregory Dorman (Inactive) [ 2020-02-07 ]

This needs to be reopened. The fix as made has created an unacceptable performance impact.

Comment by David Hall (Inactive) [ 2020-02-12 ]

This was placed in stalled because of some performance concerns. MCOL-3661 addressed those concerns, so this can be closed.

Generated at Thu Feb 08 02:29:40 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.