[MCOL-4498] LIKE is not collation aware Created: 2021-01-15  Updated: 2021-04-02  Resolved: 2021-04-02

Status: Closed
Project: MariaDB ColumnStore
Component/s: ExeMgr, PrimProc
Affects Version/s: None
Fix Version/s: 5.6.1

Type: Bug Priority: Critical
Reporter: Alexander Barkov Assignee: Roman
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Relates
relates to MCOL-495 Make string comparison not case sensi... Closed
relates to MCOL-3536 Order by with UTF Closed
relates to MCOL-4064 Make JOIN collation aware Closed
relates to MCOL-4065 DISTINCT is case sensitive Closed
relates to MCOL-4388 Equality does not respect the NOPAD c... Closed
relates to MCOL-4417 Non-equality comparison operators do ... Closed
relates to MCOL-4499 NOT LIKE is not compatible with Maria... Open
relates to MCOL-4539 WHERE short_char_column='literal' ign... Closed
Sprint: 2021-2, 2021-3, 2021-4, 2021-5

 Description   

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(3) CHARACTER SET utf8) ENGINE=ColumnStore;
INSERT INTO t1 VALUES ('A'),('a');
SELECT * FROM t1 WHERE a LIKE 'a';

+------+
| a    |
+------+
| a    |
+------+

Notice, it returns only one row ignoring the fact that the collations is utf8_general_ci, which is case insensitive.

It should return two rows like MyISAM does:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (a CHAR(3) CHARACTER SET utf8) ENGINE=MyISAM;
INSERT INTO t1 VALUES ('A'),('a');
SELECT * FROM t1 WHERE a LIKE 'a';

+------+
| a    |
+------+
| A    |
| a    |
+------+



 Comments   
Comment by Alexander Barkov [ 2021-03-04 ]

Hello dleeyh,

When doing QA for this task can you please check if the change affects performance?

In MariaDB Server, the code from the collation library behind the LIKE operator is about twice faster than RLIKE:

MariaDB [test]> SELECT  BENCHMARK(10000000, 'aaa' LIKE 'aaa');
+---------------------------------------+
| BENCHMARK(10000000, 'aaa' LIKE 'aaa') |
+---------------------------------------+
|                                     0 |
+---------------------------------------+
1 row in set (0.405 sec)

MariaDB [test]> SELECT  BENCHMARK(10000000, 'aaa' RLIKE '^aaa$');
+------------------------------------------+
| BENCHMARK(10000000, 'aaa' RLIKE '^aaa$') |
+------------------------------------------+
|                                        0 |
+------------------------------------------+
1 row in set (0.906 sec)

As this change in ColumnStore replaces the regex based LIKE implementation to the MariaDB collation library implementation, I also expect near 2 times performance improvement. It would be nice to verify this.

Thank you!

Comment by Daniel Lee (Inactive) [ 2021-03-23 ]

Build tested: 5.6.1 (Drone #1915)

The issue has been fixed.

MariaDB [mytest]> DROP TABLE IF EXISTS t1;
Query OK, 0 rows affected, 1 warning (0.022 sec)
 
MariaDB [mytest]> CREATE TABLE t1 (a CHAR(3) CHARACTER SET utf8) ENGINE=ColumnStore;
Query OK, 0 rows affected (0.148 sec)
 
MariaDB [mytest]> INSERT INTO t1 VALUES ('A'),('a');
Query OK, 2 rows affected (0.237 sec)
Records: 2  Duplicates: 0  Warnings: 0
 
MariaDB [mytest]> SELECT * FROM t1 WHERE a LIKE 'a';
+------+
| a    |
+------+
| A    |
| a    |
+------+
2 rows in set (0.054 sec)

Still need to check performance

Comment by Daniel Lee (Inactive) [ 2021-03-23 ]

MariaDB [(none)]> SELECT  BENCHMARK(10000000, 'aaa' LIKE 'aaa');
+---------------------------------------+
| BENCHMARK(10000000, 'aaa' LIKE 'aaa') |
+---------------------------------------+
|                                     0 |
+---------------------------------------+
1 row in set (1.557 sec)
 
MariaDB [(none)]> SELECT  BENCHMARK(10000000, 'aaa' RLIKE '^aaa$');
+------------------------------------------+
| BENCHMARK(10000000, 'aaa' RLIKE '^aaa$') |
+------------------------------------------+
|                                        0 |
+------------------------------------------+
1 row in set (2.940 sec)

Comment by Alexander Barkov [ 2021-03-24 ]

Sorry, I must have not been precise enough about benchmarking.
The goal is to test how the patch affects the LIKE operator performance in ColumnStore tables.
I think it could be done as follows::

  • Create a ColumnStore table with a huge amount of data
  • Run a query involving LIKE using a binary without this patch (using any latest develop-5 binary), e.g:

    SELECT * FROM huge_table WHERE some_column LIKE '... some pattern ..';
    

    Or the query can use COUNT to reduce the overhead needed to send values over the network:

    SELECT COUNT(*) FROM huge_table WHERE some_column LIKE '... some pattern ..';
    

  • Run the same query using a binary with the patch for MCOL-4498
  • Measure the performance difference
  • Repeat the above procedure for TEXT, VARCHAR, long CHAR and short CHAR columns

The tests that you pasted in the last comment with queries like this:

SELECT BENCHMARK(10000000, 'aaa' LIKE 'aaa');

do not involve ColumnStore at all. They are run on the server side completely and don't demonstrate how the patch affected LIKE performance in ColumnStore. I gave them only to show that in MariaDB, LIKE is faster than RLIKE. Therefore, in ColumnStore new LIKE implementation will hopefully be also faster than the old regex based one.

Comment by Alexander Barkov [ 2021-03-29 ]

drrtuy, benchmarks showed a good performance improvement. Please continue with the review. Thanks.

Comment by Roman [ 2021-04-02 ]

Closing b/c dleeyh had tested the patch.

Generated at Thu Feb 08 02:50:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.