[MDEV-31496] Make optimizer handle UCASE(varchar_col)=... Created: 2023-06-19  Updated: 2023-11-23  Resolved: 2023-09-12

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Fix Version/s: 11.3.0

Type: Task Priority: Critical
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 2
Labels: Preview_11.3

Attachments: File mdev31496-verify-utf8mb3.sql    
Issue Links:
Blocks
PartOf
Problem/Incident
Relates
relates to MDEV-31946 Optimizer handle UCASE(varchar_col)=.... Closed
relates to MDEV-31955 Make optimizer handle LCASE(varchar_c... Open
relates to MDEV-31975 UCASE(varchar_col)=... not handled f... Closed

 Description   

This is based on a customer case.

The issue

The customer query has a (repeating) fragment

SELECT ...
FROM 
  t1, t2
WHERE
  UCASE(t1.key_column)= UCASE(t2.column) AND ...

Both t1.key_column and t2.column use collation utf8mb3_general_ci.

There is an index on t1.key_column, and the above condition is highly-selective.

However the optimizer is not able to use it and has to compute a cross join between t1 and t2 and then apply the WHERE.

Another query uses UCASE with IN:

WHERE  UCASE(key_col) IN ('const1','const2', ...)

How to optimize this

In general, it is not possible to use an index to make lookups based on

UCASE(key_col)='foo'

as FOO, foo and FOo are not sorted together in case-sensitive collations.

It seems that case-insensitive collations should have this property:

For any a,b:   (UCASE(a)=UCASE(b))   <=>   (a=b)               (CI-UCASE-REMOVAL)

Making a rewrite according to this will solve the issue.

The bad news is that CI-UCASE-REMOVAL doesn't always hold. Some collations have exceptions to this rule. See examples below.

The good news is that utf8mb3_general_ci and utf8mb4_general_ci do have this property.

Exceptions

Exception #1: Czech collation

See https://collation-charts.org/mysql60/mysql604.utf8_czech_ci.html, CH, Ch and ch are considered a single letter which sorts between H and I. But the pair of characters "cH" is not treated in this way. The effect of this is:

CREATE OR REPLACE TABLE t2 (a VARCHAR(32) COLLATE utf8mb3_czech_ci);
INSERT INTO t2 VALUES ('ch'),('CH'),('Ch'),('cH');
SELECT t1.a, t2.a, t1.a=t2.a, UCASE(t1.a)=UCASE(t2.a)  FROM t2 t1, t2 t2 
WHERE (t1.a=t2.a) <> (UCASE(t1.a)=UCASE(t2.a));

+------+------+-----------+-------------------------+
| a    | a    | t1.a=t2.a | UCASE(t1.a)=UCASE(t2.a) |
+------+------+-----------+-------------------------+
| cH   | ch   |         0 |                       1 |
| cH   | CH   |         0 |                       1 |
| cH   | Ch   |         0 |                       1 |
| ch   | cH   |         0 |                       1 |
| CH   | cH   |         0 |                       1 |
| Ch   | cH   |         0 |                       1 |
+------+------+-----------+-------------------------+

Note that UCASE(t1.a)=UCASE(t2.a) holds while t1.a=t2.a doesn't. If it was other way around, we could use t1.a=t2.a for row retrieval and then check the UCASE(...)=UCASE(...) equality. But alas, if we try to use t1.a=t2.a to read rows we may miss rows. This makes the rewrite completely unapplicable.

Exception #2: Danish collation

https://collation-charts.org/mysql60/mysql604.utf8_danish_ci.html

CREATE OR REPLACE TABLE t4 (a VARCHAR(32) COLLATE utf8mb3_danish_ci);
INSERT INTO t4 VALUES ('aa'),('AA'),('Aa'),('aA');
SELECT t1.a, t4.a, t1.a=t4.a, UCASE(t1.a)=UCASE(t4.a)  FROM t4 t1, t4 t4
WHERE (t1.a=t4.a) <> (UCASE(t1.a)=UCASE(t4.a));

+------+------+-----------+-------------------------+
| a    | a    | t1.a=t4.a | UCASE(t1.a)=UCASE(t4.a) |
+------+------+-----------+-------------------------+
| aA   | aa   |         0 |                       1 |
| aA   | AA   |         0 |                       1 |
| aA   | Aa   |         0 |                       1 |
| aa   | aA   |         0 |                       1 |
| AA   | aA   |         0 |                       1 |
| Aa   | aA   |         0 |                       1 |
+------+------+-----------+-------------------------+

Exception #3: Turkish collation

CREATE OR REPLACE TABLE t1 (a VARCHAR(32) COLLATE utf8mb3_uca1400_turkish_as_ci);
INSERT INTO t1 VALUES ('I'),('i'),('İ'),('ı');
SELECT t1.a, t2.a, UCASE(t1.a), UCASE(t2.a), t1.a=t2.a, UCASE(t1.a)=UCASE(t2.a)
FROM t1 t1, t1 t2
WHERE (t1.a=t2.a) <> (UCASE(t1.a)=UCASE(t2.a));

+------+------+-------------+-------------+-----------+-------------------------+
| a    | a    | UCASE(t1.a) | UCASE(t2.a) | t1.a=t2.a | UCASE(t1.a)=UCASE(t2.a) |
+------+------+-------------+-------------+-----------+-------------------------+
| İ    | i    | İ           | İ           |         0 |                       1 |
| i    | İ    | İ           | İ           |         0 |                       1 |
+------+------+-------------+-------------+-----------+-------------------------+

The same can be observed on utf8mb3_uca1400_as_ci collation.

Further exceptions

Collations *uca1400_*as* are likely to have violations to the rule as well.

Doing the rewrite

We can follow the example of MDEV-8320 with its datetime condition rewrites and create another rewrite.

Introduce Item_func_eq::varchar_upper_cmp_transformer() and do the rewrite there.

Checking the criteria

  • the collation $COLL that is used to do the comparison supports ucase removal
  • The left side has form UCASE(table.col)
    • (UPPER() is a synonym for UCASE, no need for separate check)
  • table.col is a "character-based" type and uses the same collation $COLL.

if the above conditions are met, we can do the rewrite.

Doing the rewrite

  • Create a new Item_func_eq(table.col, ... )
  • Call fix_fields() for it and return it from the transformer function.

This seems to be more straightforward than trying to replace UPPER(col) with col. If we replace ourselves, we'll need to care about undoing the rewrite. Rewrites done by transformer will be rolled back automatically.

Note that

coercibility(table.col)=coercibility(upper(table.col))

so we will not change the semantics of the statement.

There is a problem with creating Item_func_in: it will copy and then call fix_fields() for the entire IN-list. Using Item_func_in::build_clone doesn't improve the situation much.



 Comments   
Comment by Sergei Petrunia [ 2023-06-19 ]

First patch: https://github.com/MariaDB/server/tree/bb-10.6-MDEV-31496

Comment by Sergei Golubchik [ 2023-06-19 ]
  • LCASE() too?
  • could you add a test case with a collation where UCASE cannot be removed?
Comment by Sergei Petrunia [ 2023-06-30 ]

Addressed all input from Bar:

commit 0d2f861f575bcc0b9c65590d579c0f7526d4542c (HEAD -> bb-10.6-MDEV-31496, origin/bb-10.6-MDEV-31496)
Author: Sergei Petrunia <sergey@mariadb.com>
Date:   Mon Jun 19 17:53:16 2023 +0300
 
    MDEV-31496: Make optimizer handle UCASE(varchar_col)=...

Comment by Sergei Petrunia [ 2023-07-06 ]

Latest patch:

commit c638c4a04be40870ffd41511c32feaa0fb3807cf (HEAD -> bb-10.6-MDEV-31496, origin/bb-10.6-MDEV-31496)
Author: Sergei Petrunia <sergey@mariadb.com>
Date:   Mon Jun 19 17:53:16 2023 +0300
 
    MDEV-31496: Make optimizer handle UCASE(varchar_col)=...

Comment by Pandikrishnan Gurusamy [ 2023-07-12 ]

psergei This fix will be part of the ES 10.6.15 release ?

Comment by Sergei Petrunia [ 2023-07-12 ]

Patch with the latest review input addressed:

commit 80f02c7aadd622feef83cf14f3dd9427d03dbc16 (HEAD -> bb-10.6-MDEV-31496, origin/bb-10.6-MDEV-31496)
Author: Sergei Petrunia <sergey@mariadb.com>
Date:   Mon Jun 19 17:53:16 2023 +0300
 
    MDEV-31496: Make optimizer handle UCASE(varchar_col)=...
    
    (Review input addressed)
    If the properties of the used collation allow, do the following
    equivalent rewrites:

Comment by Sergei Petrunia [ 2023-07-12 ]

pandi.gurusamy we do not have that a goal (should we?) but if I get ok to push in time, it will be.

Comment by Alexander Barkov [ 2023-07-13 ]

The patch
https://github.com/MariaDB/server/commit/80f02c7aadd622feef83cf14f3dd9427d03dbc16
is OK to push.

Comment by Pandikrishnan Gurusamy [ 2023-07-14 ]

ralf.gebhardt As this feature is disabled by default, are we backporting this fix to 10.6.15?

Comment by Sergei Petrunia [ 2023-08-16 ]

feature tree for 11.3: https://github.com/MariaDB/server/tree/bb-11.3-MDEV-31496-sargable-casefold (there, the feature is ON by default)

Comment by Lena Startseva [ 2023-08-23 ]

Verification is done in branch bb-10.6-MDEV-31496. Ok to push.

Comment by Sergei Petrunia [ 2023-08-24 ]

Updated variant for 11.3 is in bb-11.3-MDEV-31496-sargable-casefold-v2:
https://github.com/MariaDB/server/commit/7ca2b9606d3a2402f5db54ed596c472942a0dc3e

Comment by Sergei Petrunia [ 2023-09-12 ]

Pushed into 11.3

Generated at Thu Feb 08 10:24:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.