[MDEV-21265] IN predicate conversion to IN subquery should be allowed for a broader set of datatype comparison Created: 2019-12-10  Updated: 2022-01-09  Resolved: 2020-12-01

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.3, 10.4, 10.5
Fix Version/s: 10.3.28, 10.4.18, 10.5.9

Type: Bug Priority: Critical
Reporter: Varun Gupta (Inactive) Assignee: Varun Gupta (Inactive)
Resolution: Fixed Votes: 2
Labels: None

Issue Links:
Relates
relates to MDEV-20900 IN predicate to IN subquery conversio... Closed
relates to MDEV-23704 Optimizer does not use semijoin optim... Closed
relates to MDEV-24319 Create semi-join materialization tabl... Open

 Description   

Here is a test case demonstrating the issue

MariaDB [test]> create table t1(a int);
Query OK, 0 rows affected (0.047 sec)
 
MariaDB [test]> insert into t1 select seq from seq_1_to_100;
Query OK, 100 rows affected (0.025 sec)
 
MariaDB [test]> set in_predicate_conversion_threshold=9;
Query OK, 0 rows affected (0.001 sec)

Query 1:

SELECT * FROM t1 WHERE a IN (0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
the column a has the same datatype as the elements in the IN list.

MariaDB [test]> explain select * from t1 where a in (0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
+------+--------------+-------------+--------+---------------+--------------+---------+------+------+----------------+
| id   | select_type  | table       | type   | possible_keys | key          | key_len | ref  | rows | Extra          |
+------+--------------+-------------+--------+---------------+--------------+---------+------+------+----------------+
|    1 | PRIMARY      | t1          | ALL    | NULL          | NULL         | NULL    | NULL |  100 |                |
|    1 | PRIMARY      | <subquery2> | eq_ref | distinct_key  | distinct_key | 4       | func |    1 |                |
|    2 | MATERIALIZED | <derived3>  | ALL    | NULL          | NULL         | NULL    | NULL |   10 |                |
|    3 | DERIVED      | NULL        | NULL   | NULL          | NULL         | NULL    | NULL | NULL | No tables used |
+------+--------------+-------------+--------+---------------+--------------+---------+------+------+----------------+
4 rows in set (0.005 sec)

Query 2:

select * from t1 where a in ('0','1','2','3','4','5','6','7','8','9')
the column a has different datatype as the elements in the IN list.

MariaDB [test]> explain select * from t1 where a in ('0','1','2','3','4','5','6','7','8','9');
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
|    1 | SIMPLE      | t1    | ALL  | NULL          | NULL | NULL    | NULL |  100 | Using where |
+------+-------------+-------+------+---------------+------+---------+------+------+-------------+
1 row in set (0.002 sec)

For query 2 it would be good if we could convert the the elements of the IN list to DOUBLE and do the comparison in DOUBLE. By doing this we could do materialization of the IN subquery, when the IN predicate gets converted.



 Comments   
Comment by Sergei Petrunia [ 2020-10-07 ]

Had a discussion with bar about this.

Take-aways:

The best solution is to get the materialized table field to use the collation that is used to do the IN-comparison. This way, current
collation-based limitations will be lifted. This solution might be intrusive,though.

If we continue to use the original charset of the subquery's select list for materialization, then there are collations that are "subsets" of others.

For example, utf8_general_ci is "a subset" of utf8m4_general_ci. The subset relationship is defined as:

 for any strings X,Y : (X <= Y in utf8_general_ci) ==> (X <= Y in utf8mb4_general_ci)

Use of '<=' comparison implies equality and in-equality, so the Materialization-with-unique-key process will produce the same set of value groups that it would have produced if it was done using utf8mb4_general_ci.

The character set library has a couple of functions that check for a very similar relationship. The check looks like so:

    if (!field_cs.encoding_allows_reinterpret_as(new_type.charset))
      return false;
    if (!field_cs.eq_collation_specific_names(new_type.charset))
      return !is_part_of_a_primary_key(field);

(encoding_allows_reinterpret_as() might be a too strong condition)

Comment by Sergei Petrunia [ 2020-10-14 ]

Implemented the above suggestion https://gist.github.com/spetrunia/828d85cec078bf9810f485d43dab4822 and now debugging a testcase:

create table t_inner ( col1 varchar (32) collate utf8_general_ci);
create table t_outer2 ( col1 varchar (32) collate utf8mb4_general_ci);
insert into t_inner select seq from seq_1_to_1000;

Let's see what happens when one tries to make a lookup for a UTF8MB4 character in a UTF8 materialized table:

insert into t_outer2 values (_utf8mb4 X'F09F9883'), ('abc');
select    col1, col1 in (select col1 from t_inner)  from t_outer2;

Looking at how the lookup value is constructed:

  #0  store_key_item::copy_inner (this=0x7ffe6c0935a8) at /sql/sql_select.h:2027
  #1  0x0000555555e11675 in store_key::copy (this=0x7ffe6c0935a8) at /sql/sql_select.h:1910
  #2  0x00005555561be3c6 in subselect_uniquesubquery_engine::copy_ref_key (this=0x7ffe6c0936c8, skip_constants=false) at /sql/item_subselect.cc:4075
  #3  0x00005555561c3067 in subselect_partial_match_engine::exec (this=0x7ffe6c098890) at /sql/item_subselect.cc:6120
  #4  0x00005555561b11a5 in Item_subselect::exec (this=0x7ffe6c013ce0) at /sql/item_subselect.cc:746
  #5  0x00005555561b184b in Item_in_subselect::exec (this=0x7ffe6c013ce0) at /sql/item_subselect.cc:926
  #6  0x00005555561b11f5 in Item_subselect::exec (this=0x7ffe6c013ce0) at /sql/item_subselect.cc:758
  #7  0x00005555561b184b in Item_in_subselect::exec (this=0x7ffe6c013ce0) at /sql/item_subselect.cc:926
  #8  0x00005555561b5224 in Item_in_subselect::val_bool (this=0x7ffe6c013ce0) at /sql/item_subselect.cc:1804
  #9  0x0000555555c51f39 in Item::val_bool_result (this=0x7ffe6c013ce0) at /sql/item.h:1549
  #10 0x000055555611c6ce in Item_in_optimizer::val_int (this=0x7ffe6c015cb8) at /sql/item_cmpfunc.cc:1643

Character conversion takes place, and it seems to return an appropriate error code. The lookup key returned is '?' (which is stored because the original character could not be converted).

But then, I see a lookup is made with the '?' value ? This shouldn't be happening?

Comment by Sergei Petrunia [ 2020-10-15 ]

... But there are WHERE condition checks for both semi-join and non-semijoin case which make it harmless (will post details here)

Comment by Sergei Petrunia [ 2020-10-22 ]

Takeaways from the Tuesday call:
It would be nice to develop a more general solution, so that Materialization is applicable even when there is a datatype mismatch. It is clear that this is possible, in general. However it is still not clear how to implement this.

Comment by Sergei Petrunia [ 2020-11-26 ]

Review: https://lists.launchpad.net/maria-developers/msg12472.html

varun: one more item: please make the commit message mention that it backports some of the code from (and the MDEV which code was backported).

Comment by Panagiotis Malakoudis [ 2022-01-09 ]

If I understand correctly, this should be fixed in 10.3.28 and 10.5.9. However, the behaviour in mariadb from debian 11 (10.5.12) is still like the original report. What am I missing here?

edit: Just to clarify, I am talking about the behaviour for
"select * from t1 where a in ('0','1','2','3','4','5','6','7','8','9')" query

Generated at Thu Feb 08 09:05:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.