Currently if MariaDB checks whether possibility to use range access based on [NOT] IN predicate with a long list of values it build SEL_TREEs taking huge amount of memory.
Meanwhile using CTE specified by table value constructor such a predicate could be transformed into an equivalent [NOT] IN subquery:
SELECT ... WHERE ... (expr1, ...) [NOT] IN (value_list) ...;
=>
WITH t(col1, ...) AS (SELECT * FROM VALUES value_list)
SELECT ... WHERE ... (expr1, ...) [NOT] IN (SELECT * FROM t)...;
Attachments
Issue Links
causes
MDEV-14281Wrong result from query with NOT IN predicate in WHERE
Closed
MDEV-17795Query with long IN(...) list in WHERE about 40 times slower
Closed
MDEV-20105Case for bringing in_subquery_conversion_threshold back in next possible release
Closed
MDEV-20900IN predicate to IN subquery conversion causes performance regression
Closed
MDEV-33560Spider: query with long IN clause has performance issues
Closed
is blocked by
MDEV-12172Implement tables specified by table value constructors.
Optimization that transforms IN predicate into IN subquery that uses TVC was made.
It works this way:
SELECT ... WHERE (expr1, ...) [NOT] IN (value_list) ...;
=>
SELECT ... WHERE (expr1, ...) [NOT] IN (SELECT * FROM (VALUES value_list) AS tvc_name) ...;
Optimization works when IN predicate is used in WHERE clause of the query or when it is ON expression of some join. It also works when IN predicate is in derived tables, CTEs and subqueries.
What's done:
1. global variable in_subquery_conversion_threshold that controls optimization working
Optimization works only when the total number of scalar values used in IN predicate is more than this variable
2. transformer Item_func_in::in_predicate_to_in_subs_transformer that transforms IN predicate in IN subselect
3. creation of TVC specified by values from IN predicate
Galina Shalygina (Inactive)
added a comment - - edited Final evaluation results:
( see the the list of commits here https://github.com/MariaDB/server/compare/10.3...shagalla:10.3-mdev12172 )
Optimization that transforms IN predicate into IN subquery that uses TVC was made.
It works this way:
SELECT ... WHERE (expr1, ...) [NOT] IN (value_list) ...;
=>
SELECT ... WHERE (expr1, ...) [NOT] IN (SELECT * FROM (VALUES value_list) AS tvc_name) ...;
Optimization works when IN predicate is used in WHERE clause of the query or when it is ON expression of some join. It also works when IN predicate is in derived tables, CTEs and subqueries.
What's done:
1. global variable in_subquery_conversion_threshold that controls optimization working
Optimization works only when the total number of scalar values used in IN predicate is more than this variable
2. transformer Item_func_in::in_predicate_to_in_subs_transformer that transforms IN predicate in IN subselect
3. creation of TVC specified by values from IN predicate
4. creation of new statement
SELECT * FROM (VALUES value_list) AS tvc_name
5. new test file: opt_tvc.test
To clone this project from github you need to do:
git clone https://github.com/shagalla/server 10.3-mdev12172
cd 10.3-mdev12172
git branch 10.3-mdev12172
git checkout 10.3-mdev12172
maria-git/10.3-mdev12172> git pull https://github.com/shagalla/server 10.3-mdev12172
SELECT CONCAT("9906643000004NEXT", LPAD(a.next_id_int, 16, "0")) AS next_id, a.ts - MOD(a.ts, 300), ROUND(AVG(value), 3), AVG(quality) FROM leitsystem.zr_te_value a left join t ON a.next_id_int=t.next_id_int WHERE t.next_id_int IS NULL AND a.ts >= 1542668700 AND a.ts < 1542755400 AND a.zr_art = 12 GROUP BY a.next_id, a.ts DIV 300 ORDER BY a.ts;
it can help to check if the issue comes from icp shortcut, where filtering may take place after sending record or from full join cost ?
VAROQUI Stephane
added a comment - - edited Can i suggest to experiment
create temporary table t(next_id_int , key idx(next_id_int) ) engine =memory;
insert t select ('0000000000006447', '0000000000005272', ... );
SELECT CONCAT("9906643000004NEXT", LPAD(a.next_id_int, 16, "0")) AS next_id, a.ts - MOD(a.ts, 300), ROUND(AVG(value), 3), AVG(quality) FROM leitsystem.zr_te_value a left join t ON a.next_id_int=t.next_id_int WHERE t.next_id_int IS NULL AND a.ts >= 1542668700 AND a.ts < 1542755400 AND a.zr_art = 12 GROUP BY a.next_id, a.ts DIV 300 ORDER BY a.ts;
it can help to check if the issue comes from icp shortcut, where filtering may take place after sending record or from full join cost ?
igor the column type is int. I told the application developers long ago they should give me int at this, but as it never made problems i did not put force on that.
Server version: 10.3.10-MariaDB-1:10.3.10+maria~bionic-log mariadb.org binary distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> set @@in_predicate_conversion_threshold=1000000;
ERROR 1193 (HY000): Unknown system variable 'in_predicate_conversion_threshold'
MariaDB [(none)]>
Frank Sagurna
added a comment - - edited igor the column type is int. I told the application developers long ago they should give me int at this, but as it never made problems i did not put force on that.
Server version: 10.3.10-MariaDB-1:10.3.10+maria~bionic-log mariadb.org binary distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [(none)]> set @@in_predicate_conversion_threshold=1000000;
ERROR 1193 (HY000): Unknown system variable 'in_predicate_conversion_threshold'
MariaDB [(none)]>
stephane@skysql.com:
I had to adapt it a little bit, but that was fast enough.
But it gives me totally different row count. The original query gives me about 400k rows, this only about 80k rows (That are the numbers that phpmyadmin shows me):
createtemporarytable t(next_id_int INT , key idx(next_id_int) ) engine =memory;
INSERTINTO t (next_id_int) VALUES (6447),(5272),...,(3104);
FROM leitsystem.zr_te_value a left join t ON a.next_id_int=t.next_id_int
WHERE t.next_id_int IS NULL AND a.ts >= 1542668700 AND a.ts < 1542755400 AND a.zr_art = 12 GROUP BY a.next_id_int, a.ts DIV 300 ORDER BY a.ts LIMIT 0, 25;
Frank Sagurna
added a comment - - edited stephane@skysql.com :
I had to adapt it a little bit, but that was fast enough.
But it gives me totally different row count. The original query gives me about 400k rows, this only about 80k rows (That are the numbers that phpmyadmin shows me):
create temporary table t(next_id_int INT , key idx(next_id_int) ) engine =memory;
INSERT INTO t (next_id_int) VALUES (6447),(5272),...,(3104);
SELECT CONCAT( "9906643000004NEXT" , LPAD(a.next_id_int, 16, "0" )) AS next_id, a.ts - MOD(a.ts, 300), ROUND( AVG (value), 3), AVG (quality)
FROM leitsystem.zr_te_value a left join t ON a.next_id_int=t.next_id_int
WHERE t.next_id_int IS NULL AND a.ts >= 1542668700 AND a.ts < 1542755400 AND a.zr_art = 12 GROUP BY a.next_id_int, a.ts DIV 300 ORDER BY a.ts;
# Time: 181121 9:48:07
# User@Host:
# Thread_id: 525 Schema: leitsystem QC_hit: No
# Query_time: 4.268082 Lock_time: 0.000173 Rows_sent: 25 Rows_examined: 1474207
# Rows_affected: 0 Bytes_sent: 1793
# Tmp_tables: 1 Tmp_disk_tables: 0 Tmp_table_sizes: 18162792
# Full_scan: No Full_join: No Tmp_table: Yes Tmp_table_on_disk: No
# Filesort: Yes Filesort_on_disk: No Merge_passes: 0 Priority_queue: Yes
#
# explain: id select_type table type possible_keys key key_len ref rows r_rows filtered r_filtered Extra
# explain: 1 SIMPLE a range PRIMARY,ts ts 4 NULL 2063432 981521.00 75.00 50.19 Using where; Using temporary; Using filesort
# explain: 1 SIMPLE t ref idx idx 5 leitsystem.a.next_id_int 2 0.84 100.00 100.00 Using where
#
use leitsystem;
SET timestamp=1542790087;
SELECT CONCAT("9906643000004NEXT", LPAD(a.next_id_int, 16, "0")) AS next_id, a.ts - MOD(a.ts, 300), ROUND(AVG(value), 3), AVG(quality)
FROM leitsystem.zr_te_value a left join t ON a.next_id_int=t.next_id_int
WHERE t.next_id_int IS NULL AND a.ts >= 1542668700 AND a.ts < 1542755400 AND a.zr_art = 12 GROUP BY a.next_id_int, a.ts DIV 300 ORDER BY a.ts LIMIT 0, 25;
Tx Frank
Who is correct that is the question, in the original query you get string IN ('0000xxxx') witch looks wrong at first place if data type of next_id_int is INTEGER. it looks like the optimization will build a temp table with the datatype of NOT IN list : VARCHAR where in your table you get INTEGER, forcing conversion from int to string and not equivalent from string to int.
1 <=> "1"
1 <=> "01"
"1" <> "01"
"1" <=> "1"
VAROQUI Stephane
added a comment - Tx Frank
Who is correct that is the question, in the original query you get string IN ('0000xxxx') witch looks wrong at first place if data type of next_id_int is INTEGER. it looks like the optimization will build a temp table with the datatype of NOT IN list : VARCHAR where in your table you get INTEGER, forcing conversion from int to string and not equivalent from string to int.
1 <=> "1"
1 <=> "01"
"1" <> "01"
"1" <=> "1"
Frank,
"The INT conversion should not be the problem, as for both tests i used a list of INTs."
This is exactly the cause of the problem.
Anyway you can report it as a bug.
Igor Babaev (Inactive)
added a comment - Frank,
"The INT conversion should not be the problem, as for both tests i used a list of INTs."
This is exactly the cause of the problem.
Anyway you can report it as a bug.
Can you report it as a regular bug a close task does not look the good place ! you can link that task as a reference !
/svar
VAROQUI Stephane
added a comment - Hi Frank !
Can you report it as a regular bug a close task does not look the good place ! you can link that task as a reference !
/svar
MariaDB [(none)]> set @@in_predicate_conversion_threshold=1000000;
ERROR 1193 (HY000): Unknown system variable 'in_predicate_conversion_threshold'
MariaDB [(none)]>
Frank Sagurna
added a comment - Should i open an additional bug for this one?
MariaDB [(none)] > set @@in_predicate_conversion_threshold=1000000;
ERROR 1193 (HY000): Unknown system variable 'in_predicate_conversion_threshold'
MariaDB [(none)] >
Final evaluation results:
( see the the list of commits here https://github.com/MariaDB/server/compare/10.3...shagalla:10.3-mdev12172 )
Optimization that transforms IN predicate into IN subquery that uses TVC was made.
It works this way:
SELECT ... WHERE (expr1, ...) [NOT] IN (value_list) ...;
=>
SELECT ... WHERE (expr1, ...) [NOT] IN (SELECT * FROM (VALUES value_list) AS tvc_name) ...;
Optimization works when IN predicate is used in WHERE clause of the query or when it is ON expression of some join. It also works when IN predicate is in derived tables, CTEs and subqueries.
What's done:
1. global variable in_subquery_conversion_threshold that controls optimization working
Optimization works only when the total number of scalar values used in IN predicate is more than this variable
2. transformer Item_func_in::in_predicate_to_in_subs_transformer that transforms IN predicate in IN subselect
3. creation of TVC specified by values from IN predicate
4. creation of new statement
SELECT * FROM (VALUES value_list) AS tvc_name
5. new test file: opt_tvc.test
To clone this project from github you need to do:
git clone https://github.com/shagalla/server 10.3-mdev12172
cd 10.3-mdev12172
git branch 10.3-mdev12172
git checkout 10.3-mdev12172
maria-git/10.3-mdev12172> git pull https://github.com/shagalla/server 10.3-mdev12172