[MDEV-8566] Dependend_subquery with in doesn't use (primary) index Created: 2015-07-31  Updated: 2021-02-15

Status: Stalled
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 5.3.12, 5.5, 10.0, 10.1
Fix Version/s: 10.2

Type: Bug Priority: Major
Reporter: Marc Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: upstream


 Description   

Having the following query causes the database to make a table scan:

SELECT *,(SELECT GROUP_CONCAT(Name) FROM user WHERE ID IN (a,b,c)) FROM `test` WHERE a!=0

Explain returns:

+------+--------------------+-------+------+---------------+------+---------+------+------+-------------+
| id   | select_type        | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+------+--------------------+-------+------+---------------+------+---------+------+------+-------------+
|    1 | PRIMARY            | test  | ALL  | a             | NULL | NULL    | NULL | 1335 | Using where |
|    2 | DEPENDENT SUBQUERY | user  | ALL  | NULL          | NULL | NULL    | NULL | 4545 | Using where |
+------+--------------------+-------+------+---------------+------+---------+------+------+-------------+

Table definitions:

CREATE TABLE `test` (
  `ID` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `a` int(10) unsigned NOT NULL,
  `b` int(10) unsigned NOT NULL,
  `c` int(10) unsigned NOT NULL,
  PRIMARY KEY (`ID`),
  KEY `a` (`a`)
) ENGINE=Aria

CREATE TABLE `user` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(10) COLLATE latin1_german1_ci NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=Aria

As you might guess the original expression is more complex. As you can see, the result of the subquery can only have 3 entries which is grouped. The Subquery is dependend so this has to be performed on each resulting row - correct. But why does the subquery don't use the primary index which would only use 3 result-entries, not 4545



 Comments   
Comment by Elena Stepanova [ 2015-08-01 ]

Assigning to psergey to determine whether there is a bug in here. If there is, MySQL is also affected (tried 5.7).

Comment by Sergei Petrunia [ 2015-08-05 ]

Ok so the subquery is

(SELECT GROUP_CONCAT(Name) FROM user WHERE user.ID IN (test.a, test.b, test.c)

and the question is why doesn't the optimizer use an index lookup on user.ID.

I'll try to answer:

1. MariaDB (or MySQL) optimizer doesn't support ref access on anything other than single equalities. For example:

MariaDB [j6]> explain select * from user, test where user.ID=test.a;
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------+
| id   | select_type | table | type   | possible_keys | key     | key_len | ref       | rows | Extra |
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------+
|    1 | SIMPLE      | test  | ALL    | a             | NULL    | NULL    | NULL      | 1001 |       |
|    1 | SIMPLE      | user  | eq_ref | PRIMARY       | PRIMARY | 4       | j6.test.a |    1 |       |
+------+-------------+-------+--------+---------------+---------+---------+-----------+------+-------+

OK.

If the optimizer is not able to infer a single equality from the condition and faces the prospect of doing a cross-product join,
it will try to use "Range checked for each record" optimization:

MariaDB [j6]> explain select * from user, test where user.ID in (test.a, test.b);
+------+-------------+-------+------+---------------+------+---------+------+---------+------------------------------------------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra                                          |
+------+-------------+-------+------+---------------+------+---------+------+---------+------------------------------------------------+
|    1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL |    1001 |                                                |
|    1 | SIMPLE      | user  | ALL  | PRIMARY       | NULL | NULL    | NULL | 1000000 | Range checked for each record (index map: 0x1) |
+------+-------------+-------+------+---------------+------+---------+------+---------+------------------------------------------------+

The execution is basically this:

for each row R in table test {
  if (try to construct range/index_merge access from user.ID in (R.a, R.b))
     use range access;
  else
    use full scan;

This is slow but sometimes it is just what one needs.

Comment by Sergei Petrunia [ 2015-08-05 ]

Trying to move to subqueries:

ref access works across subquery bounds:

MariaDB [j6]> explain select * ,(select group_concat(name) from user where user.ID=test.a ) from test;
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------+
| id   | select_type        | table | type   | possible_keys | key     | key_len | ref       | rows | Extra |
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------+
|    1 | PRIMARY            | test  | ALL    | NULL          | NULL    | NULL    | NULL      | 1001 |       |
|    2 | DEPENDENT SUBQUERY | user  | eq_ref | PRIMARY       | PRIMARY | 4       | j6.test.a |    1 |       |
+------+--------------------+-------+--------+---------------+---------+---------+-----------+------+-------+

but Range checked for each record does not:

MariaDB [j6]> explain select * ,(select group_concat(name) from user where user.ID  IN (test.a,test.b)) from test;
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
| id   | select_type        | table | type | possible_keys | key  | key_len | ref  | rows    | Extra       |
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+
|    1 | PRIMARY            | test  | ALL  | NULL          | NULL | NULL    | NULL |    1001 |             |
|    2 | DEPENDENT SUBQUERY | user  | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using where |
+------+--------------------+-------+------+---------------+------+---------+------+---------+-------------+

note user.possible_keys=NULL.

Comment by Marc [ 2015-08-05 ]

I thought the algorithm executes each subquery on the base of the result set, if the index/merge doesn't work.
If these subqueries get executed after the main query, it would result in less database questions, can use indices and lead to the same result.

Do you think this is sth. which would be an improvement, or do you just think it should be left as is?

Comment by Sergei Petrunia [ 2015-08-05 ]

I am concerned about expanding the scope of Range checked for each record strategy. Its execution is expensive, and there is no way to predict whether it is worth using.

Comment by Sergei Petrunia [ 2015-08-05 ]

> I thought the algorithm executes each subquery on the base of the result set, if the index/merge doesn't work.
> If these subqueries get executed after the main query, it would result in less database questions, can use indices and lead to the same result.

The subquery will be invoked multiple times during execution of the top-level select.

Each execution will have use the same query plan. And the problem is that there is no query plan that is "dynamic" (depends on the top-level query) and fast to execute.

Comment by Sergei Petrunia [ 2015-08-05 ]

Do you think this is sth. which would be an improvement, or do you just think it should be left as is?

I think fixing this would require a lot of effort, and is probably not worth it (unless this turns out to be a very common query pattern (but I believe it is not)).

Thanks for taking time to report this, anyway.

Comment by Marc [ 2015-08-05 ]

I'm not familiar with the source code of mariadb, to argue against or for it.

From application point of view, I try to get a most complete result which is displayed. Looping on the result on application level has overhead. But currently the loop with multiple queries will be faster than the query with a subquery in mariadb.

I see, I thought each of the subqueries, if not merged, would be as if it is executed by the application - but since on the same memory, machine, ... faster than through network.

Comment by Sergei Petrunia [ 2015-08-05 ]

Yes. This is a reasonable assumption. However, certain limitations in the optimizer make it a bit difficult.

I think I'll reopen and discuss with other optimizer devs.

Comment by Sergei Petrunia [ 2015-08-05 ]

For the record

test=# explain select * ,(select sum(length(name)) from user1 where user1.ID  IN (test.a,test.b)) from test;
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..16920.99 rows=1000 width=16)
   SubPlan 1
     ->  Aggregate  (cost=16.89..16.90 rows=1 width=6)
           ->  Index Scan using user1_pkey on user1  (cost=0.42..16.88 rows=2 width=6)
                 Index Cond: (id = ANY (ARRAY[test.a, test.b]))
 Planning time: 0.342 ms

Generated at Thu Feb 08 07:28:06 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.