Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-3708

LP:928048 - Query containing IN subquery with OR in the where clause returns a wrong result

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • None
    • None
    • None

    Description

      A query with IN subquery that can be converted to a semi-join may return a wrong result in maridb-5.3 if the where clause of the subquery contains OR condition.

      The following test case provides such a query.

      create table t1 (a int, b int);
      insert into t1 values (7,5), (3,3), (5,4), (9,3);

      create table t2 (a int, b int, index i_a(a));

      insert into t2 values
      (4,2), (7,9), (7,4), (3,1), (5,3), (3,1), (9,4), (8,1);

      set optimizer_switch='semijoin=on,materialization=on';
      select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);

      The query in from the test case returns a wrong result if the optimizer switch flags 'semijoin' and 'materialization' are set to 'on', a it returns the correct answer if these flags are set to 'off'.

      MariaDB [test]> set optimizer_switch='semijoin=on,materialization=on';
      Query OK, 0 rows affected (0.00 sec)

      MariaDB [test]> select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);
      ----------+

      a b

      ----------+

      7 5
      3 3
      5 4
      9 3

      ----------+
      4 rows in set (0.00 sec)

      MariaDB [test]> set optimizer_switch='semijoin=off,materialization=off';
      Query OK, 0 rows affected (0.00 sec)

      MariaDB [test]> select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);
      ----------+

      a b

      ----------+

      7 5
      3 3

      ----------+
      2 rows in set (0.00 sec)

      The warning returned by EXPLAIN EXTENDED executed for the query with
      optimizer_switch set to 'semijoin=on,materialization=on'
      shows that it happens because in this case the optimizer generates an invalid execution plan:

      MariaDB [test]> set optimizer_switch='semijoin=on,materialization=on';
      Query OK, 0 rows affected (0.00 sec)

      MariaDB [test]> explain extended
      -> select * from t1 where t1.a in (select a from t2 where t2.a=7 or t2.b<=1);
      --------------------------------------------------------------------------------------

      id select_type table type possible_keys key key_len ref rows filtered Extra

      --------------------------------------------------------------------------------------

      1 PRIMARY t1 ALL NULL NULL NULL NULL 4 100.00  
      1 PRIMARY <subquery2> eq_ref distinct_key distinct_key 5 func 1 100.00  
      2 MATERIALIZED t2 ALL i_a NULL NULL NULL 8 100.00  

      --------------------------------------------------------------------------------------
      3 rows in set, 1 warning (0.00 sec)

      MariaDB [test]> show warnings;
      ------------------------------------------------------------------------------------------------------------------------------------------------------------------

      Level Code Message

      ------------------------------------------------------------------------------------------------------------------------------------------------------------------

      Note 1003 select `test`.`t1`.`a` AS `a`,`test`.`t1`.`b` AS `b` from `test`.`t1` semi join (`test`.`t2`) where (((`test`.`t1`.`a` = 7) or (`test`.`t2`.`b` <= 1)))

      ------------------------------------------------------------------------------------------------------------------------------------------------------------------
      1 row in set (0.00 sec)

      Attachments

        Activity

          Re: Query containing IN subquery with OR in the where clause returns a wrong result
          == Analysis ==

          Equality propagation converts the WHERE clause into this:

          (multiple equal(7, t1.a, t2.a) or (t2.b <= 1))
          and
          multiple equal(t1.a, t2.a)

          This is ok.

          Then, equality substitution produces this WHERE clause:

          (t1.a = 7) or (t2.b <= 1)

          we dont expect this kind of WHERE clauses to be produced when
          SJ-Materialization-Lookup strategy is used.

          With that strategy, we expect that the WHERE clause can be broken into two
          AND-parts:

          • Part#1. depend only on SJ-inner tables
          • Part#2. depend only on SJ-outer tables

          The only thing joining the two parts is the IN-equality. IN-equality is not
          put into the WHERE condition, it is generated and checked inside the
          SJ-Materialization code.

          However, make_join_select() gets this clause:

          (t1.a = 7) or (t2.b <= 1)

          which can only be checked when one has both t1.a and t2.b. This never happens,
          so make_join_select() is unable to attach this condition anywhere, and it is
          never checked.

          psergei Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result == Analysis == Equality propagation converts the WHERE clause into this: (multiple equal(7, t1.a, t2.a) or (t2.b <= 1)) and multiple equal(t1.a, t2.a) This is ok. Then, equality substitution produces this WHERE clause: (t1.a = 7) or (t2.b <= 1) we dont expect this kind of WHERE clauses to be produced when SJ-Materialization-Lookup strategy is used. With that strategy, we expect that the WHERE clause can be broken into two AND-parts: Part#1. depend only on SJ-inner tables Part#2. depend only on SJ-outer tables The only thing joining the two parts is the IN-equality. IN-equality is not put into the WHERE condition, it is generated and checked inside the SJ-Materialization code. However, make_join_select() gets this clause: (t1.a = 7) or (t2.b <= 1) which can only be checked when one has both t1.a and t2.b. This never happens, so make_join_select() is unable to attach this condition anywhere, and it is never checked.

          Re: Query containing IN subquery with OR in the where clause returns a wrong result
          Some fix of this bug was pushed into mysql-5.6 (see the fix for bug #13414014).
          I did not review this fix, so I cannot say anything about about its validity.

          igor Igor Babaev (Inactive) added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result Some fix of this bug was pushed into mysql-5.6 (see the fix for bug #13414014). I did not review this fix, so I cannot say anything about about its validity.

          Re: Query containing IN subquery with OR in the where clause returns a wrong result
          == Analysis continued ==

          Equality propagation assumes certain ordering of tables, and tries to
          move make all substitutions to evaluate as early as possible.

          Graphically, if we draw join order like this:

          >tb1->-tbl2->-tbl3-...

          equality propagation will try pushing conditions to the left. The problem
          with semi-joins (or bushy joins in general) is that there is no global
          ordering anymore.

          If we take a query

          select * from ot1,ot2,ot3,ot4... where col in (select ... from it1, it2)

          and its SJM query plan:

          >ot1-ot2>------------(sjm)>ot3--ot4-....

          it1>it2>+

          then one can push the condition left along the top line through tables otX, or
          go down to tables itX, and follow to the left along the lower level.

          What equality substitution currently does is to assume the ordering of
          (ot1, ot2, it1, it2, sjm, ot3, ...) and mostly ignore the fact that there are
          multiple ways to go to the left.

          This almost works, because materialization is only applied to un-
          correlated queries, which gives us warranty that there are no expressions
          that refer to both an otX.col and an itY.col.
          The only exception is the IN-equality, which usually has the form of

          otX.col=itY.col

          presense of equalities of this form changes a lot.

          psergei Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result == Analysis continued == Equality propagation assumes certain ordering of tables, and tries to move make all substitutions to evaluate as early as possible. Graphically, if we draw join order like this: > tb1 - > - tbl2 - > - tbl3 -... equality propagation will try pushing conditions to the left. The problem with semi-joins (or bushy joins in general) is that there is no global ordering anymore. If we take a query select * from ot1,ot2,ot3,ot4... where col in (select ... from it1, it2) and its SJM query plan: > ot1 - ot2 >------------ (sjm) > ot3 -- ot4 -.... it1 > it2 > + then one can push the condition left along the top line through tables otX, or go down to tables itX, and follow to the left along the lower level. What equality substitution currently does is to assume the ordering of (ot1, ot2, it1, it2, sjm, ot3, ...) and mostly ignore the fact that there are multiple ways to go to the left. This almost works, because materialization is only applied to un- correlated queries, which gives us warranty that there are no expressions that refer to both an otX.col and an itY.col. The only exception is the IN-equality, which usually has the form of otX.col=itY.col presense of equalities of this form changes a lot.

          Re: Query containing IN subquery with OR in the where clause returns a wrong result
          === Deficiency ===

          First, the approach "move each condition as much as possible to the left"
          doesn't guarantee maximum possible filtering anymore. When one is following left
          and he reaches the (sjm) split, he should not take one way. He should use both!

          Consider the queries:

          Q1 select * from t1 where t1.col in (select t2.col from t2 where cond(t2.col))
          Q2 select * from t1 where t1.col in (select t2.col from t2) and cond(t1.col)

          Suppose they both are executed with this query plan:

          id select_type table type
          1 PRIMARY t1 ALL
          1 PRIMARY <subquery2> eq_ref
          2 MATERIALIZED t2 ALL

          it is apparent that, for both queries

          1. cond(t1.col) should be attached to table t1
          2. cond(t2.col) should be attached to table t2

          yet, current equality propagation/substituion scheme will not do that, at least
          not when 'cond' is a condition other than equality.

          Fixing this is a non-trivial though, because the fix will need to change the
          whole concept of equality substition from the

          walk through the WHERE clause and substitute certain occurences of
          "tblX.colA" for "tblY.colB"

          to something else. (the above definition doesn't cover substitution of
          Item_equal objects)

          psergei Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result === Deficiency === First, the approach "move each condition as much as possible to the left" doesn't guarantee maximum possible filtering anymore. When one is following left and he reaches the (sjm) split, he should not take one way. He should use both! Consider the queries: Q1 select * from t1 where t1.col in (select t2.col from t2 where cond(t2.col)) Q2 select * from t1 where t1.col in (select t2.col from t2) and cond(t1.col) Suppose they both are executed with this query plan: id select_type table type 1 PRIMARY t1 ALL 1 PRIMARY <subquery2> eq_ref 2 MATERIALIZED t2 ALL it is apparent that, for both queries 1. cond(t1.col) should be attached to table t1 2. cond(t2.col) should be attached to table t2 yet, current equality propagation/substituion scheme will not do that, at least not when 'cond' is a condition other than equality. Fixing this is a non-trivial though, because the fix will need to change the whole concept of equality substition from the walk through the WHERE clause and substitute certain occurences of "tblX.colA" for "tblY.colB" to something else. (the above definition doesn't cover substitution of Item_equal objects)

          Re: Query containing IN subquery with OR in the where clause returns a wrong result
          === Fixing the wrong query result ===

          Let's set a more modest goal of

          • not producing wrong query results
          • not generating regressions wrt non-semijoin materialization

          The idea is that given this chart

          >ot1-ot2>------------(sjm)>ot3--ot4-....

          it1>it2>+

          we

          "must not attempt to move a condition from one line to another"

          e.g. don't move the condition from the top line into the bottom or vice
          versa. (This bug is a result of us moving one part of OR-clause from
          bottom to the top).

          The implementation of is:

          • Don't substitute SJ-inner columns for SJ-outer, or vice versa (however it
            is always okay to substitute with references to const tables)
          • When exploding multiple-equalities:

          if (equality has a const item)

          { for each equality member tblX.colY produce "tblX.colY=const_item" }

          else
          {
          for X in

          {top_level, each semi-join nest}

          { FIRST= first member in X; for each other element OTHER produce "FIRST=OTHER" }

          }

          As for not producing equalities that were "fixed on the upper level":
          don't produce them only as long as they would have been produced at this
          level.

          psergei Sergei Petrunia added a comment - Re: Query containing IN subquery with OR in the where clause returns a wrong result === Fixing the wrong query result === Let's set a more modest goal of not producing wrong query results not generating regressions wrt non-semijoin materialization The idea is that given this chart > ot1 - ot2 >------------ (sjm) > ot3 -- ot4 -.... it1 > it2 > + we "must not attempt to move a condition from one line to another" e.g. don't move the condition from the top line into the bottom or vice versa. (This bug is a result of us moving one part of OR-clause from bottom to the top). The implementation of is: Don't substitute SJ-inner columns for SJ-outer, or vice versa (however it is always okay to substitute with references to const tables) When exploding multiple-equalities: if (equality has a const item) { for each equality member tblX.colY produce "tblX.colY=const_item" } else { for X in {top_level, each semi-join nest} { FIRST= first member in X; for each other element OTHER produce "FIRST=OTHER" } } As for not producing equalities that were "fixed on the upper level": don't produce them only as long as they would have been produced at this level.

          Launchpad bug id: 928048

          ratzpo Rasmus Johansson (Inactive) added a comment - Launchpad bug id: 928048

          People

            psergei Sergei Petrunia
            igor Igor Babaev (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.