Details

    Description

      Add support for FULL OUTER JOIN

      https://www.w3schools.com/sql/sql_join_full.asp

      One of the way how to implement is to re-write the query

      select t1.*, t2.* from t1 full outer join t2 on P(t1,t2)
      

      into the following union all:

      select t1.*, t2.* from t1 left outer join t2 on P(t1,t2)
      union all
      select t1.*,t2.* from t2 left outer join t1 on P(t1,t2) where t1.a is null
      

      Here t1.a is some non-nullable column of t1 (e.g. the column of single column primary key).

      Solution considerations

      contents

      0. Parser
      1. Conversion to LEFT/INNER join
      2. FULL OUTER JOIN Execution
      2.1 Approach #1: implement full outer join as an operation in executor
      2.1.1 Extend MariaDB's OUTER JOIN code to compute FULL OUTER JOIN?
      2.2 Approach #2: rewrite into two OUTER joins
      2.2.1 How to check for non-matches if all columns are NULL-able
      2.2.2 Non-symmetry
      3. Choosing between rewrite and operator
      4. FULL OUTER JOIN as an operator
      4.1 Implementation plan
      4.1.1 Before the optimizer
      4.1.2 convert to left/inner joins
      4.1.3 Join optimization
      4.1.4 Plan refinement
      4.1.5 Execution
      

      0. Parser

      We'll need to modify the parser and semantic analysis phases. This seems to be fairly straightforward?

      1. Conversion to LEFT/INNER join

      We will need code that analyzes/rewrites FULL OUTER JOIN into LEFT/RIGHT/INNER join based on NULL-rejecting predicates in the WHERE (or ON expressions that are outside the outer join in question).

      2. FULL OUTER JOIN Execution

      One can think of:
      Approach #1: implement full outer join as an operation in executor.
      Approach #2: rewrite it into two inner joins

      2.1 Approach #1: implement full outer join as an operation in executor

      This is what e.g. Cockroach and PostgreSQL seem to do.

      FULL OUTER JOIN can be handled by Merge Join (or similar) operation which is symmetric wrt its inputs and/or can emit rows that didn't have matches.

      (note: for PostgreSQL, I saw FULL OUTER JOIN to be handled by a variant of Hash join and Merge Join. PostgreSQL seems to require that FULL OUTER JOIN conditions have equalities. This error is produced

      ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join condition
      

      when one doesn't have a join equality in the ON expression.)

      2.1.1 Extend MariaDB's OUTER JOIN code to compute FULL OUTER JOIN?

      Can't think of anything really efficient here.
      We basically need to pick one side of the join to be "inner" and then compute an outer join but also keep track of which record combinations of the "inner" side were hit and which weren't?
      All approaches seem to be quite expensive.

      2.2 Approach #2: rewrite into two OUTER joins

      In the query

        SELECT
          ...
          t1 FULL OUTER JOIN t2 ON fjcond
          ...
      

      where t1 and t2 may be base tables or more complex expressions,

      Rewrite the

         t1 FULL OUTER JOIN t2 ON fjcond
      

      part into

        (
          select ... from t1 left join t2 on fjcond
          union all
          select ... from t2 left join t1 on fjcond where t1.pk IS NULL
        ) as full_oj_tbl
      

      Here we assume that t1 has a not-null column "pk", so we can construct a condition meaning "outer join had no matches".
      The "..." denotes all the needed columns.

      2.2.1 How to check for non-matches if all columns are NULL-able

      If we cannot find a column t1.pk such that we could add a t1.pk IS NULL, we can rewrite into

          select ... from t2 left join ( select *, '1' as NULL_MARKER from t1) on fjcond where NULL_MARKER IS NULL
      

      (In the code, see Item_direct_view_ref::null_ref_table).

      2.2.2 Non-symmetry

      Note that the rewrite is not symmetric wrt t1 and t2. Does it matter? Well, one side will be able to benefit from outer join's Not-Exists optimization while the other won't?

      3. Choosing between rewrite and operator

      Suppose we have

       left_table FULL OUTER JOIN right_table ON expr
      

      Here both left_table and right_table can be nested joins.
      w.l.o.g. suppose for inner join the best join order is left_table -> right_table.

      when we compute FULL OUTER JOIN with some computation algorithm, we

      • run left_table -> right_table
      • also keep track of what in right_table we saw as the matches.
      • then, run compute right_table and filter out the rows (or record combinations) that were a match and are already in the output (1).

      When we do rewrite, we basically do

      • run left_table -> right_table
      • run right_table -> left_table with extra condition specifying we want non-matches (2).

      It seems, the check (1) should normally be cheaper than (2).
      (1) is a lookup to check a rowid value (or a record combination), while (2) can be much more expensive.

      4. FULL OUTER JOIN as an operator

      For

       (t10 ... t11 ...  t1N) FULL OUTER JOIN (t20... t21 ... t2K)
      

      The execution could be as follows:

      Execute (t10 ... t11 ... t1N) LEFT OUTER JOIN (t20... t21 ... t2K) and note which record combinations of (t20, ... ,t2K) were in the output. This can be done for example by storing concat(t20.rowid, ... t2K.rowid) in a temporary table with a unique key.

      When we're done, execute a join nest representing the best plan for t20 ... t2K (that is, assuming there's no join prefix of t10...t1N). Filter out record combinations that have been previously noted (e.g. by checking the concatenation of the rowids in the temp. table).

      The above requires that there are two sub-plans produced for computing a join of (t20... t21 ... t2K) :
      1. As a suffix of LEFT OUTER JOIN operation: (t10 ... t11 ... t1N) LEFT OUTER JOIN (t20... t21 ... t2K)
      2. Without the prefix of t10...t1N, just as t20... t21 ... t2K.

      The second part will need to have their own POSITION arrays and JOIN_TAB objects.
      Will it also need TABLE or handler* objects?

      4.1 Implementation plan

      4.1.1 Before the optimizer

      Make the parser accept FULL OUTER JOIN construct.
      (I searched if this was done as part of GSoC projects but found nothing).

      Handle name resolution (should be straighforward)

      4.1.2 convert to left/inner joins

      One needs to do two things in simplify_joins:

      1. Currently, the code there uses this equivalence:

      t1 JOIN (t2 LEFT JOIN t3 ON cond_t2_t3) = (t1 JOIN t2) LEFT JOIN t3 ON cond_t2_t3  
      

      It doesn't hold when LEFT JOIN is replaced with FULL OUTER JOIN.

      Second, FULL OUTER join can be converted into LEFT, RIGHT, or INNER join based on NULL-rejecting predicates.

      4.1.3 Join optimiation

      Both sides of FULL OUTER JOIN follow the "no-interleaving" rule.
      Join order can start with either side.

      (TODO: can "unrelated" tables come between the left and the right side? In the current execution algorithm, they can't. This seems to be an okay limitation at the moment)

      When we've put both sides into the join prefix, we also create a query plan for producing the left-NULL-complements. That is, if we have

        tables1 FULL OUTER JOIN tables2 
      

      and we have constructed a join prefix:

        some_prefix tables1 tables2 
      

      Then now we also build a prefix of

        some_prefix tables2
      

      and add the join_prefix_cost(some_prefix tables2) - join_prefix_cost(some_prefix) to the cost of join prefix we're considering.

      I don't see any easy way to save the join order for tables2 here. We can discard it after getting its cost and then we will rebuild it after we've picked the join order.

      4.1.4 Plan refinement

      Here we would need to create an extra "join nest" for the left-NULL-complements sub-plan.
      It can work in a similar way semi-join-materialization nests.
      There is a separate array of JOIN_TAB structures. An entry in the top-level JOIN_TAB array has a pointer to the child array which is invoked as necessary.

      So, for the

       (t10 JOIN  t11 JOIN  t12) FULL OUTER JOIN (t20 JOIN t21 JOIN t2K)
      

      the query plan might look like:

      (t10 t11 t12) (t20 t21 t22)       <-- call this primary-join-order
                               |
                              (t22-t21-t20) 
                                        ^--- call this null-compl-join-order
      

      It is obvious that null-compl-join-order will need its own JOIN_TAB objects.
      Will it need its own handler objects?

      Current code does one-off handler object initalizations like

      • handler->rnd_init()
      • handler->index_init(indexnr) (may also also enable HA_EXTRA_KEYREAD for this specific index)
      • handler->idx_cond_push()
      • handler->rowid_filter_push()

      Note that active index and so pushed index condition and rowid_filter may be different between primary-join-order and null-compl-join-order.
      Considering this, null-compl-join-order will need its own handler objects? (The other option is to re-initialize handlers).

      But join execution code reaches the handler object through TABLE pointer, tab->table->file. Should we re-point tab->table all the time?

      4.1.4.1 Join buffer

      Join

      4.1.4.2 Other details

      We should also take care that the output of t1 FULL OUTER JOIN t2 is not sorted by t1.key even if we use t1.key for retrieval.

      4.1.5 Execution

      The base join order has

        some_prefix tables1 tables2 
      

      after tables2, we put the concatenation of rowids for tables2 into a temporary table.

      When we've finished join execution for tables2, we set tables1 to have NULL-complemented records and start executing the join order from the join nest.
      Each generated record combination is checked in the temporary table. Those that have a match are discarded.

      Other notes: Can join buffering work "across" the bounds of FULL OUTER JOIN nests? I think they can be disabled in the first version .

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              Juan Juan Telleria
              Votes:
              16 Vote for this issue
              Watchers:
              29 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.