[MDEV-13648] Add FULL OUTER JOIN to MariaDB - Jira

Details

Type: New Feature
Status: Open (View Workflow)
Priority: Critical
Resolution: Unresolved
Fix Version/s: None
Component/s: Optimizer
Labels:

Description

Add support for FULL OUTER JOIN

https://www.w3schools.com/sql/sql_join_full.asp

One of the way how to implement is to re-write the query

select t1.*, t2.* from t1 full outer join t2 on P(t1,t2)

into the following union all:

select t1.*, t2.* from t1 left outer join t2 on P(t1,t2)

union all

select t1.*,t2.* from t2 left outer join t1 on P(t1,t2) where t1.a is null

Here t1.a is some non-nullable column of t1 (e.g. the column of single column primary key).

Solution considerations

contents

0. Parser

1. Conversion to LEFT/INNER join

2. FULL OUTER JOIN Execution

2.1 Approach #1: implement full outer join as an operation in executor

2.1.1 Extend MariaDB's OUTER JOIN code to compute FULL OUTER JOIN?

2.2 Approach #2: rewrite into two OUTER joins

2.2.1 How to check for non-matches if all columns are NULL-able

2.2.2 Non-symmetry

3. Choosing between rewrite and operator

4. FULL OUTER JOIN as an operator

4.1 Implementation plan

4.1.1 Before the optimizer

4.1.2 convert to left/inner joins

4.1.3 Join optimization

4.1.4 Plan refinement

4.1.5 Execution

0. Parser

We'll need to modify the parser and semantic analysis phases. This seems to be fairly straightforward?

1. Conversion to LEFT/INNER join

We will need code that analyzes/rewrites FULL OUTER JOIN into LEFT/RIGHT/INNER join based on NULL-rejecting predicates in the WHERE (or ON expressions that are outside the outer join in question).

2. FULL OUTER JOIN Execution

One can think of:
Approach #1: implement full outer join as an operation in executor.
Approach #2: rewrite it into two inner joins

2.1 Approach #1: implement full outer join as an operation in executor

This is what e.g. Cockroach and PostgreSQL seem to do.

FULL OUTER JOIN can be handled by Merge Join (or similar) operation which is symmetric wrt its inputs and/or can emit rows that didn't have matches.

(note: for PostgreSQL, I saw FULL OUTER JOIN to be handled by a variant of Hash join and Merge Join. PostgreSQL seems to require that FULL OUTER JOIN conditions have equalities. This error is produced

ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join condition

when one doesn't have a join equality in the ON expression.)

2.1.1 Extend MariaDB's OUTER JOIN code to compute FULL OUTER JOIN?

Can't think of anything really efficient here.
We basically need to pick one side of the join to be "inner" and then compute an outer join but also keep track of which record combinations of the "inner" side were hit and which weren't?
All approaches seem to be quite expensive.

2.2 Approach #2: rewrite into two OUTER joins

In the query

  SELECT

...

    t1 FULL OUTER JOIN t2 ON fjcond

...

where t1 and t2 may be base tables or more complex expressions,

Rewrite the

   t1 FULL OUTER JOIN t2 ON fjcond

part into

    select ... from t1 left join t2 on fjcond

    union all

    select ... from t2 left join t1 on fjcond where t1.pk IS NULL

  ) as full_oj_tbl

Here we assume that t1 has a not-null column "pk", so we can construct a condition meaning "outer join had no matches".
The "..." denotes all the needed columns.

2.2.1 How to check for non-matches if all columns are NULL-able

If we cannot find a column t1.pk such that we could add a t1.pk IS NULL, we can rewrite into

    select ... from t2 left join ( select *, '1' as NULL_MARKER from t1) on fjcond where NULL_MARKER IS NULL

(In the code, see Item_direct_view_ref::null_ref_table).

2.2.2 Non-symmetry

Note that the rewrite is not symmetric wrt t1 and t2. Does it matter? Well, one side will be able to benefit from outer join's Not-Exists optimization while the other won't?

3. Choosing between rewrite and operator

Suppose we have

 left_table FULL OUTER JOIN right_table ON expr

Here both left_table and right_table can be nested joins.
w.l.o.g. suppose for inner join the best join order is left_table -> right_table.

when we compute FULL OUTER JOIN with some computation algorithm, we

run left_table -> right_table
also keep track of what in right_table we saw as the matches.
then, run compute right_table and filter out the rows (or record combinations) that were a match and are already in the output (1).

When we do rewrite, we basically do

run left_table -> right_table
run right_table -> left_table with extra condition specifying we want non-matches (2).

It seems, the check (1) should normally be cheaper than (2).
(1) is a lookup to check a rowid value (or a record combination), while (2) can be much more expensive.

4. FULL OUTER JOIN as an operator

For

 (t10 ... t11 ...  t1N) FULL OUTER JOIN (t20... t21 ... t2K)

The execution could be as follows:

Execute (t10 ... t11 ... t1N) LEFT OUTER JOIN (t20... t21 ... t2K) and note which record combinations of (t20, ... ,t2K) were in the output. This can be done for example by storing concat(t20.rowid, ... t2K.rowid) in a temporary table with a unique key.

When we're done, execute a join nest representing the best plan for t20 ... t2K (that is, assuming there's no join prefix of t10...t1N). Filter out record combinations that have been previously noted (e.g. by checking the concatenation of the rowids in the temp. table).

The above requires that there are two sub-plans produced for computing a join of (t20... t21 ... t2K) :
1. As a suffix of LEFT OUTER JOIN operation: (t10 ... t11 ... t1N) LEFT OUTER JOIN (t20... t21 ... t2K)
2. Without the prefix of t10...t1N, just as t20... t21 ... t2K.

The second part will need to have their own POSITION arrays and JOIN_TAB objects.
Will it also need TABLE or handler* objects?

4.1 Implementation plan

4.1.1 Before the optimizer

Make the parser accept FULL OUTER JOIN construct.
(I searched if this was done as part of GSoC projects but found nothing).

Handle name resolution (should be straighforward)

4.1.2 convert to left/inner joins

One needs to do two things in simplify_joins:

1. Currently, the code there uses this equivalence:

t1 JOIN (t2 LEFT JOIN t3 ON cond_t2_t3) = (t1 JOIN t2) LEFT JOIN t3 ON cond_t2_t3

It doesn't hold when LEFT JOIN is replaced with FULL OUTER JOIN.

Second, FULL OUTER join can be converted into LEFT, RIGHT, or INNER join based on NULL-rejecting predicates.

4.1.3 Join optimiation

Both sides of FULL OUTER JOIN follow the "no-interleaving" rule.
Join order can start with either side.

(TODO: can "unrelated" tables come between the left and the right side? In the current execution algorithm, they can't. This seems to be an okay limitation at the moment)

When we've put both sides into the join prefix, we also create a query plan for producing the left-NULL-complements. That is, if we have

  tables1 FULL OUTER JOIN tables2

and we have constructed a join prefix:

  some_prefix tables1 tables2

Then now we also build a prefix of

  some_prefix tables2

and add the join_prefix_cost(some_prefix tables2) - join_prefix_cost(some_prefix) to the cost of join prefix we're considering.

I don't see any easy way to save the join order for tables2 here. We can discard it after getting its cost and then we will rebuild it after we've picked the join order.

4.1.4 Plan refinement

Here we would need to create an extra "join nest" for the left-NULL-complements sub-plan.
It can work in a similar way semi-join-materialization nests.
There is a separate array of JOIN_TAB structures. An entry in the top-level JOIN_TAB array has a pointer to the child array which is invoked as necessary.

So, for the

 (t10 JOIN  t11 JOIN  t12) FULL OUTER JOIN (t20 JOIN t21 JOIN t2K)

the query plan might look like:

(t10 t11 t12) (t20 t21 t22)       <-- call this primary-join-order

                        (t22-t21-t20)

                                  ^--- call this null-compl-join-order

It is obvious that null-compl-join-order will need its own JOIN_TAB objects.
Will it need its own handler objects?

Current code does one-off handler object initalizations like

handler->rnd_init()
handler->index_init(indexnr) (may also also enable HA_EXTRA_KEYREAD for this specific index)
handler->idx_cond_push()
handler->rowid_filter_push()

Note that active index and so pushed index condition and rowid_filter may be different between primary-join-order and null-compl-join-order.
Considering this, null-compl-join-order will need its own handler objects? (The other option is to re-initialize handlers).

But join execution code reaches the handler object through TABLE pointer, tab->table->file. Should we re-point tab->table all the time?

4.1.4.1 Join buffer

Join

4.1.4.2 Other details

We should also take care that the output of t1 FULL OUTER JOIN t2 is not sorted by t1.key even if we use t1.key for retrieval.

4.1.5 Execution

The base join order has

  some_prefix tables1 tables2

after tables2, we put the concatenation of rowids for tables2 into a temporary table.

When we've finished join execution for tables2, we set tables1 to have NULL-complemented records and start executing the join order from the join nest.
Each generated record combination is checked in the temporary table. Those that have a match are discarded.

Other notes: Can join buffering work "across" the bounds of FULL OUTER JOIN nests? I think they can be disabled in the first version .

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

dplyr Combine Tables.PNG
2018-12-10 08:03
101 kB
Juan Telleria

Issue Links

blocks

MCOL-3303 Add FULL OUTER JOIN to MariaDB ColumnStore

Closed

MDEV-15041 Implement MERGE statement

Open

MDEV-20018 sql_mode="oracle" does not support FULL OUTER JOIN

Open

MDEV-34323 Oracle compatibility project 3

Open

is part of

MDEV-10872 Providing compatibility Oracle database

Open

relates to

MCOL-3303 Add FULL OUTER JOIN to MariaDB ColumnStore

Closed

(1 relates to)

MariaDB Server

Add FULL OUTER JOIN to MariaDB

Details

Description

Solution considerations

0. Parser

1. Conversion to LEFT/INNER join

2. FULL OUTER JOIN Execution

2.1 Approach #1: implement full outer join as an operation in executor

2.1.1 Extend MariaDB's OUTER JOIN code to compute FULL OUTER JOIN?

2.2 Approach #2: rewrite into two OUTER joins

2.2.1 How to check for non-matches if all columns are NULL-able

2.2.2 Non-symmetry

3. Choosing between rewrite and operator

4. FULL OUTER JOIN as an operator

4.1 Implementation plan

4.1.1 Before the optimizer

4.1.2 convert to left/inner joins

4.1.3 Join optimiation

4.1.4 Plan refinement

4.1.4.1 Join buffer

4.1.4.2 Other details

4.1.5 Execution

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration