Details
-
New Feature
-
Status: Open (View Workflow)
-
Critical
-
Resolution: Unresolved
-
None
Description
Add support for FULL OUTER JOIN
https://www.w3schools.com/sql/sql_join_full.asp
One of the way how to implement is to re-write the query
select t1.*, t2.* from t1 full outer join t2 on P(t1,t2) |
into the following union all:
select t1.*, t2.* from t1 left outer join t2 on P(t1,t2) |
union all |
select t1.*,t2.* from t2 left outer join t1 on P(t1,t2) where t1.a is null |
Here t1.a is some non-nullable column of t1 (e.g. the column of single column primary key).
Solution considerations
contents
0. Parser
|
1. Conversion to LEFT/INNER join
|
2. FULL OUTER JOIN Execution
|
2.1 Approach #1: implement full outer join as an operation in executor
|
2.1.1 Extend MariaDB's OUTER JOIN code to compute FULL OUTER JOIN?
|
2.2 Approach #2: rewrite into two OUTER joins
|
2.2.1 How to check for non-matches if all columns are NULL-able
|
2.2.2 Non-symmetry
|
3. Choosing between rewrite and operator
|
4. FULL OUTER JOIN as an operator
|
4.1 Implementation plan
|
4.1.1 Before the optimizer
|
4.1.2 convert to left/inner joins
|
4.1.3 Join optimization
|
4.1.4 Plan refinement
|
4.1.5 Execution
|
0. Parser
We'll need to modify the parser and semantic analysis phases. This seems to be fairly straightforward?
1. Conversion to LEFT/INNER join
We will need code that analyzes/rewrites FULL OUTER JOIN into LEFT/RIGHT/INNER join based on NULL-rejecting predicates in the WHERE (or ON expressions that are outside the outer join in question).
2. FULL OUTER JOIN Execution
One can think of:
Approach #1: implement full outer join as an operation in executor.
Approach #2: rewrite it into two inner joins
2.1 Approach #1: implement full outer join as an operation in executor
This is what e.g. Cockroach and PostgreSQL seem to do.
FULL OUTER JOIN can be handled by Merge Join (or similar) operation which is symmetric wrt its inputs and/or can emit rows that didn't have matches.
(note: for PostgreSQL, I saw FULL OUTER JOIN to be handled by a variant of Hash join and Merge Join. PostgreSQL seems to require that FULL OUTER JOIN conditions have equalities. This error is produced
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join condition
|
when one doesn't have a join equality in the ON expression.)
2.1.1 Extend MariaDB's OUTER JOIN code to compute FULL OUTER JOIN?
Can't think of anything really efficient here.
We basically need to pick one side of the join to be "inner" and then compute an outer join but also keep track of which record combinations of the "inner" side were hit and which weren't?
All approaches seem to be quite expensive.
2.2 Approach #2: rewrite into two OUTER joins
In the query
SELECT |
...
|
t1 FULL OUTER JOIN t2 ON fjcond |
...
|
where t1 and t2 may be base tables or more complex expressions,
Rewrite the
t1 FULL OUTER JOIN t2 ON fjcond |
part into
(
|
select ... from t1 left join t2 on fjcond |
union all |
select ... from t2 left join t1 on fjcond where t1.pk IS NULL |
) as full_oj_tbl |
Here we assume that t1 has a not-null column "pk", so we can construct a condition meaning "outer join had no matches".
The "..." denotes all the needed columns.
2.2.1 How to check for non-matches if all columns are NULL-able
If we cannot find a column t1.pk such that we could add a t1.pk IS NULL, we can rewrite into
select ... from t2 left join ( select *, '1' as NULL_MARKER from t1) on fjcond where NULL_MARKER IS NULL |
(In the code, see Item_direct_view_ref::null_ref_table).
2.2.2 Non-symmetry
Note that the rewrite is not symmetric wrt t1 and t2. Does it matter? Well, one side will be able to benefit from outer join's Not-Exists optimization while the other won't?
3. Choosing between rewrite and operator
Suppose we have
left_table FULL OUTER JOIN right_table ON expr
|
Here both left_table and right_table can be nested joins.
w.l.o.g. suppose for inner join the best join order is left_table -> right_table.
when we compute FULL OUTER JOIN with some computation algorithm, we
- run left_table -> right_table
- also keep track of what in right_table we saw as the matches.
- then, run compute right_table and filter out the rows (or record combinations) that were a match and are already in the output (1).
When we do rewrite, we basically do
- run left_table -> right_table
- run right_table -> left_table with extra condition specifying we want non-matches (2).
It seems, the check (1) should normally be cheaper than (2).
(1) is a lookup to check a rowid value (or a record combination), while (2) can be much more expensive.
4. FULL OUTER JOIN as an operator
For
(t10 ... t11 ... t1N) FULL OUTER JOIN (t20... t21 ... t2K) |
The execution could be as follows:
Execute (t10 ... t11 ... t1N) LEFT OUTER JOIN (t20... t21 ... t2K) and note which record combinations of (t20, ... ,t2K) were in the output. This can be done for example by storing concat(t20.rowid, ... t2K.rowid) in a temporary table with a unique key.
When we're done, execute a join nest representing the best plan for t20 ... t2K (that is, assuming there's no join prefix of t10...t1N). Filter out record combinations that have been previously noted (e.g. by checking the concatenation of the rowids in the temp. table).
The above requires that there are two sub-plans produced for computing a join of (t20... t21 ... t2K) :
1. As a suffix of LEFT OUTER JOIN operation: (t10 ... t11 ... t1N) LEFT OUTER JOIN (t20... t21 ... t2K)
2. Without the prefix of t10...t1N, just as t20... t21 ... t2K.
The second part will need to have their own POSITION arrays and JOIN_TAB objects.
Will it also need TABLE or handler* objects?
4.1 Implementation plan
4.1.1 Before the optimizer
Make the parser accept FULL OUTER JOIN construct.
(I searched if this was done as part of GSoC projects but found nothing).
Handle name resolution (should be straighforward)
4.1.2 convert to left/inner joins
One needs to do two things in simplify_joins:
1. Currently, the code there uses this equivalence:
t1 JOIN (t2 LEFT JOIN t3 ON cond_t2_t3) = (t1 JOIN t2) LEFT JOIN t3 ON cond_t2_t3 |
It doesn't hold when LEFT JOIN is replaced with FULL OUTER JOIN.
Second, FULL OUTER join can be converted into LEFT, RIGHT, or INNER join based on NULL-rejecting predicates.
4.1.3 Join optimiation
Both sides of FULL OUTER JOIN follow the "no-interleaving" rule.
Join order can start with either side.
(TODO: can "unrelated" tables come between the left and the right side? In the current execution algorithm, they can't. This seems to be an okay limitation at the moment)
When we've put both sides into the join prefix, we also create a query plan for producing the left-NULL-complements. That is, if we have
tables1 FULL OUTER JOIN tables2
|
and we have constructed a join prefix:
some_prefix tables1 tables2
|
Then now we also build a prefix of
some_prefix tables2
|
and add the join_prefix_cost(some_prefix tables2) - join_prefix_cost(some_prefix) to the cost of join prefix we're considering.
I don't see any easy way to save the join order for tables2 here. We can discard it after getting its cost and then we will rebuild it after we've picked the join order.
4.1.4 Plan refinement
Here we would need to create an extra "join nest" for the left-NULL-complements sub-plan.
It can work in a similar way semi-join-materialization nests.
There is a separate array of JOIN_TAB structures. An entry in the top-level JOIN_TAB array has a pointer to the child array which is invoked as necessary.
So, for the
(t10 JOIN t11 JOIN t12) FULL OUTER JOIN (t20 JOIN t21 JOIN t2K) |
the query plan might look like:
(t10 t11 t12) (t20 t21 t22) <-- call this primary-join-order
|
|
|
(t22-t21-t20)
|
^--- call this null-compl-join-order
|
It is obvious that null-compl-join-order will need its own JOIN_TAB objects.
Will it need its own handler objects?
Current code does one-off handler object initalizations like
- handler->rnd_init()
- handler->index_init(indexnr) (may also also enable HA_EXTRA_KEYREAD for this specific index)
- handler->idx_cond_push()
- handler->rowid_filter_push()
Note that active index and so pushed index condition and rowid_filter may be different between primary-join-order and null-compl-join-order.
Considering this, null-compl-join-order will need its own handler objects? (The other option is to re-initialize handlers).
But join execution code reaches the handler object through TABLE pointer, tab->table->file. Should we re-point tab->table all the time?
4.1.4.1 Join buffer
Join
4.1.4.2 Other details
We should also take care that the output of t1 FULL OUTER JOIN t2 is not sorted by t1.key even if we use t1.key for retrieval.
4.1.5 Execution
The base join order has
some_prefix tables1 tables2
|
after tables2, we put the concatenation of rowids for tables2 into a temporary table.
When we've finished join execution for tables2, we set tables1 to have NULL-complemented records and start executing the join order from the join nest.
Each generated record combination is checked in the temporary table. Those that have a match are discarded.
Other notes: Can join buffering work "across" the bounds of FULL OUTER JOIN nests? I think they can be disabled in the first version .
Attachments
Issue Links
- blocks
-
MCOL-3303 Add FULL OUTER JOIN to MariaDB ColumnStore
-
- Closed
-
-
MDEV-15041 Implement MERGE statement
-
- Open
-
-
MDEV-20018 sql_mode="oracle" does not support FULL OUTER JOIN
-
- Open
-
-
MDEV-34323 Oracle compatibility project 3
-
- Open
-
- is part of
-
MDEV-10872 Providing compatibility Oracle database
-
- Open
-
- relates to
-
MCOL-3303 Add FULL OUTER JOIN to MariaDB ColumnStore
-
- Closed
-