[MDEV-16502] Join pushdown to something like an IN filter for different storage engine Created: 2018-06-16  Updated: 2021-10-25

Status: Open
Project: MariaDB Server
Component/s: Optimizer, Plugins
Fix Version/s: None

Type: Task Priority: Major
Reporter: Andrew Hutchings (Inactive) Assignee: Igor Babaev
Resolution: Unresolved Votes: 0
Labels: smart_engine

Issue Links:
PartOf
is part of MDEV-22366 Changes in optimizer to allow better ... Open
Relates
relates to MDEV-9062 join pushdown to storage engines Open

 Description   

Extension of the join pushdown in MDEV-9062 we need a way to pushdown data from small tables in a join when pushdown join is used.

Example:
InnoDB table with 100 rows and 4 ColumnStore tables in a join. This feature should convert the 100 rows (or filtered results of the 100 rows) into something like an IN filter to pushdown the InnoDB results into ColumnStore.

Customers often use InnoDB tables as pivot tables in AX which have very small number of rows so this is a common use case.

There of course should be a threshold on this (a sysvar?), we don't know at what point this will become too expensive.

I'm guessing an optimizer flag will be needed to separate to pushdown join but only allowed to be turned on when pushdown join is turned on.



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2018-06-16 ]

Additional clarification: this should only happen if the storage engine is not the one in the join push down. In my example it should not do this if the ColumnStore table only has say 100 rows.

Comment by Michael Widenius [ 2018-06-16 ]

I think that converting things to an IN is the wrong approach and not something we should do (unless all other approaches fails).
First we need to decide what kind of queries we need to optimize and then decide what is the right approach to solve a majority of them. The IN approach will not work in a majority of cases and there are better ways to do it.

Problems:

  • We already have a multi-range-read interface that is superior to using IN (if it can be used). Without knowing the
    type of queries we want to solve in the first release, it's hard to say.
  • IN will not work for anything else than the simplest queries.
  • It will require a lot of changes in the optimizer to first find all keys that should be in the 'in' and then
    again read the rows to verify that the data is correct (as an IN will not be able to solve other parts of the WHERE)

A better, more general solution that would help both CS, Spider and KRAC would be to instead support
a push down of a full temporary table to a storage engine (A temporary table is better than the original 'small table' as we can remove not needed fields and even join several internal tables into one). This should solve most issues with normal queries, including complex WHERE clauses. (sub queries is still a problem, but we could also consider pushing down another table for these if needed).

Comment by Andrew Hutchings (Inactive) [ 2018-06-16 ]

I agree a more general solution should be created. I meant "something like IN". I don't know what the best implementation of this would be.

Comment by Gregory Dorman (Inactive) [ 2020-09-24 ]

As written this is way too narrow (also appears to be applicable to ROW INTERFACE only, not when Select Handlers are involved). Presently, cross engine join in ColumnStore is a hack with endless problems in the field (security mainly, but not only). CS gets the entire query, connects back to server (using some ugly acrobatics with credentials), and runs select from InnoDB (applicable predicates applied). The result is fetched, hashed and processed, from here the same way as for any normal CS table (hash is the only join method in CS). Mathematically speaking, this is a poor man equivalent of Monty's thesis (just give the engine the small table as if it was his), except that the small table is being pulled by CS rather than pushed by server. Worthy of note is that CS sometimes does flip a small table part of join graph into IN clause (I am not sure if this ever happens in cross engine situations, though).

The real and hugely valuable change would be to make server a true manager of federated queries when Smart Engines are in play. This would benefit XPAND equally (or more). This could be 100% covered by MDEV-22366.

  • if all tables in the query are from the same smart engine - select handler takes the whole thing (if compatible syntax-wise).
  • otherwise, a recursive descent through the query tree would break up the query into subtrees each dealing with one engine only. It is important that this should be recursive descent (top down, not bottom up, as what's happening now to a limited extent).
  • these "homogenous" query fragments would be pushed down to each engine by Derived Handler (or managed by ROW INTERFACE, depending on engine's "smartness").
    • Logically I am treating the plain table join as though it was a derived table, for simplicity. SELECT FROM eng1,eng2 WHERE <things> would be treated the same as SELECT FROM (select from eng1) q1, (select from eng2) q2 WHERE <things>. Engine 1 and Engine 2 would receive full selects with as much of predicate pushing as is possible.
    • nested loops should probably be avoided, except in some specific cases.
  • Monty's proposition to materialize small result in other engine's temp table can be applied on top of it too, provided one can figure out the smallness.

My further theory is that if 22366 is done, with possible addon of temp table materializations in foreign engines in some cases, it will swallow 9062 also, leaving us with essentially one project covering main needs of both smart engines (XPAND and CS).

Ralf is correct about both this and 9062 being a part of 22366.

Generated at Thu Feb 08 08:29:24 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.