[MDEV-9062] join pushdown to storage engines - Jira

Details

Type: New Feature
Status: Open (View Workflow)
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Optimizer, Plugins
Labels:
- smart_engine

Description

Make it possible for a storage engine to take any query fragment to be executed internally in the engine. In particular, it should be possible to push joins, GROUP BY, ORDER BY, WHERE, HAVING.

Assorted thoughts:

this is step 1, only do simple cases, see below.
only push down joins when all tables in a query are in the same storage engine
interface to the server: replace all that with a pseudo-table? with a handler?
PS/SP reexecution — restore or keep? probably, restore. easier for the engine. and we rerun optimizer anyway
optimizer: no changes, in step 1 there is no cost estimation for pushed down joins.
EXPLAIN:
| 1 | t1 | ... | pushed down
| 2 | t2 | ... | pushed down
| 3 | <PUSHDOWN RESULT> | ... | for 1 and 2
UNION — don't, every UNION part is handled separately
derived tables, views: decide to materialize or merge, as usual. if merged — push together with the upper join, if materialized — push separately
subqueries: if materialized — push separately, if semijoin — push with the upper join, otherwise — don't.
partitioning: not affected
use FederatedX as a PoC

Attachments

Issue Links

is blocked by

MDEV-17096 Pushdown of simple derived tables to storage engines

Closed

is part of

MDEV-22366 Changes in optimizer to allow better operation of Smart Engine

Open

relates to

MDEV-16502 Join pushdown to something like an IN filter for different storage engine

Open

Activity

Ascending order - Click to sort in descending order

Igor Babaev (Inactive) added a comment - 2018-06-22 07:07 - edited

Here's a suggested scheme of the implementation of this feature.

Let Q be an n-way join query of tables such that some of tables belong to the engines that can execute join operations SELECT ... FROM t[0],...,t[n] WHERE P(t[0],...,t[n]).
Consider a possible join order t[i[0]],...,t[i[n]] and let T[0],...T[k] be partitioning of this join order such that any
partition contains only tables from the same engine and no neighbors has tables from the same engine.
Let's also assume that for any two neighbors only one belongs to the engine that can execute join by its own means while the other executes joins by MariaDB SQL processor. Let's assume that T[1] belongs to the engine that can execute join by its own means. Then any odd partition also belongs to such an engine while joins of any even partition are executed by SQL processor.
Let partition T[j] consists of tables t[j][0],...t[j][lj]
It's obvious that query Q is equivalent to the following query

SELECT ... FROM

   (SELECT (...

      (SELECT ...FROM

          (SELECT ... FROM t[0][0],...,t[0][l0]) AS dt[1],

          t[1][0],...,t[1][l1]) AS dt[2] ...) AS dtk, t[k][0],...,t[k][lk]

 WHERE P(t[0],...,t[n]).

In this query first we define a derived table dt[1] that joins tables from T[0]. Then we define the derived table dt[2] that joins dt[1] and tables from T[1]. The derived table dt[j+1] joins the derived table dt[j] with tables from T[j].

Igor Babaev (Inactive) added a comment - 2018-06-22 07:07 - edited Here's a suggested scheme of the implementation of this feature. Let Q be an n-way join query of tables such that some of tables belong to the engines that can execute join operations SELECT ... FROM t [0] ,...,t [n] WHERE P(t [0] ,...,t [n] ). Consider a possible join order t[i [0] ],...,t[i [n] ] and let T [0] ,...T [k] be partitioning of this join order such that any partition contains only tables from the same engine and no neighbors has tables from the same engine. Let's also assume that for any two neighbors only one belongs to the engine that can execute join by its own means while the other executes joins by MariaDB SQL processor. Let's assume that T [1] belongs to the engine that can execute join by its own means. Then any odd partition also belongs to such an engine while joins of any even partition are executed by SQL processor. Let partition T [j] consists of tables t [j] [0] ,...t [j] [lj] It's obvious that query Q is equivalent to the following query SELECT ... FROM (SELECT (... (SELECT ...FROM (SELECT ... FROM t[0][0],...,t[0][l0]) AS dt[1], t[1][0],...,t[1][l1]) AS dt[2] ...) AS dtk, t[k][0],...,t[k][lk] WHERE P(t[0],...,t[n]). In this query first we define a derived table dt [1] that joins tables from T [0] . Then we define the derived table dt [2] that joins dt [1] and tables from T [1] . The derived table dt [j+1] joins the derived table dt [j] with tables from T [j] .

Igor Babaev (Inactive) added a comment - 2018-08-01 02:43

Another scheme just encapsulate tables from odd partitions into derived tables and pushes conditions
into these derived tables

SELECT ... FROM

...

(SELECT ... FROM t[k][0],...,t[k][lk] WHERE P(t[k][0],...,t[k][lk])) as dtk

...

WHERE P(t[0],...,t[n])

Here P(t[k][0],...,t[k][lk])) is the condition over t[k][0],...,t[k][lk] that can be extracted from P(t[0],...,t[n]).

With this scheme we rather push into the engine SELECTs that specify derived tables.
This well fits into the scheme how SELECT queries are handled by ColumnStore.

Igor Babaev (Inactive) added a comment - 2018-08-01 02:43 Another scheme just encapsulate tables from odd partitions into derived tables and pushes conditions into these derived tables SELECT ... FROM ... (SELECT ... FROM t[k][0],...,t[k][lk] WHERE P(t[k][0],...,t[k][lk])) as dtk ... WHERE P(t[0],...,t[n]) Here P(t [k] [0] ,...,t [k] [lk] )) is the condition over t [k] [0] ,...,t [k] [lk] that can be extracted from P(t [0] ,...,t [n] ). With this scheme we rather push into the engine SELECTs that specify derived tables. This well fits into the scheme how SELECT queries are handled by ColumnStore.

Igor Babaev (Inactive) added a comment - 2018-08-01 15:02 - edited

Let's call a derived table External Derived Table (EDT) if it is specified by a SELECT over tables belonging to the same external database engine capable to execute SQL operations.
If an EDT is used in a query we always consider it as a materialized derived table whose materialization is performed by an external engine.

Igor Babaev (Inactive) added a comment - 2018-08-01 15:02 - edited Let's call a derived table External Derived Table (EDT) if it is specified by a SELECT over tables belonging to the same external database engine capable to execute SQL operations. If an EDT is used in a query we always consider it as a materialized derived table whose materialization is performed by an external engine.

Igor Babaev (Inactive) added a comment - 2018-08-08 03:01

Some observations
1. Rows of a EDT for ColumnStore engine could be produced in the same way as the result set for a select query.
2. To process EDT the server needs a special function to fill the EDT that would use the iterator function over the rows of EDT
3. The optimizer has to know the cardinality of EDT. This cardinality either calculated by the external engine or is calculated by the optimizer itself employing the statistical data on the tables and columns referenced in the specification of EDT.

Igor Babaev (Inactive) added a comment - 2018-08-08 03:01 Some observations 1. Rows of a EDT for ColumnStore engine could be produced in the same way as the result set for a select query. 2. To process EDT the server needs a special function to fill the EDT that would use the iterator function over the rows of EDT 3. The optimizer has to know the cardinality of EDT. This cardinality either calculated by the external engine or is calculated by the optimizer itself employing the statistical data on the tables and columns referenced in the specification of EDT.

Igor Babaev (Inactive) added a comment - 2018-08-29 21:10 - edited

I spinned off ~~MDEV-17096~~ "Pushdown of derived tables to storage engines".
This is the first step to resolve of the problem of pushing joins into storage engines.
This is probably the only task that realistically can be done within 10.4.

Igor Babaev (Inactive) added a comment - 2018-08-29 21:10 - edited I spinned off MDEV-17096 "Pushdown of derived tables to storage engines". This is the first step to resolve of the problem of pushing joins into storage engines. This is probably the only task that realistically can be done within 10.4.

MariaDB Server

join pushdown to storage engines

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration