Details

    Description

      Study of the approach with window functions.

      Contents
        The task
        Search algorithm draft
        Notes about the algorithm
        A case of non-matches
          Limiting look-ahead when the user doesn't need the rrf value.
      

      The task

      So, we're looking to efficiently compute a query of this form:

      SELECT 
        id, 
        1/(60 + RANK() OVER (ORDER BY VEC_DISTANCE(vec, x'03401272819837e'))) 
        + 
        1/(60 + RANK() OVER (ORDER BY MATCH (text) AGAINST ("red tshirt with fun text") DESC))
        AS rrf
      FROM
        t1
      ORDER BY rrf DESC 
      LIMIT 10
      

      (TODO: is this RANK(), where peers get equal ranks, or ROW_NUMBER()? This is not important at this point.)
      (TODO: does the user actually need the value of rrf, or is it only there for ORDER BY? A: no, the value of rrf is only used for ordering.)
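
      As a sanity check, the RRF computation this query performs can be sketched in Python. The rowid lists below are made up for illustration; k=60 as in the query:

```python
def rrf_scores(vector_ranking, fulltext_ranking, k=60):
    """Reciprocal Rank Fusion: each source contributes 1/(k + rank),
    where rank is the 1-based position in that source's ordering."""
    scores = {}
    for ranking in (vector_ranking, fulltext_ranking):
        for rank, rowid in enumerate(ranking, start=1):
            scores[rowid] = scores.get(rowid, 0.0) + 1.0 / (k + rank)
    # ORDER BY rrf DESC
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Made-up rowid lists, best match first:
by_vector = ["r1", "r2", "r3"]
by_fulltext = ["r2", "r3", "r4"]
print(rrf_scores(by_vector, by_fulltext)[:2])
```

      A row found by both searches (like r2) outranks rows found by only one, which is what the query's sum of two 1/(60+RANK(...)) terms produces.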

      We can construct index scans that return rows ordered by each of the ORDER BY criteria.
      Can we construct a query plan that takes advantage of the LIMIT and scans fewer rows?

      Search algorithm draft

      So, we have

         stream1 produces (value1,rowid) ordered by value1 desc
         stream2 produces (value2,rowid) ordered by value2 desc
      

      and we want to produce the first $LIMIT rows ordered by

          value1 + value2   DESC
      

      Do as follows:

      const N= <pick some reasonable lookahead size>;
      Priority_queue priority_queue(size=N);
      start:
      get the next N rows from stream1 into stream_buf1;
      get the next N rows from stream2 into stream_buf2;
      if (stream_buf1.is_empty() && stream_buf2.is_empty())
        return;
       
      foreach row R1 in stream_buf1 {
        if (R2= stream_buf2.find_row(R1.rowid)) {
          // Ok, {R1,R2} is a matching combination.
          priority_queue.put({ R1, R2 });
          stream_buf1.delete_row(R1);
          stream_buf2.delete_row(R2);
        }
      }
      

      Ok, priority_queue has matching record combinations. Can we emit them already?
      Consider a row combination {R1, R2} that's in the priority_queue.

      Let's take the "most competitive" value from stream_buf1:

      R1min= stream_buf1.first_value1();
      

      We know it doesn't have a match in stream_buf2 (we would have removed it if it did).
      Its match is "worse" than anything in stream_buf2, and the "worst" value there is stream_buf2.last_value2().
      So this potential competitor's combined score is below stream_buf1.first_value1() + stream_buf2.last_value2(), and {R1, R2} beats it if:

      R1.value1 + R2.value2 >= stream_buf1.first_value1() + stream_buf2.last_value2()
      

      There is a symmetric potential competitor: an unmatched row in stream_buf2 whose match is still to come from stream1. Beating it requires:

      R1.value1 + R2.value2 >= stream_buf1.last_value1() + stream_buf2.first_value2()
      

      Both conditions must hold, so we can do this:

      auto bound=
        MAX(stream_buf1.first_value1() + stream_buf2.last_value2(),
            stream_buf1.last_value1() + stream_buf2.first_value2());
      // Emit from the PQ (best combined score first) everything at or above bound
      while ( {R1, R2} = priority_queue.top_element()) { 
        if (R1.value1 + R2.value2 >= bound) {
          priority_queue.remove_top();
          emit_row({R1, R2});
        } else
          break;
      }
      

      At this point, we may have:

      • something left in the priority_queue
      • some "singles" in stream_buf1 and stream_buf2 that haven't got their match, yet.
        We should read more rows from the input and continue the process, so we:

      goto start;
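
      The loop above can be sketched in runnable form. This is a minimal Python model of the draft, assuming both streams are finite lists of (value, rowid) pairs sorted by value DESC; like the draft, it only emits rowids that appear in both streams, and the names are illustrative:

```python
import heapq

def merged_top(stream1, stream2, limit, n=4):
    """Buffered merge of two (value, rowid) streams, each sorted by
    value DESC; emits rowids ordered by value1 + value2 DESC."""
    buf1, buf2 = {}, {}          # unmatched rows seen so far: rowid -> value
    pq = []                      # max-heap of matches: (-combined_score, rowid)
    it1, it2 = iter(stream1), iter(stream2)
    last1 = last2 = None         # worst (= most recent) value read per stream
    done1 = done2 = False
    out = []

    def fill(it, mine, other):
        """Read up to n rows, pairing rowids across the two buffers."""
        last, done = None, False
        for _ in range(n):
            row = next(it, None)
            if row is None:
                done = True
                break
            value, rowid = row
            last = value
            if rowid in other:   # found this row's other half
                heapq.heappush(pq, (-(value + other.pop(rowid)), rowid))
            else:
                mine[rowid] = value
        return last, done

    while len(out) < limit:
        l1, d1 = fill(it1, buf1, buf2)
        l2, d2 = fill(it2, buf2, buf1)
        last1 = l1 if l1 is not None else last1
        last2 = l2 if l2 is not None else last2
        done1, done2 = done1 or d1, done2 or d2
        # Upper bound on the score of any combination not yet in pq:
        bound = float("-inf")
        if buf1 and not done2:            # single in buf1 + future stream2 row
            bound = max(bound, max(buf1.values()) + last2)
        if buf2 and not done1:            # single in buf2 + future stream1 row
            bound = max(bound, max(buf2.values()) + last1)
        if not done1 and not done2:       # both halves still unseen
            bound = max(bound, last1 + last2)
        # Emit matches that no future combination can beat.
        while pq and len(out) < limit and -pq[0][0] >= bound:
            out.append(heapq.heappop(pq)[1])
        if done1 and done2 and not pq:
            break
    return out
```

      The bound mirrors the emit conditions above: a matched pair is safe to output once its combined score is at least the best score any unmatched or not-yet-seen row could still achieve.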
      

      Notes about the algorithm

      There's no limit on how large stream_buf1 and stream_buf2 can get.
      If we're unlucky and the searches return rows with few or no overlaps, we can end up with stream_buf1.size() + stream_buf2.size() = rows_in_the_table.

      A case of non-matches

      A problem with this SQL form is that we must find the rank() for both parts of the rrf sum.
      Suppose we scan 1000 rows in the vector index and get the rowids of the 1000 nearest rows:

      +------------------+-------+
      | 1/(60+RANK(...)) | rowid |
      +------------------+-------+
      |           0.0164 | rw1   |
      |           0.0161 | rw2   |
      |           0.0159 | rw3   |
      |           0.0156 | ...   |
      |           0.0154 |       |
      |           0.0152 |       |
      |           0.0149 |       |
      |           0.0147 |       |
      |           0.0145 |       |
         ...                     
         ...                     
      |           0.0009 |       |
      +------------------+-------+
      

      and get the same from the fulltext index: rowids of the rows with the 1000 best fulltext matches.
      Suppose these have no overlap at all.
      Then, do we really want to continue searching to find matches for the best rows?
      The best row has a vector score of 0.0164.
      We know its fulltext score is less than 0.0009, that is, it will change the total by less than 5.5 %.
      Maybe we should just take the LIMIT=10 rows with the best partial scores we've seen so far and return them?
      The search is approximate anyhow.

      Limiting look-ahead when the user doesn't need the rrf value.

      In case of RRF function

      1 / (60 + rank() over (...))  +  1 / (60 + rank() over (...))
      

      the rank contributions from both "sources" are:

      R(1) = 1/60
      R(2) = 1/61
      R(3) = 1/62
      ... 
      

      If we only get rows without matches, how far do we need to look
      ahead before we know that the row with R(1) is ordered first, regardless
      of the ranks in the other "source"? The runner-up scores at most
      1/61 + 1/(60+n) after n rows of look-ahead, so we need 1/(60+n) <= 1/60 - 1/61, i.e.

      1 / (1/60 - 1/61) - 60 = 3660 - 60 = 3600 rows.
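
      This look-ahead depth can be checked with exact arithmetic (assuming the R(n) = 1/(59+n) numbering used above):

```python
from fractions import Fraction

k = 60
# Score gap between the best and the second-best row from one source:
gap = Fraction(1, k) - Fraction(1, k + 1)   # 1/60 - 1/61 = 1/3660
# Look-ahead depth n at which an unseen match can no longer close the
# gap, i.e. the smallest n with 1/(k + n) <= gap:
n = 1 / gap - k
print(gap, n)
```
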
      

      Activity

            Putting it all together (and changing LIMIT to 10):

            WITH 
            vector_limit_search as (
              SELECT title, VEC_DISTANCE_EUCLIDEAN(embedding, @search_term_vector) AS dist
              FROM products
              ORDER BY dist 
              LIMIT 10
            ),
            fulltext_limit_search as (
              SELECT 
                title, MATCH(title) AGAINST (@search_term) as match_score  
              FROM products 
              WHERE MATCH(title) AGAINST (@search_term) 
              ORDER BY match_score LIMIT 10
            ),
             
            vector_score as (
                SELECT
                    title,
                    RANK() OVER (ORDER BY dist) AS relevance
                FROM vector_limit_search
            ),
            fulltext_score AS (
                SELECT
                    title,
                    RANK() OVER (ORDER BY match_score) AS relevance
                FROM fulltext_limit_search
            )
            SELECT
                v.title,
                (1 / (@k + v.relevance)) + (1 / (@k + f.relevance)) AS rrf
            FROM vector_score v
            JOIN fulltext_score f USING (title)
            ORDER BY rrf DESC
            LIMIT 2;
            

            finishes in 0.001 second for me. We probably need a full outer join here (or something like that). The speed will be similar, though.

            psergei Sergei Petrunia added a comment
            psergei Sergei Petrunia added a comment - edited

            alejandro-duarte does this achieve what you intended to do?

            psergei Sergei Petrunia added a comment - edited

            Now, I accept that this SQL might be too convoluted to expect our users to write. We could improve the optimizer to allow less convoluted SQL.
            For example

            • detect that "ORDER BY ascending_function(X)" can be satisfied by a query plan producing rows ordered by X.
            • Allow "streamed" window function computation. (Currently, window function computation first materializes the entire SELECT output in a temporary table.)
              • Then, detect that if a query only uses "ascending" window functions without a PARTITION BY clause and has ORDER BY window_func LIMIT n:

                  SELECT  RANK() OVER (ORDER BY col) as rf  FROM ... ORDER BY rf LIMIT n
                

                then we can short-cut execution to enumerate only n rows.
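
            The short-cut in the last bullet can be illustrated with a toy streamed RANK() in Python: assuming the input is already sorted by the ORDER BY column, ranks can be produced on the fly and the scan stopped after n rows, with no temporary table:

```python
def streamed_rank(ordered_values, limit):
    """RANK() over input that is already sorted by the ORDER BY column:
    ranks are emitted on the fly and the scan stops after `limit` rows,
    so no temporary-table materialization is needed."""
    rank = prev = None
    for pos, value in enumerate(ordered_values, start=1):
        if value != prev:        # peers (equal values) share a rank
            rank, prev = pos, value
        yield value, rank
        if pos == limit:
            return

print(list(streamed_rank([3.1, 3.1, 2.7, 2.5, 2.5], limit=4)))
# → [(3.1, 1), (3.1, 1), (2.7, 3), (2.5, 4)]
```
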

            Also, we may need Full Outer Join (MDEV-13648).
            In case we don't have full outer join, we could do MDEV-36539.

            Some of these features may require a lot of code changes so we need to really commit to doing them.

            adamluciano Adam Luciano added a comment -

            Would it be possible to try with this dataset / approach - it was designed for the hybrid search case - https://github.com/snap-stanford/STaRK ?

            psergei Sergei Petrunia added a comment - edited

            I suppose that if a certain row X is not found by the fulltext or vector search, we need to
            count its contribution to the score, that is, its

                1 / (@k + relevance)
            

            as zero. There can be rows that the vector search finds highly relevant while fulltext finds them
            irrelevant (and vice versa).

            So, instead of requiring that both searches find a row and using INNER JOIN, let's emulate FULL OUTER JOIN. Using one more CTE, I get:

            WITH 
            vector_limit_search as (
              SELECT id, title, VEC_DISTANCE_EUCLIDEAN(embedding, @search_term_vector) AS dist
              FROM products
              ORDER BY dist 
              LIMIT 10
            ),
            fulltext_limit_search as (
              SELECT 
                id, title, MATCH(title) AGAINST (@search_term) as match_score  
              FROM products 
              WHERE MATCH(title) AGAINST (@search_term) 
              ORDER BY match_score LIMIT 10
            ),
             
            vector_score as (
                SELECT
                    id, title,
                    1 / (@k + RANK() OVER (ORDER BY dist)) as partial_rrf
                FROM vector_limit_search
            ),
            fulltext_score AS (
                SELECT
                    id, title,
                    1 / (@k + RANK() OVER (ORDER BY match_score)) AS partial_rrf
                FROM fulltext_limit_search
            ),
            full_outer_join_output as (
              SELECT
                  v.id, v.title,
                  (v.partial_rrf + NVL(f.partial_rrf, 0)) AS rrf
              FROM 
                vector_score v LEFT JOIN fulltext_score f USING (id)
              UNION
              SELECT
                  f.id, f.title,
                  (NVL(v.partial_rrf, 0) + f.partial_rrf) AS rrf
              FROM
                fulltext_score f LEFT JOIN vector_score v USING (id)
            )
            select * 
            from full_outer_join_output
            ORDER BY rrf DESC
            LIMIT 2;
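
            The intent of the emulated FULL OUTER JOIN can be modeled in a few lines of Python. The {id: partial_rrf} dicts are hypothetical stand-ins for the two score CTEs; a row missing from one search contributes 0 from that side:

```python
def full_outer_rrf(vector_partial, fulltext_partial, limit):
    """Emulated FULL OUTER JOIN over two partial-score maps: combine
    scores per rowid, treating a missing side as 0."""
    rrf = {}
    for rowid in vector_partial.keys() | fulltext_partial.keys():
        rrf[rowid] = (vector_partial.get(rowid, 0.0)
                      + fulltext_partial.get(rowid, 0.0))
    # ORDER BY rrf DESC LIMIT <limit>
    return sorted(rrf.items(), key=lambda kv: -kv[1])[:limit]

# Hypothetical partial_rrf values, as the two CTEs would produce:
vec = {"p1": 1 / 61, "p2": 1 / 62}
ft = {"p2": 1 / 61, "p3": 1 / 62}
print(full_outer_rrf(vec, ft, limit=2))
```
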
            


            People

              adamluciano Adam Luciano
              psergei Sergei Petrunia
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.