[MDEV-35629] Hybrid Search (Text + Vector) w/ Simple Re-Ranking - Jira

Details

Type: Epic
Status: Open
Priority: Critical
Resolution: Unresolved
Fix Version/s: None
Component/s: Vector search
Labels:
None

Description

Feature Overview:
Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB server, using BM25 as an example algorithm. Reciprocal Rank Fusion (RRF) can be employed to merge results from vector search and text search, providing users with a unified ranking of results based on semantic and linguistic relevance.

Key Components
Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
Proposed Text Search Component:
Uses BM25 as an example for scoring keyword-based relevance.
BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: BM25 (https://en.wikipedia.org/wiki/Okapi_BM25)
Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, ensuring results that reflect both semantic understanding and precise keyword relevance. Ref: RRF from Microsoft (https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking)
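
As an illustration of the three BM25 ingredients named above (term frequency, inverse document frequency, length normalization), here is a minimal Python sketch. It is not MariaDB's implementation: the function name `bm25_scores`, the whitespace tokenization, and the parameter values k1=1.2 and b=0.75 (commonly used defaults) are assumptions made for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`.

    k1 and b are assumed defaults, not values taken from this issue.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the "+ 1.0" keeps it positive for common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # k1 saturates term frequency; b scales length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "mariadb vector index".split(),
    "full text search in mariadb".split(),
    "cooking pasta at home".split(),
]
scores = bm25_scores(["vector", "search"], docs)
```

In this toy corpus the first two documents each match one query term, and the shorter one scores higher because of length normalization; the third matches nothing and scores zero.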

How It Works:
Query Processing:

A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
The vector embedding is compared against the indexed vectors in the database.
The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.
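
To make the vector half of query processing concrete, the following Python sketch ranks rows by brute-force cosine distance to a query embedding. It is only an illustration: a vector index answers the same question approximately without scanning every row, and the names `cosine_distance` and `nearest`, as well as the choice of cosine distance, are assumptions for the example.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query_vec, rows, limit=10):
    """rows: iterable of (id, vector). Return ids ordered by distance."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))
    return [row_id for row_id, _ in ranked[:limit]]

rows = [(1, [1.0, 0.0]), (2, [0.0, 1.0]), (3, [0.7, 0.7])]
top = nearest([1.0, 0.1], rows, limit=2)
```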

Rank Fusion:

RRF assigns scores to results from both search methods.
Text chunks ranking highly in either or both methods are given priority in the final ranking.
The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.
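
The fusion step can be sketched in a few lines of Python. RRF scores a row as the sum over result lists of 1/(k + rank), where rank is the row's 1-based position in that list; rows missing from a list simply get no contribution from it. The function name `rrf_fuse` is hypothetical, and the constant k=60 is a commonly used default assumed here, not something this issue specifies.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, limit=10):
    """rankings: list of id lists, each ordered best-first.

    Returns (id, score) pairs ordered by descending fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return fused[:limit]

vector_hits = ["a", "b", "c"]   # best-first by vector distance
text_hits = ["c", "a", "d"]     # best-first by text relevance
fused = rrf_fuse([vector_hits, text_hits])
```

Rows "a" and "c" appear in both lists and therefore outrank "b" and "d", which each appear in only one.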

Benefits:

Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.

Attachments

Issue Links

includes

MDEV-35970 streaming window functions

Needs Feedback

relates to

MDEV-32887 vector search

Stalled

Activity


Adam Luciano created issue - 2024-12-11 17:56

Adam Luciano made changes - 2024-12-11 17:58

Field	Original Value	New Value
Description	Full description text, with the BM25 link markup missing its closing bracket: [BM25|https://en.wikipedia.org/wiki/Okapi_BM25	Same description text, with the link closed: [BM25|https://en.wikipedia.org/wiki/Okapi_BM25]

Adam Luciano made changes - 2024-12-11 18:19

Description

Rank Fusion, item 2 (the rest of the description is unchanged):

Original Value: Documents ranking highly in either or both methods are given priority in the final ranking.

New Value: Text chunks ranking highly in either or both methods are given priority in the final ranking.

Adam Luciano made changes - 2024-12-13 14:54

Due Date

2024-12-11

Sergei Golubchik added a comment - 2024-12-18 07:07

This can be expressed in SQL like

SELECT id, 1/RANK() OVER (ORDER BY VEC_DISTANCE(vec, ?)) + 1/RANK() OVER (ORDER BY MATCH (text) AGAINST (?) DESC) AS rrf
FROM t1 ORDER BY rrf DESC LIMIT 10

And this already works today.

The tricky part is making the optimizer use the vector index here. The algorithm could look roughly like this:

The steps below merge two streams of rows, each read in index order (by distance or by relevance).
Read the first N (say, 10) rows from both indexes and push them into a priority queue ordered by rrf descending.
Keep reading rows from the indexes and pushing them until the difference between the first two entries in the queue is more than 1/N.
Keep popping and returning rows from the queue until the difference between the first two entries is less than 1/N.
Repeat.
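
The idea of merging two index-ordered streams without reading them to the end can be sketched in Python. This is not the exact rule from the comment above (the 1/N difference test). It swaps in a threshold-style variant: a row is returned once its known score is at least the best score any other row, seen or unseen, could still reach. The name `stream_rrf` is hypothetical, and the sketch assumes each stream yields unique, non-None ids best-first with ranks 1, 2, 3, ... and scores rows as 1/rank_vec + 1/rank_text, as in the SQL.

```python
def stream_rrf(vec_stream, text_stream, limit):
    """Yield up to `limit` ids in descending RRF order, reading lazily."""
    streams = [iter(vec_stream), iter(text_stream)]
    ranks = [{}, {}]        # per stream: id -> rank seen so far
    depth = [0, 0]          # rows read from each stream
    done = [False, False]   # True once a stream is exhausted
    emitted = set()

    def next_gain(i):
        # Best contribution a not-yet-read row of stream i can still make.
        return 0.0 if done[i] else 1.0 / (depth[i] + 1)

    def lower(doc):
        # Score from the ranks already known for this row.
        return sum(1.0 / r[doc] for r in ranks if doc in r)

    def upper(doc):
        # Known score plus the most the missing streams could still add.
        return lower(doc) + sum(
            next_gain(i) for i in (0, 1) if doc not in ranks[i]
        )

    def read_more():
        # Advance every non-exhausted stream by one row.
        for i in (0, 1):
            if done[i]:
                continue
            doc = next(streams[i], None)
            if doc is None:
                done[i] = True
            else:
                depth[i] += 1
                ranks[i][doc] = depth[i]

    while len(emitted) < limit:
        candidates = (set(ranks[0]) | set(ranks[1])) - emitted
        if not candidates:
            if all(done):
                return
            read_more()
            continue
        best = max(candidates, key=lower)
        rivals = [upper(d) for d in candidates if d != best]
        # Rows not yet seen in either stream are rivals too.
        rivals.append(next_gain(0) + next_gain(1))
        if lower(best) >= max(rivals):
            emitted.add(best)
            yield best
        else:
            read_more()

top2 = list(stream_rrf(["a", "b", "c", "d"], ["c", "a", "d", "b"], limit=2))
```

In this example the top two rows are decided after reading only two rows from each stream, which is the property the optimizer would need in order to stop scanning the indexes early.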


Ralf Gebhardt made changes - 2025-01-10 14:17

Fix Version/s

11.9 [ 29945 ]

Sergei Golubchik made changes - 2025-01-28 13:29

Assignee

Sergei Golubchik [ serg ]

Sergei Petrunia [ psergey ]

Sergei Petrunia made changes - 2025-01-29 12:13

Link

This issue includes MDEV-35970 [ MDEV-35970 ]

Adam Luciano made changes - 2025-01-29 18:32

Priority

Critical [ 2 ]

Minor [ 4 ]

Sergei Golubchik added a comment - 2025-01-29 20:22

Until this is implemented, one can perform hybrid search by specifying an explicit limit, as in https://github.com/pgvector/pgvector-python/blob/master/examples/hybrid_search/rrf.py


Sergei Golubchik made changes - 2025-02-15 13:55

Link

This issue relates to MDEV-32887 [ MDEV-32887 ]

Sergei Golubchik made changes - 2025-02-15 13:57

Fix Version/s

12.0 [ 29945 ]

Adam Luciano made changes - 2025-02-24 14:06

Priority

Minor [ 4 ]

Major [ 3 ]

Adam Luciano made changes - 2025-03-11 18:29

Fix Version/s

12.1 [ 29992 ]

Ralf Gebhardt made changes - 2025-03-27 15:53

Priority

Major [ 3 ]

Critical [ 2 ]

Adam Luciano made changes - 2025-04-02 13:24

Issue Type

New Feature [ 2 ]

Epic [ 5 ]

Adam Luciano made changes - 2025-04-02 13:24

Fix Version/s

12.1 [ 29992 ]

People

Assignee: Sergei Petrunia

Reporter: Adam Luciano

Votes: 0

Watchers: 6

Dates

Created: 2024-12-11 17:56

Updated: 2025-04-02 13:24

