Details
- Type: Epic
- Status: Open
- Priority: Critical
- Resolution: Unresolved
Description
Feature Overview:
Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB server, using BM25 as an example algorithm. Reciprocal Rank Fusion (RRF) can be employed to merge results from vector search and text search, providing users with a unified ranking of results based on semantic and linguistic relevance.
Key Components
Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
Proposed Text Search Component:
Uses BM25 as an example for scoring keyword-based relevance.
BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: BM25 (https://en.wikipedia.org/wiki/Okapi_BM25)
Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, ensuring results that reflect both semantic understanding and precise keyword relevance. Ref: RRF from Microsoft (https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking)
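To make the BM25 component concrete, here is a minimal Python sketch of Okapi BM25 scoring over pre-tokenized documents. It is an illustration only, not a proposed server implementation: the function name `bm25_scores`, the sample documents, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` (Okapi BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Inverse document frequency (the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical sample data, for illustration only.
docs = [
    "mariadb vector index search".split(),
    "full text search with keyword ranking".split(),
    "window functions in sql".split(),
]
scores = bm25_scores(["keyword", "search"], docs)
```

The second document matches both query terms and scores highest. The third matches neither and scores zero.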
How It Works:
Query Processing:
- A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
- The vector embedding is compared against the indexed vectors in the database.
- The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.
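The vector side of query processing can be sketched as a ranking by cosine distance. This sketch assumes a brute-force exact scan over a small in-memory dictionary, whereas a real vector index would typically use an approximate nearest-neighbor structure. The names `cosine_distance` and `nearest`, and the sample vectors, are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest(query_vec, indexed, top_k=2):
    """Return the ids of the `top_k` stored vectors closest to `query_vec`."""
    ranked = sorted(indexed, key=lambda doc_id: cosine_distance(query_vec, indexed[doc_id]))
    return ranked[:top_k]

# Hypothetical embeddings, for illustration only.
indexed = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
top = nearest([0.9, 0.1, 0.0], indexed)
```

The resulting list of ids is the vector-search ranking that is later fused with the keyword ranking.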
Rank Fusion:
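- RRF scores each result by its rank in each search method: the score is the sum of 1 / (k + rank) over the result lists in which it appears, where k is a smoothing constant (60 is a common choice).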
- Text chunks ranking highly in either or both methods are given priority in the final ranking.
- The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.
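The rank fusion steps above can be sketched in a few lines of Python. This is a minimal illustration of Reciprocal Rank Fusion, assuming each search method returns an ordered list of document ids. The function name `rrf_fuse`, the sample rankings, and the default `k=60` are assumptions for the example, not part of the proposal.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists into one list using Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        # Ranks are 1-based: the top result contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings from the two search methods.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
text_ranking = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([vector_ranking, text_ranking])
```

Here `doc_b` and `doc_a` appear in both lists and therefore rank ahead of `doc_d` and `doc_c`, which appear in only one. Because RRF uses only ranks, the BM25 scores and the vector distances never need to be normalized to a common scale.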
Benefits:
- Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
- User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
- Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.
Attachments
Issue Links
- includes MDEV-35970 streaming window functions (Needs Feedback)
- relates to MDEV-32887 vector search (Stalled)
Activity
Field | Original Value | New Value |
---|---|---|
Due Date | 2024-12-11 |
Fix Version/s | 11.9 [ 29945 ] |
Assignee | Sergei Golubchik [ serg ] | Sergei Petrunia [ psergey ] |
Link | This issue includes MDEV-35970 [ MDEV-35970 ] |
Priority | Critical [ 2 ] | Minor [ 4 ] |
Link | This issue relates to MDEV-32887 [ MDEV-32887 ] |
Fix Version/s | 12.0 [ 29945 ] |
Priority | Minor [ 4 ] | Major [ 3 ] |
Fix Version/s | 12.1 [ 29992 ] |
Priority | Major [ 3 ] | Critical [ 2 ] |
Issue Type | New Feature [ 2 ] | Epic [ 5 ] |
Fix Version/s | 12.1 [ 29992 ] |
This can be expressed in SQL like
And it already works today.
The tricky part is getting the optimizer to use the vector index here. The algorithm could look roughly like this: