Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35629

Hybrid Search (Text + Vector) w/ Simple Re-Ranking

Details

    • Epic
    • Status: Open (View Workflow)
    • Critical
    • Resolution: Unresolved
    • None
    • Vector search
    • None

    Description

      Feature Overview:
      Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB server, using BM25 as an example algorithm. Reciprocal Rank Fusion (RRF) can be employed to merge results from vector search and text search, providing users with a unified ranking of results based on semantic and linguistic relevance.

      Key Components
      Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
      Proposed Text Search Component:
      Uses BM25 as an example for scoring keyword-based relevance.
      BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: BM25
      Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, ensuring results that reflect both semantic understanding and precise keyword relevance.RRF from Microsoft

      How It Works:
      Query Processing:

      1. A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
      2. The vector embedding is compared against the indexed vectors in the database.
      3. The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.

      Rank Fusion:

      1. RRF assigns scores to results from both search methods.
      2. Text chunks ranking highly in either or both methods are given priority in the final ranking.
      3. The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.

      Benefits:

      1. Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
      2. User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
      3. Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.

      Attachments

        Issue Links

          Activity

            adamluciano Adam Luciano created issue -
            adamluciano Adam Luciano made changes -
            Field Original Value New Value
            Description *Feature Overview:*
            Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB server, using BM25 as an example algorithm. Reciprocal Rank Fusion (RRF) can be employed to merge results from vector search and text search, providing users with a unified ranking of results based on semantic and linguistic relevance.

            *Key Components*
            Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
            Proposed Text Search Component:
            Uses BM25 as an example for scoring keyword-based relevance.
            BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: [BM25|https://en.wikipedia.org/wiki/Okapi_BM25
            Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, ensuring results that reflect both semantic understanding and precise keyword relevance.[RRF from Microsoft|https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking]

            *How It Works:*
            Query Processing:
            # A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
            # The vector embedding is compared against the indexed vectors in the database.
            # The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.

            Rank Fusion:
            # RRF assigns scores to results from both search methods.
            # Documents ranking highly in either or both methods are given priority in the final ranking.
            # The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.

            Benefits:
            # Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
            # User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
            # Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.
            *Feature Overview:*
            Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB server, using BM25 as an example algorithm. Reciprocal Rank Fusion (RRF) can be employed to merge results from vector search and text search, providing users with a unified ranking of results based on semantic and linguistic relevance.

            *Key Components*
            Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
            Proposed Text Search Component:
            Uses BM25 as an example for scoring keyword-based relevance.
            BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: [BM25|https://en.wikipedia.org/wiki/Okapi_BM25]
            Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, ensuring results that reflect both semantic understanding and precise keyword relevance.[RRF from Microsoft|https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking]

            *How It Works:*
            Query Processing:
            # A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
            # The vector embedding is compared against the indexed vectors in the database.
            # The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.

            Rank Fusion:
            # RRF assigns scores to results from both search methods.
            # Documents ranking highly in either or both methods are given priority in the final ranking.
            # The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.

            Benefits:
            # Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
            # User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
            # Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.
            adamluciano Adam Luciano made changes -
            Description *Feature Overview:*
            Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB server, using BM25 as an example algorithm. Reciprocal Rank Fusion (RRF) can be employed to merge results from vector search and text search, providing users with a unified ranking of results based on semantic and linguistic relevance.

            *Key Components*
            Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
            Proposed Text Search Component:
            Uses BM25 as an example for scoring keyword-based relevance.
            BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: [BM25|https://en.wikipedia.org/wiki/Okapi_BM25]
            Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, ensuring results that reflect both semantic understanding and precise keyword relevance.[RRF from Microsoft|https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking]

            *How It Works:*
            Query Processing:
            # A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
            # The vector embedding is compared against the indexed vectors in the database.
            # The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.

            Rank Fusion:
            # RRF assigns scores to results from both search methods.
            # Documents ranking highly in either or both methods are given priority in the final ranking.
            # The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.

            Benefits:
            # Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
            # User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
            # Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.
            *Feature Overview:*
            Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB server, using BM25 as an example algorithm. Reciprocal Rank Fusion (RRF) can be employed to merge results from vector search and text search, providing users with a unified ranking of results based on semantic and linguistic relevance.

            *Key Components*
            Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
            Proposed Text Search Component:
            Uses BM25 as an example for scoring keyword-based relevance.
            BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: [BM25|https://en.wikipedia.org/wiki/Okapi_BM25]
            Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, ensuring results that reflect both semantic understanding and precise keyword relevance.[RRF from Microsoft|https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking]

            *How It Works:*
            Query Processing:
            # A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
            # The vector embedding is compared against the indexed vectors in the database.
            # The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.

            Rank Fusion:
            # RRF assigns scores to results from both search methods.
            # Text chunks ranking highly in either or both methods are given priority in the final ranking.
            # The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.

            Benefits:
            # Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
            # User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
            # Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.
            adamluciano Adam Luciano made changes -
            Due Date 2024-12-11

            This can be expressed in SQL like

            SELECT id, 1/RANK() OVER (ORDER BY VEC_DISTANCE(vec, ?)) + 1/RANK() OVER (ORDER BY MATCH text AGANST (?)) AS rrf
            FROM t1 ORDER BY rrf DESC LIMIT 10
            

            And it works already now.

            The tricky part is to make optimizer to use the vector index here. The algorithm can approximately look like that:

            • below we'll merge two streams of rows in the index (distance or relevance) order.
            • read first N (say, 10) rows from both indexes, push in a priority queue ordered by rrf desc
            • keep reading rows from indexes and pushing until the difference between first two entries in the queue is more than 1/N
            • keep popping and returning rows from the queue until difference between first two entries is less than 1/N
            • repeat
            serg Sergei Golubchik added a comment - This can be expressed in SQL like SELECT id, 1/RANK() OVER ( ORDER BY VEC_DISTANCE(vec, ?)) + 1/RANK() OVER ( ORDER BY MATCH text AGANST (?)) AS rrf FROM t1 ORDER BY rrf DESC LIMIT 10 And it works already now. The tricky part is to make optimizer to use the vector index here. The algorithm can approximately look like that: below we'll merge two streams of rows in the index (distance or relevance) order. read first N (say, 10) rows from both indexes, push in a priority queue ordered by rrf desc keep reading rows from indexes and pushing until the difference between first two entries in the queue is more than 1/N keep popping and returning rows from the queue until difference between first two entries is less than 1/N repeat
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 11.9 [ 29945 ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Sergei Petrunia [ psergey ]
            psergei Sergei Petrunia made changes -
            adamluciano Adam Luciano made changes -
            Priority Critical [ 2 ] Minor [ 4 ]

            Before this is implemented, one can perform hybrid search by explicitly specifying limit, like in https://github.com/pgvector/pgvector-python/blob/master/examples/hybrid_search/rrf.py

            serg Sergei Golubchik added a comment - Before this is implemented, one can perform hybrid search by explicitly specifying limit, like in https://github.com/pgvector/pgvector-python/blob/master/examples/hybrid_search/rrf.py
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 12.0 [ 29945 ]
            adamluciano Adam Luciano made changes -
            Priority Minor [ 4 ] Major [ 3 ]
            adamluciano Adam Luciano made changes -
            Fix Version/s 12.1 [ 29992 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            adamluciano Adam Luciano made changes -
            Issue Type New Feature [ 2 ] Epic [ 5 ]
            adamluciano Adam Luciano made changes -
            Fix Version/s 12.1 [ 29992 ]

            People

              psergei Sergei Petrunia
              adamluciano Adam Luciano
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.