  MariaDB Server / MDEV-35629

Hybrid Search (Text + Vector) w/ Simple Re-Ranking

Details

    • Epic
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • None
    • Vector search
    • None

    Description

      Feature Overview:
      Hybrid search combines the strengths of vector search and keyword-based search to deliver more relevant and comprehensive results. This feature proposes integrating a text search component into MariaDB Server, using BM25 as an example scoring algorithm. Reciprocal Rank Fusion (RRF) can then merge the results of vector search and text search, giving users a single ranking based on both semantic and linguistic relevance.

      Key Components
      Vector Search: Leverages the database's existing capabilities to retrieve results based on semantic similarity in the vector space.
      Proposed Text Search Component: Uses BM25 as an example for scoring keyword-based relevance. BM25 ranks documents by considering term frequency, inverse document frequency, and document length normalization. Ref: BM25
      Reciprocal Rank Fusion (RRF): Combines rankings from vector search and text search, so that results reflect both semantic understanding and precise keyword relevance. Ref: RRF from Microsoft
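
To make the BM25 component concrete, here is a minimal Python sketch of the scoring formula. It is an illustration only, not the proposed server implementation: the function name `bm25_scores`, the whitespace tokenization, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Inverse document frequency (the "+ 1" variant keeps it non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency, saturated by k1 and normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "mariadb supports vector search".split(),
    "full text search ranks documents by keyword".split(),
    "the weather is nice today".split(),
]
scores = bm25_scores(["vector", "search"], docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The first document matches both query terms and is the shortest match, so it ranks first. The third matches nothing and scores zero.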

      How It Works:
      Query Processing:

      1. A user query is processed for both semantic embedding (vector search) and keyword extraction (text search).
      2. The vector embedding is compared against the indexed vectors in the database.
      3. The keywords are scored using BM25 or a similar algorithm to rank results based on textual relevance.
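
Step 2 above can be sketched as follows. This is a simplified illustration that assumes cosine distance as the metric and uses an exhaustive scan in place of the server's vector index. The names `cosine_distance`, `vector_rank`, and the sample vectors are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def vector_rank(query_vec, doc_vecs, top_k=10):
    """Return document ids ordered by increasing cosine distance to the query."""
    ordered = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_distance(query_vec, doc_vecs[doc_id]),
    )
    return ordered[:top_k]

# Hypothetical two-dimensional embeddings; real embeddings have many more dimensions.
doc_vecs = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.7, 0.7],
    "doc_c": [0.0, 1.0],
}
vector_ranking = vector_rank([0.9, 0.1], doc_vecs)
print(vector_ranking)
```

The output is a best-first list of document ids, the same shape as the list produced by the text search side.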

      Rank Fusion:

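      1. RRF assigns each result a score of 1 / (k + rank) for its position in each search method's ranked list, where k is a smoothing constant (commonly 60), and sums these scores across methods.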
      2. Text chunks ranking highly in either or both methods are given priority in the final ranking.
      3. The fused ranking balances the strengths of both approaches, ensuring relevance across different query types.
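
The fusion step can be sketched in a few lines of Python. This is a minimal illustration, not a proposed API: the function name `reciprocal_rank_fusion` and the document ids are hypothetical, and `k=60` is the commonly used default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each list is ordered best-first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); absent ids contribute nothing.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

# Hypothetical outputs of the vector search and the text search.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
text_ranking = ["doc_b", "doc_d", "doc_a"]
fused_ranking = reciprocal_rank_fusion([vector_ranking, text_ranking])
print(fused_ranking)
```

Because RRF uses only rank positions, it needs no normalization between BM25 scores and vector distances, which lie on unrelated scales. In this example `doc_b` and `doc_a` appear in both lists and rise to the top, ahead of results found by only one method.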

      Benefits:

      1. Balanced Relevance: Delivers results that combine deep semantic understanding with precise keyword matching.
      2. User Flexibility: Supports diverse query types, including vague or broad descriptions and exact matches.
      3. Scalability and Adaptability: Suitable for handling large datasets and evolving user needs across various domains.

            People

              psergei Sergei Petrunia
              adamluciano Adam Luciano
              Votes: 0
              Watchers: 6
