  1. MariaDB Server
  2. MDEV-36568

Research: Enhancing MariaDB Full Text Search

Details

    • Q1/2026 Server Development

    Description

      Objective

      To determine the feasibility and performance benefits of implementing advanced full-text search in MariaDB. The options under consideration include BM25 ranking implemented with a sparse vector approach and integration of the Tantivy search engine.

      1. Background & Hypothesis
      MariaDB's current full-text search ranks results with a TF-IDF-based formula, and its relevance can be improved. We hypothesize that integrating either BM25 ranking via sparse vectors (an approach inspired by modern search solutions like Milvus) or the Tantivy search engine will improve search relevance at an acceptable performance cost. This research will test that hypothesis by evaluating both options.

      2. Key Research Questions
      This investigation will answer:

      Integration: What are the most viable paths to integrate either a BM25 ranking function via sparse vectors or Tantivy into MariaDB's query engine and FTS syntax? Are there other options worth considering?
      Storage: For BM25, what is the optimal way to represent and store the sparse vectors, and what is the storage cost compared to the current FTS index? For Tantivy, what are the storage implications and costs relative to the current FTS index?
      Performance: How do prototypes for BM25 sparse vectors and Tantivy compare against the existing InnoDB FTS in terms of query latency, relevance, and resource usage?
      Implementation: What can be learned from existing open-source BM25 implementations (e.g., in Postgres) and Tantivy to accelerate a potential build?
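
To make the Storage and Performance questions concrete, the following is a minimal sketch of one way BM25 can be expressed as a sparse vector dot product. It is an illustration under stated assumptions, not a description of how Milvus, Postgres extensions, or MariaDB implement it. The function names, the whitespace tokenizer, and the parameter values `K1 = 1.2` and `B = 0.75` are assumptions chosen for the example. Each document is stored as a map from term id to a length-normalized term-frequency weight, each query as a map from term id to IDF, and the BM25 score is their dot product.

```python
import math
from collections import Counter

K1 = 1.2   # term-frequency saturation (commonly used default; an assumption here)
B = 0.75   # document-length normalization strength (commonly used default; an assumption here)


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def build_index(docs):
    """Return (vocab, idf, doc_vectors) for a list of document strings."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs

    vocab = {}            # term -> integer term id
    doc_freq = Counter()  # term -> number of documents containing it
    for tokens in tokenized:
        for term in sorted(set(tokens)):
            vocab.setdefault(term, len(vocab))
            doc_freq[term] += 1

    # IDF with the +1 inside the log, which keeps every value positive.
    idf = {
        vocab[term]: math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for term, df in doc_freq.items()
    }

    # One sparse vector per document: term id -> saturated TF weight.
    doc_vectors = []
    for tokens in tokenized:
        norm = K1 * (1 - B + B * len(tokens) / avgdl)
        doc_vectors.append({
            vocab[term]: tf * (K1 + 1) / (tf + norm)
            for term, tf in Counter(tokens).items()
        })
    return vocab, idf, doc_vectors


def query_vector(query, vocab, idf):
    # Sparse query vector: term id -> IDF; unknown terms are dropped.
    return {vocab[t]: idf[vocab[t]] for t in set(tokenize(query)) if t in vocab}


def sparse_dot(a, b):
    # Iterate over the smaller vector and probe the larger one.
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b[i] for i, w in a.items() if i in b)


def search(query, vocab, idf, doc_vectors):
    """Return (score, doc_index) pairs with a positive score, best first."""
    q = query_vector(query, vocab, idf)
    scored = ((sparse_dot(q, d), i) for i, d in enumerate(doc_vectors))
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)
```

Note that the document-side weights depend on the average document length, which changes as rows are inserted or deleted. A real design would need to decide whether to store raw term frequencies and normalize at query time, or to refresh stored weights periodically; that trade-off bears directly on the storage cost question above.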

      3. Research Plan & Deliverables

      Analyze: Review existing implementations of BM25 sparse embeddings (Milvus, Postgres) and Tantivy, along with MariaDB's FTS architecture, to identify integration strategies for each option. Other options may also be considered.
      Prototype: Develop minimal Proofs of Concept (PoCs) to model the BM25 algorithm with sparse vector data storage and a Tantivy integration.
      Benchmark: Compare the PoCs' performance and relevance against MariaDB's native FTS using a standard dataset.
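
For the relevance part of the benchmark, one commonly used metric is nDCG@k, which compares a ranked result list against graded relevance judgments. The sketch below is only an illustration of how such a comparison could be scored; the ticket does not prescribe a metric, and the function names, the linear gain variant, and the shape of the `judgments` mapping are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    # Discounted cumulative gain with linear gain: rel / log2(rank + 1), ranks starting at 1.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, judgments, k):
    """ranked_doc_ids: engine output, best first.
    judgments: doc id -> graded relevance (hypothetical format); missing ids count as 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(judgments.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Running the same query set through the native FTS and each PoC, then averaging `ndcg_at_k` per engine, would give one comparable relevance number alongside the latency and resource measurements.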

      Key Deliverable: A technical write-up containing:

      A clear recommendation on whether to proceed with a full implementation, which option (BM25 sparse vectors, Tantivy, or neither) is preferred, and whether any alternatives should be considered.
      Comparative performance and relevance data across the options.
      A proposed high-level design for the recommended approach if the project is deemed feasible.

      Benefits

      Better results: Deliver more relevant outcomes across varied queries and document sizes.
      Market strength: Position MariaDB as a stronger full-text search option.
      User control: Enable customization through configurable settings in BM25 or Tantivy.
      Performance: Leverage efficient approaches like sparse embeddings or Tantivy's optimized indexing for quick searches with minimal resource use.

          People

            Assignee: Unassigned
            Reporter: Adam Luciano (adamluciano)