Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Q1/2026 Server Development
Description
Objective
To determine the feasibility and performance benefits of implementing advanced full-text search in MariaDB. Candidate approaches include BM25 ranking implemented via sparse vectors, plain BM25, or another option.
1. Background & Hypothesis
MariaDB's current TF-IDF-based full-text search can be improved. We hypothesize that BM25 via sparse vectors (an approach inspired by modern search solutions such as Milvus), plain BM25, or another method will provide better search relevance with acceptable performance overhead. This research will test that hypothesis by evaluating these options.
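To make the "BM25 via sparse vectors" idea concrete, here is a minimal sketch, not a proposed MariaDB design. It assumes a whitespace tokenizer, the common defaults k1 = 1.2 and b = 0.75, and a Lucene-style non-negative IDF; all function names are hypothetical. Each document becomes a sparse map of term to saturated, length-normalized term-frequency weight, each query becomes a sparse map of term to IDF, and the BM25 score is their dot product.

```python
import math
from collections import Counter

K1 = 1.2   # term-frequency saturation (common default, assumed here)
B = 0.75   # document-length normalization (common default, assumed here)


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def build_index(docs):
    """Return (doc_vectors, idf): one sparse term->weight dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Lucene-style IDF variant, which never goes negative.
    idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        norm = K1 * (1 - B + B * len(toks) / avgdl)
        # The document side of BM25 is stored as the sparse vector.
        vectors.append({t: c * (K1 + 1) / (c + norm) for t, c in tf.items()})
    return vectors, idf


def query_vector(query, idf):
    """The query side of BM25: one IDF weight per known query term."""
    return {t: idf[t] for t in set(tokenize(query)) if t in idf}


def score(qvec, dvec):
    """BM25 score as a sparse dot product."""
    return sum(w * dvec.get(t, 0.0) for t, w in qvec.items())


docs = [
    "the quick brown fox",
    "the lazy dog sleeps all day long",
    "quick quick fox jumps",
]
vectors, idf = build_index(docs)
qvec = query_vector("quick fox", idf)
scores = [score(qvec, v) for v in vectors]
```

In this toy corpus the third document ranks first because it repeats "quick", but the gain from the second occurrence is smaller than from the first, which is the saturation behavior that distinguishes BM25 from raw term-frequency weighting.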
2. Key Research Questions
This investigation will answer:
Integration: What are the most viable paths to integrate a BM25 ranking function, either via sparse vectors or as plain BM25, into MariaDB's query engine and FTS syntax? Are there other options?
Performance: How do prototypes for BM25 sparse vectors and Tantivy (a full-text search library written in Rust that ranks with BM25) compare against the existing InnoDB FTS in terms of query latency, relevance, and resource usage? Are there other algorithm options?
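For the query-latency part of the performance question, a comparison is usually easier to read as percentiles than as a single average. The following is a small sketch of such a harness; `time_queries`, `summarize`, and the `run_query` callable are hypothetical names, and `run_query` stands in for whatever executes one search against the engine under test.

```python
import statistics
import time


def time_queries(run_query, queries):
    """Run each query once and return per-query latencies in milliseconds."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def summarize(latencies_ms):
    """Summarize latencies as mean, median, and tail percentiles."""
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(latencies_ms),
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
    }
```

Running the same query list through each candidate and comparing the `p95` and `p99` values would show tail behavior that a mean alone hides.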
3. Research Plan & Deliverables
Analyze: Review existing implementations of BM25 sparse embeddings (Milvus, Postgres) and Tantivy, along with MariaDB's FTS architecture, to identify integration strategies for each option. Other options are welcome too.
Prototype: Develop minimal Proofs of Concept (PoCs) to model the BM25 algorithm with sparse vector data storage.
Benchmark: Compare the PoCs' performance and relevance against MariaDB's native FTS using the MS MARCO Passage Ranking v1 dataset (8.8M passages, 1M queries).
Subsample: 100K passages, 5K dev queries (for iteration speed).
Key Deliverable: A technical write-up containing:
A clear recommendation on whether to proceed with a full implementation, which option (BM25, BM25 sparse vectors, or another option) is preferred, and whether any alternatives should be considered.
Comparative performance and relevance data across the options.
A proposed high-level design for the recommended approach if the project is deemed feasible.
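For the relevance side of the benchmark in the plan above, MS MARCO passage ranking results are commonly reported as MRR@10 (mean reciprocal rank of the first relevant passage within the top 10). The plan itself does not name a metric, so treat this as one possible choice. A minimal sketch follows; `mrr_at_k` and the dictionary shapes are assumptions for illustration.

```python
def mrr_at_k(rankings, qrels, k=10):
    """Mean reciprocal rank at cutoff k.

    rankings: {query_id: [passage_id, ...]} ordered best first.
    qrels:    {query_id: set of relevant passage_ids}.
    Queries with no relevant passage in the top k contribute 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        top_k = rankings.get(qid, [])[:k]
        for rank, pid in enumerate(top_k, start=1):
            if pid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)


# Toy data: q1 finds its answer at rank 2, q2 at rank 1, q3 not at all.
rankings = {"q1": ["p3", "p1"], "q2": ["p9", "p8", "p7"], "q3": []}
qrels = {"q1": {"p1"}, "q2": {"p9"}, "q3": {"p5"}}
result = mrr_at_k(rankings, qrels)
```

Computing the same metric over the rankings returned by native FTS and by each PoC, on the same query subsample, gives directly comparable relevance numbers.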
Benefits
Better results: Deliver more relevant outcomes across varied queries and document sizes.
Market strength: Position MariaDB as a stronger full-text search option.
User control: Enable customization through configurable settings.
Performance: Leverage efficient approaches for quick searches with minimal resource use.