Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: Q1/2026 Server Development
Description
Background on MariaDB's Current Vector Capabilities
MariaDB introduced vector support in version 11.7 and stabilized it in the 11.8 LTS release, using a modified HNSW algorithm that supports concurrent reads/writes and transaction isolation. This enables approximate nearest neighbor (ANN) searches over AI embeddings, but the public roadmap hints at further enhancements without committing to specific new index types. Adding alternatives could address HNSW's limitations in memory usage and scale.
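For context on what an ANN index approximates: the ground truth is an exact nearest-neighbor scan over every stored vector, which indexes like HNSW avoid at the cost of a small recall loss. A minimal NumPy sketch (the `exact_knn` name, random dataset, and all parameters are illustrative, not MariaDB internals):

```python
import numpy as np

def exact_knn(query, vectors, k=10):
    """Exact k-nearest-neighbor search by full scan (Euclidean).

    This O(N * d) scan is what ANN indexes such as HNSW approximate;
    its result is the ground truth used when measuring recall."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 32)).astype(np.float32)
ids = exact_knn(base[0], base, k=5)
assert ids[0] == 0  # a stored vector is its own nearest neighbor
```

Recall figures such as the 95-99% cited for HNSW below are measured against exactly this kind of brute-force baseline.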
Value of Extending Beyond HNSW
HNSW excels at low-latency, high-recall searches over mid-sized datasets, but it requires substantial RAM, making it less suitable for massive corpora. DiskANN and SCANN offer paths to handle larger scales cost-effectively, potentially reducing infrastructure needs and supporting diverse AI use cases such as recommendation systems.
Goal
Evaluate alternatives like DiskANN, SCANN, IVF variants, and others beyond the current HNSW implementation, so that MariaDB can support billion-scale vector operations with improved efficiency, lower costs, and better performance for large AI workloads.
Description
MariaDB's vector feature currently relies on a modified HNSW for ANN searches, which handles concurrent operations well but is memory-bound, limiting scalability for datasets that exceed RAM capacity. This research will explore DiskANN for disk-optimized large-scale indexing, SCANN for quantized in-memory efficiency, and other approaches such as IVF, to identify additions that improve scale without sacrificing accuracy. The value includes reduced RAM usage (e.g., DiskANN's reported ~40x savings), faster builds/queries in some cases, and support for dynamic updates, positioning MariaDB competitively against pgvector and Milvus.
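The RAM savings of SCANN- and IVF-PQ-style indexes come largely from product quantization: each vector is split into subvectors, and each subvector is replaced by a one-byte index into a small codebook. A toy sketch of the encode/decode cycle (real PQ trains codebooks with k-means; the random-sample codebooks, sizes, and names here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, ncodes = 64, 8, 256          # 8 subspaces, one 1-byte code each
sub = d // m
train = rng.standard_normal((5000, d)).astype(np.float32)

# toy codebooks: random training subvectors (real PQ runs k-means per subspace)
books = [train[rng.choice(len(train), ncodes), i*sub:(i+1)*sub] for i in range(m)]

def encode(v):
    # one byte per subspace: index of the closest codebook entry
    return np.array([np.argmin(np.linalg.norm(books[i] - v[i*sub:(i+1)*sub], axis=1))
                     for i in range(m)], dtype=np.uint8)

def decode(code):
    # lossy reconstruction from the codebook entries
    return np.concatenate([books[i][code[i]] for i in range(m)])

code = encode(train[0])            # 8 bytes vs 64 * 4 = 256 bytes raw
print(256 / code.nbytes)           # → 32.0
```

The 32x compression here is why such indexes can keep billion-scale candidate sets in memory (or, in DiskANN's case, keep compressed vectors in RAM while full vectors stay on SSD), at the cost of the lossy reconstruction that motivates re-ranking.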
Acceptance Criteria
Compile pros/cons, key differences, and use cases for each algorithm.
Identify benchmarks (e.g., using ANN-Benchmarks with SIFT1B dataset) measuring QPS, recall, memory, build time, and update efficiency.
Perform or simulate tests on sample hardware to compare against HNSW.
Deliver a follow-up Jira ticket proposing a recommendation on next steps, including integration feasibility and a time estimate to complete.
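For the benchmarking criteria above, recall@k and QPS are straightforward to compute once a ground-truth (exact) searcher is available. A small measurement harness sketch (function names and the trivial exact searcher used for the smoke test are illustrative; a real run would plug in each candidate index's search function):

```python
import time
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate index returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def measure_qps(search_fn, queries):
    """Queries per second for a single-threaded sequential run."""
    t0 = time.perf_counter()
    for q in queries:
        search_fn(q)
    return len(queries) / (time.perf_counter() - t0)

# smoke test against a trivial exact searcher
rng = np.random.default_rng(2)
base = rng.standard_normal((2000, 16)).astype(np.float32)

def exact(q, k=10):
    return np.argsort(np.linalg.norm(base - q, axis=1))[:k]

r = recall_at_k(exact(base[0]), exact(base[0]))
assert r == 1.0  # exact search recalls itself perfectly
```

ANN-Benchmarks reports essentially these two metrics as a recall/QPS trade-off curve per parameter setting, which is the form the comparison deliverable should take.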
Notes on indexing approaches from preliminary research (please validate with your own testing):
| Algorithm | Key Mechanism | Memory Usage | Build Time | Query Time | Recall@10 | Best Use Case | Limitations |
|---|---|---|---|---|---|---|---|
| MHNSW | Hierarchical graph navigation | Moderate (~50% data) | O(N log N), slow | O(log N), fast | 95-99% | Mid-scale, dynamic updates | Memory-bound, inefficient deletions |
| DiskANN | Vamana graph with SSD optimization | Low (disk-reliant, e.g., 96GB for 1B vectors) | Slow, compute-heavy | Moderate (15ms for 1B) | 90-95%+ | Billion-scale, cost-sensitive | Immutable base, slower builds |
| SCANN | Partitioning + anisotropic quantization + re-ranking | Low (10-30% data) | O(N log N) | O(√N), fast | 95-98% | High-throughput in-memory | Batch updates, complex tuning |
| IVF-PQ | Clustering + product quantization | Moderate-low with compression | Fast | High for clustered data | 80-95% | Balanced large top-K | Lower recall without refinements |
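To make the IVF-PQ row concrete, the "clustering" half (IVF without the PQ compression) works by assigning vectors to coarse cells and scanning only the few cells nearest the query, which is why recall drops when the true neighbor sits in an unprobed cell. A self-contained toy sketch (the mini Lloyd's loop, cell counts, and names are illustrative; production systems use trained k-means and compressed codes):

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.standard_normal((5000, 16)).astype(np.float32)
nlist, nprobe = 32, 4               # coarse cells; cells scanned per query

# a few Lloyd iterations stand in for proper k-means training
cents = base[rng.choice(len(base), nlist, replace=False)].copy()
for _ in range(5):
    assign = np.argmin(((base[:, None] - cents[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        if (assign == c).any():
            cents[c] = base[assign == c].mean(0)
# final assignment against the trained centroids
assign = np.argmin(((base[:, None] - cents[None]) ** 2).sum(-1), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(q, k=10):
    # probe only the nprobe nearest cells instead of the whole base set
    near = np.argsort(((cents - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in near])
    order = np.argsort(np.linalg.norm(base[cand] - q, axis=1))
    return cand[order[:k]]

ids = ivf_search(base[0])
assert ids[0] == 0  # the query's own cell is always among those probed
```

Raising `nprobe` trades query time for recall, which is the tuning knob behind the "80-95%" recall range in the table; PQ then compresses the vectors inside each cell as sketched earlier.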