Details
Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: Q1/2026 Server Development
Description
Background on MariaDB's Current Vector Capabilities
MariaDB introduced vector support in version 11.7 and stabilized it in the 11.8 LTS release, using a modified HNSW algorithm that supports concurrent reads/writes and transaction isolation. This enables approximate nearest neighbor (ANN) searches over AI embeddings, but the public roadmap hints at further enhancements without committing to specific new index types. Adding alternatives could address HNSW's limitations in memory usage and scale.
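For context on what an ANN index approximates: the ground truth is an exact nearest-neighbor scan over every stored vector, which indexes like HNSW avoid at the cost of a small recall loss. A minimal NumPy sketch (the `exact_knn` name, random dataset, and all parameters are illustrative, not MariaDB internals):

```python
import numpy as np

def exact_knn(query, vectors, k=10):
    """Exact k-nearest-neighbor search by full scan (Euclidean).

    This O(N * d) scan is what ANN indexes such as HNSW approximate;
    its result is the ground truth used when measuring recall."""
    dists = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(dists)[:k]

rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 32)).astype(np.float32)
ids = exact_knn(base[0], base, k=5)
assert ids[0] == 0  # a stored vector is its own nearest neighbor
```

Recall figures such as the 95-99% cited for HNSW below are measured against exactly this kind of brute-force baseline.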
Value of Extending Beyond HNSW
HNSW excels at low-latency, high-recall searches over mid-sized datasets, but it requires substantial RAM, making it less suitable for massive corpora. DiskANN and SCANN offer paths to handle larger scales cost-effectively, potentially reducing infrastructure needs and supporting diverse AI use cases such as recommendation systems.
Goal
Evaluate alternatives like DiskANN, SCANN, IVF variants, and others beyond the current HNSW implementation, so that MariaDB can support billion-scale vector operations with improved efficiency, lower costs, and better performance for large AI workloads.
Description
MariaDB's vector feature currently relies on a modified HNSW for ANN searches, which handles concurrent operations well but is memory-bound, limiting scalability for datasets that exceed RAM capacity. This research will explore DiskANN for disk-optimized large-scale indexing, SCANN for quantized in-memory efficiency, and other approaches such as IVF, to identify additions that improve scale without sacrificing accuracy. The value includes reduced RAM usage (e.g., DiskANN's reported ~40x savings), faster builds/queries in some cases, and support for dynamic updates, positioning MariaDB competitively against pgvector and Milvus.
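The RAM savings of SCANN- and IVF-PQ-style indexes come largely from product quantization: each vector is split into subvectors, and each subvector is replaced by a one-byte index into a small codebook. A toy sketch of the encode/decode cycle (real PQ trains codebooks with k-means; the random-sample codebooks, sizes, and names here are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, ncodes = 64, 8, 256          # 8 subspaces, one 1-byte code each
sub = d // m
train = rng.standard_normal((5000, d)).astype(np.float32)

# toy codebooks: random training subvectors (real PQ runs k-means per subspace)
books = [train[rng.choice(len(train), ncodes), i*sub:(i+1)*sub] for i in range(m)]

def encode(v):
    # one byte per subspace: index of the closest codebook entry
    return np.array([np.argmin(np.linalg.norm(books[i] - v[i*sub:(i+1)*sub], axis=1))
                     for i in range(m)], dtype=np.uint8)

def decode(code):
    # lossy reconstruction from the codebook entries
    return np.concatenate([books[i][code[i]] for i in range(m)])

code = encode(train[0])            # 8 bytes vs 64 * 4 = 256 bytes raw
print(256 / code.nbytes)           # → 32.0
```

The 32x compression here is why such indexes can keep billion-scale candidate sets in memory (or, in DiskANN's case, keep compressed vectors in RAM while full vectors stay on SSD), at the cost of the lossy reconstruction that motivates re-ranking.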
Acceptance Criteria
Compile pros/cons, key differences, and use cases for each algorithm.
Identify benchmarks (e.g., using ANN-Benchmarks with SIFT1B dataset) measuring QPS, recall, memory, build time, and update efficiency.
Perform or simulate tests on sample hardware to compare against HNSW.
Deliver a follow-up Jira ticket proposing a recommendation on next steps, including integration feasibility and a time estimate to complete.
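For the benchmarking criteria above, recall@k and QPS are straightforward to compute once a ground-truth (exact) searcher is available. A small measurement harness sketch (function names and the trivial exact searcher used for the smoke test are illustrative; a real run would plug in each candidate index's search function):

```python
import time
import numpy as np

def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the approximate index returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k

def measure_qps(search_fn, queries):
    """Queries per second for a single-threaded sequential run."""
    t0 = time.perf_counter()
    for q in queries:
        search_fn(q)
    return len(queries) / (time.perf_counter() - t0)

# smoke test against a trivial exact searcher
rng = np.random.default_rng(2)
base = rng.standard_normal((2000, 16)).astype(np.float32)

def exact(q, k=10):
    return np.argsort(np.linalg.norm(base - q, axis=1))[:k]

r = recall_at_k(exact(base[0]), exact(base[0]))
assert r == 1.0  # exact search recalls itself perfectly
```

ANN-Benchmarks reports essentially these two metrics as a recall/QPS trade-off curve per parameter setting, which is the form the comparison deliverable should take.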
Notes on indexing approaches from preliminary research (please validate with your own testing):
| Algorithm | Key Mechanism | Memory Usage | Build Time | Query Time | Recall@10 | Best Use Case | Limitations |
|---|---|---|---|---|---|---|---|
| MHNSW | Hierarchical graph navigation | Moderate (~50% data) | O(N log N), slow | O(log N), fast | 95-99% | Mid-scale, dynamic updates | Memory-bound, inefficient deletions |
| DiskANN | Vamana graph with SSD optimization | Low (disk-reliant, e.g., 96GB for 1B vectors) | Slow, compute-heavy | Moderate (15ms for 1B) | 90-95%+ | Billion-scale, cost-sensitive | Immutable base, slower builds |
| SCANN | Partitioning + anisotropic quantization + re-ranking | Low (10-30% data) | O(N log N) | O(√N), fast | 95-98% | High-throughput in-memory | Batch updates, complex tuning |
| IVF-PQ | Clustering + product quantization | Moderate-low with compression | Fast | High for clustered data | 80-95% | Balanced large top-K | Lower recall without refinements |
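To make the IVF-PQ row concrete, the "clustering" half (IVF without the PQ compression) works by assigning vectors to coarse cells and scanning only the few cells nearest the query, which is why recall drops when the true neighbor sits in an unprobed cell. A self-contained toy sketch (the mini Lloyd's loop, cell counts, and names are illustrative; production systems use trained k-means and compressed codes):

```python
import numpy as np

rng = np.random.default_rng(3)
base = rng.standard_normal((5000, 16)).astype(np.float32)
nlist, nprobe = 32, 4               # coarse cells; cells scanned per query

# a few Lloyd iterations stand in for proper k-means training
cents = base[rng.choice(len(base), nlist, replace=False)].copy()
for _ in range(5):
    assign = np.argmin(((base[:, None] - cents[None]) ** 2).sum(-1), axis=1)
    for c in range(nlist):
        if (assign == c).any():
            cents[c] = base[assign == c].mean(0)
# final assignment against the trained centroids
assign = np.argmin(((base[:, None] - cents[None]) ** 2).sum(-1), axis=1)
lists = {c: np.where(assign == c)[0] for c in range(nlist)}

def ivf_search(q, k=10):
    # probe only the nprobe nearest cells instead of the whole base set
    near = np.argsort(((cents - q) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([lists[c] for c in near])
    order = np.argsort(np.linalg.norm(base[cand] - q, axis=1))
    return cand[order[:k]]

ids = ivf_search(base[0])
assert ids[0] == 0  # the query's own cell is always among those probed
```

Raising `nprobe` trades query time for recall, which is the tuning knob behind the "80-95%" recall range in the table; PQ then compresses the vectors inside each cell as sketched earlier.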