  1. MariaDB Server
  2. MDEV-38081

Research Additional Vector Indexing Algorithms for Higher Scalability in MariaDB


Details

    • Task
    • Status: Open
    • Major
    • Resolution: Unresolved
    • ROADMAP
    • Q1/2026 Server Development

    Description

      Background on MariaDB's Current Vector Capabilities
      MariaDB introduced vector support in version 11.7 and stabilized it in the 11.8 LTS release. The implementation uses a modified HNSW (Hierarchical Navigable Small World) algorithm that supports concurrent reads and writes and transaction isolation. This enables approximate nearest neighbor (ANN) search over AI embeddings. Current roadmaps hint at enhancements but give no specifics on new index types. Adding alternative indexes could address HNSW's limitations in memory usage and scale.

      Value of Extending Beyond HNSW
      HNSW excels at low-latency, high-recall search on mid-sized datasets but requires substantial RAM, which makes it a poor fit for very large corpora. DiskANN and ScaNN offer ways to handle larger scales cost-effectively, potentially reducing infrastructure needs and supporting diverse AI use cases such as recommendation systems.

      Goal
      Evaluate alternatives to the current HNSW implementation, such as DiskANN, ScaNN, and IVF variants, so that MariaDB can support billion-scale vector operations with improved efficiency, lower costs, and better performance for large AI workloads.

      Description
      MariaDB's vector feature currently relies on a modified HNSW index for ANN search. It handles concurrent operations well but is memory-bound, which limits scalability for datasets that exceed RAM capacity. This research will explore DiskANN for disk-optimized large-scale indexing, ScaNN for quantized in-memory efficiency, and others such as IVF, to identify additions that improve scale without sacrificing accuracy. Expected benefits include reduced RAM usage (e.g., the 40x savings cited for DiskANN), faster builds or queries in some cases, and support for dynamic updates, positioning MariaDB competitively against pgvector and Milvus.
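The memory argument above can be made concrete with a back-of-envelope estimate. The sketch below is an illustration only: the function names, the 128-dimensional float32 vectors, the neighbor count of 16, and the 32-byte product-quantization codes are all assumptions chosen for the example, and the 2 x M base-layer link count is a common HNSW convention that MariaDB's modified HNSW may not follow.

```python
# Rough memory estimates for one billion vectors.
# All parameters are illustrative assumptions, not MariaDB measurements.

def raw_vector_bytes(n_vectors: int, dim: int, bytes_per_component: int = 4) -> int:
    """Bytes needed to hold the uncompressed vectors (float32 by default)."""
    return n_vectors * dim * bytes_per_component


def graph_link_bytes(n_vectors: int, max_neighbors: int, bytes_per_link: int = 4) -> int:
    """Bytes for base-layer graph links, assuming up to 2 * M links per node.

    Upper HNSW layers hold only a small fraction of nodes and are ignored here.
    """
    return n_vectors * 2 * max_neighbors * bytes_per_link


def pq_code_bytes(n_vectors: int, n_subquantizers: int, bits_per_code: int = 8) -> int:
    """Bytes for product-quantized codes (one small code per sub-vector)."""
    return n_vectors * n_subquantizers * bits_per_code // 8


if __name__ == "__main__":
    n, dim = 10**9, 128
    gb = 10**9
    print(f"raw float32 vectors: {raw_vector_bytes(n, dim) / gb:.0f} GB")  # 512 GB
    print(f"graph links (M=16):  {graph_link_bytes(n, 16) / gb:.0f} GB")   # 128 GB
    print(f"PQ codes (32 bytes): {pq_code_bytes(n, 32) / gb:.0f} GB")      # 32 GB
```

Under these assumptions a fully in-memory graph index needs the raw vectors plus the links, while a disk-based or quantized design can keep only the compressed codes in RAM. Real figures depend on the implementation and should be measured.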

      Acceptance Criteria
      • Compile pros and cons, key differences, and use cases for each algorithm.
      • Identify benchmarks (e.g., ANN-Benchmarks, or Big-ANN-Benchmarks for billion-scale datasets such as SIFT1B) that measure QPS, recall, memory, build time, and update efficiency.
      • Perform or simulate tests on sample hardware to compare each candidate against HNSW.
      • Deliver a Jira issue proposing a recommendation on what to do next, including integration feasibility and a time estimate to complete.
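As a starting point for the benchmark criteria above, the following sketch shows one way recall@k and QPS could be measured against exact brute-force ground truth. It is a minimal illustration in pure Python: `benchmark`, `exact_top_k`, and `half_scan_search` are hypothetical names, and the "approximate" search is a deliberately crude stand-in, not any of the algorithms under evaluation.

```python
import random
import time


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_top_k(data, query, k):
    """Brute-force ground truth: ids of the k nearest vectors."""
    return sorted(range(len(data)), key=lambda i: l2_sq(data[i], query))[:k]


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k that the approximate result returned."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


def benchmark(search_fn, data, queries, k):
    """Return mean recall@k and queries per second for search_fn(query, k)."""
    truth = [exact_top_k(data, q, k) for q in queries]  # not timed
    start = time.perf_counter()
    results = [search_fn(q, k) for q in queries]
    elapsed = time.perf_counter() - start
    mean_recall = sum(
        recall_at_k(r, t, k) for r, t in zip(results, truth)
    ) / len(queries)
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return {"recall": mean_recall, "qps": qps}


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.random() for _ in range(16)] for _ in range(2000)]
    queries = [[rng.random() for _ in range(16)] for _ in range(20)]

    # Hypothetical stand-in for an ANN index: scans only half the data.
    def half_scan_search(query, k):
        half = len(data) // 2
        return sorted(range(half), key=lambda i: l2_sq(data[i], query))[:k]

    print(benchmark(half_scan_search, data, queries, k=10))
```

In a real evaluation, `search_fn` would wrap a query against each candidate index, and memory, build time, and update cost would be recorded separately.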

      Notes on other indexing approaches from research - please validate with own testing:

      Algorithm | Key Mechanism | Memory Usage | Build Time | Query Time | Recall@10 | Best Use Case | Limitations
      ----------|---------------|--------------|------------|------------|-----------|---------------|------------
      MHNSW | Hierarchical graph navigation | Moderate (~50% data) | O(N log N), slow | O(log N), fast | 95-99% | Mid-scale, dynamic updates | Memory-bound, inefficient deletions
      DiskANN | Vamana graph with SSD optimization | Low (disk-reliant, e.g., 96GB for 1B vectors) | Slow, compute-heavy | Moderate (15ms for 1B) | 90-95%+ | Billion-scale, cost-sensitive | Immutable base, slower builds
      ScaNN | Partitioning + anisotropic quantization + re-ranking | Low (10-30% data) | O(N log N) | O(√N), fast | 95-98% | High-throughput in-memory | Batch updates, complex tuning
      IVF-PQ | Clustering + product quantization | Moderate-low with compression | Fast | High for clustered data | 80-95% | Balanced large top-K | Lower recall without refinements

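To illustrate the "clustering" mechanism listed for IVF-PQ in the notes above, here is a minimal inverted-file (IVF) sketch in pure Python. It covers only the coarse clustering and list probing; the product quantization step is omitted, so this is IVF-Flat rather than IVF-PQ. All names (`train_centroids`, `build_ivf`, `ivf_search`, `nprobe`) and the tiny k-means routine are assumptions for illustration, not MariaDB or library APIs.

```python
import random


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vec, centroids):
    """Index of the centroid closest to vec."""
    return min(range(len(centroids)), key=lambda c: l2_sq(vec, centroids[c]))


def train_centroids(data, n_lists, iterations=5, seed=0):
    """A few rounds of plain k-means, seeded from random data points."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(data, n_lists)]
    for _ in range(iterations):
        buckets = [[] for _ in range(n_lists)]
        for v in data:
            buckets[nearest_centroid(v, centroids)].append(v)
        for c, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[c] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids


def build_ivf(data, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = [[] for _ in centroids]
    for i, v in enumerate(data):
        lists[nearest_centroid(v, centroids)].append(i)
    return lists


def ivf_search(data, centroids, lists, query, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2_sq(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in lists[c]]
    return sorted(candidates, key=lambda i: l2_sq(data[i], query))[:k]


if __name__ == "__main__":
    rng = random.Random(7)
    data = [[rng.random() for _ in range(8)] for _ in range(500)]
    centroids = train_centroids(data, n_lists=10)
    lists = build_ivf(data, centroids)
    query = [rng.random() for _ in range(8)]
    print(ivf_search(data, centroids, lists, query, k=5, nprobe=2))
```

Raising `nprobe` trades query time for recall; probing every list degenerates to an exact scan, which is why the table notes lower recall without refinements.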
          People

            Assignee: Unassigned
            Reporter: Adam Luciano (adamluciano)