Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-35418

Better support for multiple duplicates vectors in the index

    XMLWordPrintable

Details

    • Task
    • Status: Open (View Workflow)
    • Major
    • Resolution: Unresolved
    • None
    • Vector search
    • None

    Description

      HNSW doesn't work well when the data set contains many identical (or nearly identical) vectors. It tends to cross-link them together and if their number if larger than 2*M, it will create no outbound links to the rest of the graph.

      Lots of duplicate vectors are uncommon, but can easily happen in MariaDB with system versioned tables, for example. And they surely show up in tests.

      Almost-duplicates come from duplicate vectors that became slightly different because of floating point calculation errors. That is, essentially they're duplicates.

      Currently this is solved by having select_neighbor() heuristic to prefer vectors outside of a specific ε-neighborhood of a node. This is not ideal, as the vectors inside the ε-cluster will have very few links to each other, so enumerating all vectors in the cluster can be slow.

      A better solution would be to treat such a cluster as only one vector (centroid or whatever, it doesn't matter). And store not a single tref, but a list of trefs per vector.

      This doesn't solve the case when the data set contains many essentially different but very close vectors.

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              serg Sergei Golubchik
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.