[MDEV-35418] Better support for multiple duplicates vectors in the index - Jira

XML

Word

Printable

Details

Type: Task
Status: Open (View Workflow)
Priority: Minor
Resolution: Unresolved
Fix Version/s: None
Component/s: Vector search
Labels:
None

Description

HNSW doesn't work well when the data set contains many identical (or nearly identical) vectors. It tends to cross-link them together and if their number if larger than 2*M, it will create no outbound links to the rest of the graph.

Lots of duplicate vectors are uncommon, but can easily happen in MariaDB with system versioned tables, for example. And they surely show up in tests.

Almost-duplicates come from duplicate vectors that became slightly different because of floating point calculation errors. That is, essentially they're duplicates.

Currently this is solved by having select_neighbor() heuristic to prefer vectors outside of a specific ε-neighborhood of a node. This is not ideal, as the vectors inside the ε-cluster will have very few links to each other, so enumerating all vectors in the cluster can be slow.

A better solution would be to treat such a cluster as only one vector (centroid or whatever, it doesn't matter). And store not a single tref, but a list of trefs per vector.

This doesn't solve the case when the data set contains many essentially different but very close vectors.

Attachments

Issue Links

relates to

MDEV-32887 Vector Search

Closed

Activity

People

Assignee:: Unassigned

Reporter:: Sergei Golubchik

Votes:: 0 Vote for this issue

Watchers:: 4 Start watching this issue

Dates

Created:: 2024-11-14 10:29

Updated:: 2025-01-27 16:40

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.