Details
-
Task
-
Status: Open (View Workflow)
-
Major
-
Resolution: Unresolved
-
None
-
None
Description
HNSW doesn't work well when the data set contains many identical (or nearly identical) vectors. It tends to cross-link them together and if their number if larger than 2*M, it will create no outbound links to the rest of the graph.
Lots of duplicate vectors are uncommon, but can easily happen in MariaDB with system versioned tables, for example. And they surely show up in tests.
Almost-duplicates come from duplicate vectors that became slightly different because of floating point calculation errors. That is, essentially they're duplicates.
Currently this is solved by having select_neighbor() heuristic to prefer vectors outside of a specific ε-neighborhood of a node. This is not ideal, as the vectors inside the ε-cluster will have very few links to each other, so enumerating all vectors in the cluster can be slow.
A better solution would be to treat such a cluster as only one vector (centroid or whatever, it doesn't matter). And store not a single tref, but a list of trefs per vector.
This doesn't solve the case when the data set contains many essentially different but very close vectors.
Attachments
Issue Links
- relates to
-
MDEV-32887 vector search
- Stalled