Hi cvicentiu, serg, I'm doing some research on the DELETE algorithm for the HNSW index; I have summarized my current findings as follows.
I'll try to create a PoC for option 3, but before diving deep into the implementation, I would like to seek early feedback on the feasibility of the options and any potential concerns regarding the preferred solution.
In addition, should we create a separate Jira issue for the DELETE/UPDATE task?
HNSW UPDATE/DELETE
Many HNSW implementations do not support updating or deleting vectors. When users need to update or delete vectors, they must recreate the whole index, which introduces overly high costs.
The original HNSW paper does not provide any information or guidance regarding updates or deletions.
pgvector supports UPDATE/DELETE as summarized at the end of this comment.
High-level Options:
- Option 1: Mark graph nodes as source-invalid instead of deleting or rebuilding the graph index upon DELETE/UPDATE operations. These invalid nodes can still be used during search or insertion, but they will not be included in the results or added to the neighbor lists of new nodes.
- pros: easier maintenance.
- cons: index size continues to grow, and query speed and recall may degrade if too many updates/deletes occur.
- Option 2: Upon DELETE/UPDATE operations, traverse all nodes in the graph and delete/recreate the related connections.
- pros: minimal impact on recall, and the index is always up-to-date.
- cons: extremely slow process, as it requires traversing all nodes in the graph and rebuilding them for complete cleanup.
- Option 3: Combine Options 1 and 2. [ preferred ]\
Mark records as source-invalid and use them for search but exclude them from results. Update the index following Option 2 only when ANALYZE TABLE is called (investigate the possibility of triggering this cleanup during ANALYZE TABLE).
- pros: minimizes performance impact, and users can update the index only when needed (during ANALYZE TABLE).
- Option 4: Simply do not support index maintenance during DELETE/UPDATE operations. Require a complete index rebuild after UPDATE/DELETE.
- Option 5: Mark graph nodes as source-invalid and simply skip all those nodes as if they do not exist during search or insertion.
- cons: this workaround will undoubtedly hurt recall.
Implementation if Option 3 above is selected:
MariaDB does not have an existing way to identify whether an element in the index points to a valid or an invalid record; MariaDB immediately updates the index when a record is updated or deleted.
- To save the invalid state:\
Option 2 is preferred as it aligns with the high-level index mechanism by using the same secondary table without introducing too much complexity.
- Option 1: Keep a separate list of invalid references. This requires maintaining an additional list and extra logic when searching.
- Option 2: Add a column on the secondary table to mark the index records' state (invalid/valid).
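A toy contrast of the two storage options (all names are illustrative stand-ins, not MariaDB internals): Option 1 consults an extra side structure on every validity check, while Option 2 keeps the flag on the secondary-table row that the traversal has already loaded.

```python
# Option 1: a separate set of invalid references, maintained alongside the index.
invalid_refs = {42}

def is_valid_opt1(ref):
    # every check goes through an additional structure that must be kept in sync
    return ref not in invalid_refs

# Option 2: a state column on the secondary-table row itself.
row = {"ref": 42, "neighbors": [7, 9], "valid": False}

def is_valid_opt2(row):
    # the flag travels with the node row already in hand during traversal
    return row["valid"]
```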
- On DELETE
- load the secondary table
- for each layer from layer_max down to 0, mark the corresponding nodes as source-invalid.
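The DELETE steps above can be sketched with a hypothetical in-memory stand-in for the secondary table (keys, row layout, and the `valid` column are assumptions for illustration):

```python
# Hypothetical secondary table: (ref, layer) -> node row with the Option-2 'valid' column.
table = {
    (7, 2): {"neighbors": [3], "valid": True},
    (7, 1): {"neighbors": [3, 5], "valid": True},
    (7, 0): {"neighbors": [3, 5, 9], "valid": True},
}

def on_delete(table, ref, layer_max):
    """DELETE handling: walk from layer_max down to layer 0 and mark every
    copy of the node as source-invalid.  No links are touched, so the graph
    structure stays intact for later searches."""
    for layer in range(layer_max, -1, -1):
        row = table.get((ref, layer))
        if row is not None:
            row["valid"] = False

on_delete(table, ref=7, layer_max=2)
# every copy of ref 7 is now invalid but still linked into the graph
```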
- On UPDATE
- changes to the graph in the secondary table are similar to INSERT + DELETE
- On SEARCH/INSERT
- use all (invalid and valid) nodes for search.
- do not add invalid nodes to the results list
- do not count invalid nodes toward ef_search or ef_construct
- do not add invalid nodes to the neighbor lists of new nodes
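The SEARCH/INSERT rules above can be sketched as a best-first layer search over a toy graph (graph layout, distance metric, and all names are illustrative assumptions): invalid nodes stay in the candidate frontier, so the graph remains connected through them, but they never enter the result heap, never consume ef slots, and cannot be picked as neighbors for a new node.

```python
import heapq
import math

# Hypothetical one-layer graph: id -> (vector, neighbor ids, valid?).
# Node 1 is source-invalid: it is traversed but never returned.
graph = {
    0: ((0.0, 0.0), [1, 2], True),
    1: ((1.0, 0.0), [0, 3], False),
    2: ((0.0, 1.0), [0, 3], True),
    3: ((1.0, 1.0), [1, 2], True),
}

def search_layer(query, entry, ef):
    """Best-first search over one layer, filtering source-invalid nodes
    out of the results without skipping them during traversal."""
    visited = {entry}
    entry_d = math.dist(query, graph[entry][0])
    candidates = [(entry_d, entry)]            # min-heap of (dist, id)
    results = []                               # max-heap via negated dist
    if graph[entry][2]:                        # only valid nodes count
        results.append((-entry_d, entry))
    while candidates:
        d, node = heapq.heappop(candidates)
        if results and len(results) >= ef and d > -results[0][0]:
            break                              # frontier cannot improve results
        for nb in graph[node][1]:
            if nb in visited:
                continue
            visited.add(nb)
            nd = math.dist(query, graph[nb][0])
            heapq.heappush(candidates, (nd, nb))   # invalid nodes still expanded
            if graph[nb][2]:                       # ...but excluded from results/ef
                heapq.heappush(results, (-nd, nb))
                if len(results) > ef:
                    heapq.heappop(results)         # drop the farthest
    return sorted((-nd, n) for nd, n in results)

hits = search_layer((0.9, 0.1), entry=0, ef=2)
# node 1 is the nearest neighbor but is excluded because it is invalid
```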
- One possible trigger for cleaning up all invalid nodes from the graph could be ANALYZE TABLE.
- I haven't checked this much; it needs more investigation.
On DELETE in pgvector
Term Explanations:
heap: In PostgreSQL, the "heap" is the main storage structure that holds the actual row data of a table.\
TID: In PostgreSQL, TID stands for "Tuple ID". It is a low-level identifier that uniquely identifies a specific row (tuple) within a table's heap (storage).
In the pgvector HNSW index, each element stores not only the TID of the data but also a copy of the vector value.
When the DELETE query is executed:
- The heap tuple is marked as dead, so a later index search can check whether the corresponding row is still valid.
- But at that moment, nothing has changed in the index; I did not find any index interface called during the DELETE query either.
When a row is deleted but before vacuum happens:
- If a search or insert happens, it still uses those to-be-deleted elements in the index for search, but does not add them to the returned results.
- It uses the copy of the vector data stored in the index element to compute distances; it does not read from the heap.
- It uses heaptidsLength to decide whether an element is a valid record to be added to the final results or counted toward ef. (I haven't figured out what triggers heaptidsLength to become 0 when the DELETE query happens.)
When vacuum executes:
- It traverses through all index pages, gets a list of to-be-deleted nodes
- Then it traverses through all index pages again, and for each node, it checks if its neighbor list contains any to-be-deleted nodes.
- if not, skip it
- if so, empty the neighbor list and rebuild it, similar to a new insert
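The two vacuum passes above can be sketched on a toy flat graph (the reselect helper and all names are illustrative stand-ins for pgvector's insert-time neighbor selection, not its actual code):

```python
# Hypothetical flat view of the graph: node id -> set of neighbor ids.
graph = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4},
    4: {2, 3},
}
to_delete = {4}   # result of pass 1: nodes found deletable

def naive_reselect(graph, node_id, m=2):
    # Illustrative stand-in: reconnect to the m lowest remaining ids.  A real
    # implementation would redo distance-based neighbor selection, as on insert.
    return set(sorted(n for n in graph if n != node_id)[:m])

def repair(graph, to_delete, reselect=naive_reselect):
    """Pass 2 of a vacuum-style cleanup: drop deleted nodes, then for every
    surviving node whose neighbor list touches a deleted node, empty the
    list and rebuild it as if the node were inserted anew; untouched
    neighbor lists are skipped."""
    for node_id in to_delete:
        graph.pop(node_id, None)
    for node_id in list(graph):
        if graph[node_id] & to_delete:
            graph[node_id] = reselect(graph, node_id)

repair(graph, to_delete)
# no surviving node references the deleted node any more
```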
Insert and search are now in a functioning state, although some refactoring is needed.