According to the FAISS documentation,
HNSW does not require training and does not support removing vectors from the index. So what happens when I delete a document from the index, or remove its k-NN vector field? Is the item actually removed from the HNSW graph, or is it only marked as deleted and left in the graph, where it still takes up space and compute during search? If it stays in the graph, does repeatedly adding and removing items gradually fill the graph with deleted documents, so that performance keeps dropping even though the number of live documents in the index is barely growing?
And how do Lucene and nmslib behave in this situation?
hnswlib v0.7 introduced this new feature:
Added support for replacing the elements that were marked as delete with newly inserted elements (to control the size of the index).
There is also an allow_replace_deleted argument, passed when creating an index, that enables this (source). Is this supported by OpenSearch when using nmslib?
I couldn’t find much about Lucene, but I found this article, which says:
Lucene ANN transparently handles deleted documents by skipping over 'tombstones' during the graph search.
I assume this means that the deleted documents are still in the graph (similar to FAISS).
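To make that assumption concrete, here is a toy sketch in plain Python of what I understand "skipping over tombstones" to mean. It is not Lucene, FAISS, or nmslib code: the `search` function, the five-node chain graph, and the exhaustive traversal are all made up for illustration (a real HNSW search stops early and uses multiple layers). The point is only that a deleted node is still visited and still costs a distance computation, and is merely filtered out of the results.

```python
import heapq

def search(vectors, neighbors, deleted, query, entry, k):
    """Best-first traversal of a toy proximity graph with tombstones.

    Hypothetical illustration only: deleted nodes are still used for
    navigation, but they are never added to the result list.
    Returns (ids of the k nearest live nodes, number of nodes visited).
    """
    def dist(i):
        # Squared Euclidean distance between node i and the query.
        return sum((a - b) ** 2 for a, b in zip(vectors[i], query))

    visited = {entry}
    frontier = [(dist(entry), entry)]
    results = []  # live (non-deleted) nodes only
    while frontier:
        d, node = heapq.heappop(frontier)
        if node not in deleted:
            results.append((d, node))
        # Expand neighbors even when `node` is a tombstone.
        for nb in neighbors[node]:
            if nb not in visited:
                visited.add(nb)
                heapq.heappush(frontier, (dist(nb), nb))
    results.sort()
    return [n for _, n in results[:k]], len(visited)

# Five points on a line, linked as a chain: 0 - 1 - 2 - 3 - 4.
vectors = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
deleted = {2}  # node 2 is a tombstone

ids, visited_count = search(vectors, neighbors, deleted,
                            query=(2.1,), entry=0, k=2)
print(ids, visited_count)
```

In this toy graph, node 2 is the closest point to the query and the only link between nodes 1 and 3, so the search has to walk through it even though it can never be returned. If the real engines behave like this, every tombstone keeps costing memory and distance computations until the graph is rebuilt.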
What are the best settings for k-NN if I expect lots of deletions and insertions? And what is the best practice if I want to keep a document in the index but not in the k-NN graph (so that k-NN search doesn’t return it)? Should I update the document and remove the vector field, or set it to null? Does that mark it as deleted in the k-NN graph?
I think these questions also apply to updating a document’s vector. Does the previous vector stay in the graph, and is it possible for the document to be returned based on the old vector?