Removing items from k-NN index

Based on the FAISS documentation, HNSW does not require training and does not support removing vectors from the index.. So what happens when I remove a document from the index or remove its k-NN vector field? Does the item get removed from the HNSW graph or is it just marked as deleted and remains in the graph, taking space and compute during search? If it stays in the index, does adding and removing items in the long term start populating the graph with deleted documents and the performance keeps dropping even though the number of undeleted documents in the index is not growing as much?

And how is this whole situation for Lucene and nmslib?

In hnswlib v0.7, there was this new feature: Added support for replacing the elements that were marked as delete with newly inserted elements (to control the size of the index. And there is the allow_replace_deleted argument when creating an index that allows this (source). Is this supported by OpenSearch when using nmslib?

I couldn’t find much about Lucene but I found this article which says: Lucene ANN transparently handles deleted documents by skipping over 'tombstones' during the graph search. I assume this means that the deleted documents are still in the graph (similar to FAISS).

What are the best settings for k-NN if I expect lots of deletion and insertion operations? What is the best practice if I want to still have the document in the index but not in the k-NN graph (so k-NN search wouldn’t return it)? Should I update the document and remove the vector field or set it to null? Does that mark it as deleted in the k-NN graph?

I think these questions also apply to updating a document’s vector. Does the previous vector stay in the graph and is there a possibility for the document being returned based on the old vector?

@Navneet Do you happen to know about this topic? :sweat_smile:

Hi @Alireza
Please find my response below.

So what happens when I remove a document from the index or remove its k-NN vector field? Does the item get removed from the HNSW graph or is it just marked as deleted and remains in the graph, taking space and compute during search?

When hit the DELETE API for a document the document is marked as deleted. It is not removed from the index. The document will take space on the disk. This is true for both vector and non vector documents in Opensearch.

If it stays in the index, does adding and removing items in the long term start populating the graph with deleted documents and the performance keeps dropping even though the number of undeleted documents in the index is not growing as much?

There are background merge processes that are triggered which ensure the cleaning of these deleted documents. You can also hit force_merge api with expungedelete=true flag to remove the deleted the documents.

And how is this whole situation for Lucene and nmslib?

This is same for Lucene and Faiss Engine too. The above thing I mentioned is true for any document(that may or may not contain vector field) you ingest in an Opensearch Index.

I assume this means that the deleted documents are still in the graph (similar to FAISS).

Yes this is true.

What are the best settings for k-NN if I expect lots of deletion and insertion operations?

Generally what I have seen is customers hitting force merge apis with only_expunge_deletes flag as true to remove their deleted documents if you cannot wait for background merges to trigger and cleanup your index.

Should I update the document and remove the vector field or set it to null? Does that mark it as deleted in the k-NN graph?

You can do any of this, any update in a document of Opensearch first marks the old document as deleted and then insert a new document. This comes from the fact that Lucene segments(where documents are stored) are immutable.

Does the previous vector stay in the graph and is there a possibility for the document being returned based on the old vector?

Yes the old document will stay in the graph but that deleted document will never be returned back.

1 Like

Thanks for the response @Navneet .

There are background merge processes that are triggered which ensure the cleaning of these deleted documents. You can also hit force_merge api with expungedelete=true flag to remove the deleted the documents.

So the vectors are reindexed into a new HNSW graph? I mean we cannot delete nodes from the graph (at least with FAISS) and unless we make a new graph, the deleted nodes will always be in the graph, taking space and computation.