Question about k-NN and OpenSearch shards

Hello OpenSearch Community,

I hope you’re all doing well. I’ve been delving into the mechanics of k-NN search in an OpenSearch index setup that we have, and I’ve come across an interesting challenge that I’d appreciate some insight on.

Our current index configuration uses "number_of_shards": 4, and for our k-NN searches, we have set k=40. If I’ve understood the documentation correctly, this means that for each query, each shard would return its top 40 most relevant documents based on the specified vector similarity.

This brings me to a potential concern: Consider a scenario where the 41st nearest neighbor in shard1 is, in fact, more relevant than, say, the 20th nearest neighbor from shard2. Given our current configuration, the coordinating node would never see the 41st result from shard1. This could potentially lead to us missing out on more relevant results.

A related question: what will SELECT kNN LIMIT L return when L > k? This doesn’t throw any errors, and can indeed sometimes return L results.

I would be grateful if someone could shed light on this behavior. Are there recommended strategies to address this in a multi-shard setup with k-NN?

Thank you for your time and expertise!

Best regards

This is a great question. Let me take a stab at answering it. Taking your use case of 4 shards and k=40, each shard returns its top 40 (if size=40) most relevant docs after looking at all of its segments.

So if the 41st document on shard1 is more relevant than the 20th document on shard2, then yes, that 41st document will not reach the coordinating node. But that 20th result will never be passed to the customer either: shard1's top 40 all outrank it, so it falls outside the final top 40. So, in a way, the search response still contains the top 40 (size=40) most similar documents.

I hope that answers the first part of the question.

Recommendation: when doing vector search, keep the size parameter and k the same.

FYI, this is the same behavior as for text search. There, k is replaced by size.
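To make the argument above concrete, here is a small Python simulation. It is only a sketch under simplifying assumptions: each shard is modeled as returning its exact top k (real approximate k-NN can miss neighbors inside a shard for other reasons), size equals k, and the scores and document ids are randomly generated for illustration.

```python
import heapq
import random

def shard_top_k(shard_hits, k):
    """Return the k best (score, doc_id) pairs of one shard, best first."""
    return heapq.nlargest(k, shard_hits)

def coordinator_merge(per_shard_hits, size):
    """Merge the per-shard lists and keep the overall top `size` hits."""
    merged = [hit for hits in per_shard_hits for hit in hits]
    return heapq.nlargest(size, merged)

random.seed(0)
num_shards, docs_per_shard, k = 4, 1000, 40

# Hypothetical data: every document gets a random similarity score.
shards = [
    [(random.random(), f"shard{s}-doc{d}") for d in range(docs_per_shard)]
    for s in range(num_shards)
]

# What the coordinator sees: only the top k from each shard.
distributed = coordinator_merge([shard_top_k(s, k) for s in shards], size=k)

# Ground truth: top k over all documents, ignoring shard boundaries.
global_top = heapq.nlargest(k, [hit for s in shards for hit in s])

print(distributed == global_top)
```

Under these assumptions the two lists match: a document in the global top 40 has at most 39 better documents anywhere, so it is also in the top 40 of its own shard.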

What SELECT kNN LIMIT L will return when L > k. This doesn’t throw any errors, and can indeed sometimes return L results.

Yes, it will not throw an error. This is a similar case. The way to think about it is that every segment of a shard returns k documents; at the shard level we take the top documents (up to the size value) and send them to the coordinator. So if there are multiple shards and segments, you can get L documents even when L > K. If there is 1 shard and 1 segment, you will get only K documents, even if size > K.

Let me know if you have further questions or if anything is still unclear. I remember answering the same question on GitHub too; let me find it and attach it for reference.
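The counting described above can be sketched in Python. This is a simplified model of the description in this reply, not OpenSearch's actual implementation: each segment contributes at most k candidates, each shard forwards at most size of them, and the coordinator keeps at most size overall. The scores are made up.

```python
import heapq

def search(shards, k, size):
    """shards: list of shards; each shard is a list of segments;
    each segment is a list of similarity scores."""
    forwarded = []
    for segments in shards:
        # Every segment contributes at most k candidates.
        candidates = [s for seg in segments for s in heapq.nlargest(k, seg)]
        # The shard keeps at most `size` of them for the coordinator.
        forwarded.extend(heapq.nlargest(size, candidates))
    # The coordinator returns at most `size` hits overall.
    return heapq.nlargest(size, forwarded)

segment = [i / 100 for i in range(100)]  # 100 hypothetical scores

one_shard_one_segment = [[segment]]
four_shards_three_segments = [[segment] * 3 for _ in range(4)]

# k=5, size=20: a single segment can only supply 5 hits.
print(len(search(one_shard_one_segment, k=5, size=20)))
# 4 shards x 3 segments x 5 candidates is enough to fill size=20.
print(len(search(four_shards_three_segments, k=5, size=20)))
```

In this model the number of hits is bounded by min(size, k × total segments), which is why a request with L > K can still come back with L results on a multi-shard, multi-segment index.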

Thanks @Navneet for your help

Hello everyone,

Thank you so much for your support and assistance!

I would like to add some more questions about the behaviour of search queries of the form

SELECT kNN(input)

We are also using

"min_score": 0

as a default setting. I am running experiments with and without that setting, and the results are the same.

However, we are having a hard time understanding the result size that these queries return when executed against our OpenSearch cluster. We are using OpenSearch version 2.7.

Our cluster is configured to have 4 shards, and currently has about 950,000 documents in the index that we are querying. We therefore expect that whatever our query input is, we will always find a sufficient number of documents.

However, regardless of the values of k that we use, we are always getting 10 documents in the result. If k is 30 we are getting 10 results. If k is 5 I am also getting 10 results. Why could that be happening?

Appreciate your time and help,

@krum.bakalsky, it’s possible you are not specifying the “size” attribute in the query. By default OpenSearch uses size: 10. You should use “size” to determine how many neighbors you want in the end. “k” determines how many neighbors to pull at the segment level for evaluation.
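As an illustration, a request body that sets both values might be built like this in Python. The helper name and the field name my_vector are made up for this example; the body follows the k-NN query DSL, and it would be sent to the index's _search endpoint with whatever client you use.

```python
import json

def build_knn_query(field, vector, k, size=None):
    """Build a k-NN search body; size defaults to k, per the advice above."""
    return {
        # Without an explicit size, OpenSearch returns 10 hits.
        "size": k if size is None else size,
        "query": {
            "knn": {
                field: {
                    "vector": list(vector),
                    "k": k,
                }
            }
        },
    }

# Hypothetical field name and a toy 3-dimensional query vector.
body = build_knn_query("my_vector", [0.1, 0.2, 0.3], k=30)
print(json.dumps(body, indent=2))
```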

Oh, that must be it! Tons of thanks, @vamshin , this solved it! :pray:

@Navneet thanks for your response. I would like to add a few points based on @eric.utrera's question.

Turns out this can be a problem, and it definitely was in our case, because we had custom ranking logic on top of the retrieved documents. Since the number of documents retrieved is actually K × shards for the lucene engine and K × segments for nmslib, missing out on the 41st neighbor from shard1 impacted our overall ranking.

To avoid this we tried increasing K to the maximum allowed, which is 10000, but a few neighbors would still be left out because we have more than 10000 documents in each segment.

For this reason, we are now thinking of switching to exact k-NN with a scoring script instead.

Do you have any suggestions on what we could do if we want to ensure that the 41st neighbor is prioritized over the other 20 neighbors from a different shard?
Is there a way to have more control over how documents are distributed across shards or segments?
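For reference, a sketch of what such an exact k-NN request body could look like, built in Python. The helper name and the field name my_vector are made up for this example, and the space type is assumed to be l2; it should match how the vectors are meant to be compared.

```python
import json

def build_exact_knn_query(field, vector, size, space_type="l2"):
    """Build a script_score body that scores every matched doc exactly."""
    return {
        "size": size,
        "query": {
            "script_score": {
                # Inner query selects which docs get scored; match_all
                # scores the whole index, a filter would narrow it.
                "query": {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": field,
                        "query_value": list(vector),
                        "space_type": space_type,
                    },
                },
            }
        },
    }

# Hypothetical field name and a toy 3-dimensional query vector.
exact_body = build_exact_knn_query("my_vector", [0.1, 0.2, 0.3], size=40)
print(json.dumps(exact_body, indent=2))
```

Because every matched document is scored, there is no k cutoff per segment, but the cost grows with the number of documents the inner query matches.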