Need help understanding ANN implementation vs exact KNN

We have around 3.7M documents and we want to implement KNN search.

  1. We tried ANN with the nmslib engine and it works great performance-wise, but its behavior is non-deterministic. Since we have 2 embeddings per document, we are using the following boolean query:
"query": {
    "bool": {
      "should": [
        {
          "knn": {
              "description_vector": {
                "vector": <vector>,
                "k": 10000
              }
          }
        },
       {
          "knn": {
              "tag_vector": {
                "vector": <vector>,
                "k": 10000
              }
          }
        }
      ]
    }
  }

How does this query work? Based on our experiments, this is what we noticed:

  1. Fetch the top K documents from each segment based only on description_vector similarity score
  2. Fetch the top K documents from each segment based only on tag_vector similarity score
  3. Sum the scores for each document to generate the final score
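The steps above can be sketched as follows. This is a toy simulation of our *hypothesis* (unverified) for how the bool/should kNN query combines scores: each knn clause contributes its own top-k list, and a document's final score is the sum of the clause scores in which it appears. Document names and similarity values are made up for illustration.

```python
# Hypothesized top-k result of each knn clause (doc -> similarity score).
desc_topk = {"docA": 0.9, "docB": 0.8, "docX": 0.7}   # top-k by description_vector
tag_topk  = {"docA": 0.6, "docC": 0.5}                # top-k by tag_vector; docX not returned

# Final score = sum of the scores from every clause a document appears in.
final = {}
for clause_results in (desc_topk, tag_topk):
    for doc, score in clause_results.items():
        final[doc] = final.get(doc, 0.0) + score

# docA gets contributions from both clauses; docX only from the first,
# so it can be outranked even if its true combined similarity is higher.
ranked = sorted(final.items(), key=lambda kv: kv[1], reverse=True)
```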

Is the above logic correct? If not, how does it work?

If the above logic is true, then a document X that was included in step 1 might sometimes not be included in step 2. Document X gets missed because it is on the lower end of the similarity scores, and we did observe such behavior in our experiments. To avoid this problem, we set K to the maximum allowed value for each vector, but we have more than 10K documents per segment, so this is not enough to solve our issue. How do we ensure that we always check all documents?
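One way to guarantee that every document is scored is the k-NN plugin's exact search via a scoring script, which scores all documents matched by the inner query instead of traversing an ANN index. Below is a minimal sketch of such a query body, written as a Python dict; the field name mirrors the question, and the `cosinesimil` space type and the placeholder query vector are assumptions.

```python
# Sketch of an exact (brute-force) k-NN query using the k-NN scoring
# script. The match_all inner query means every document is scored, so
# nothing can be "missed" the way an ANN top-k list can miss documents.
exact_query = {
    "size": 10,
    "query": {
        "script_score": {
            "query": {"match_all": {}},          # score all documents
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "description_vector",
                    "query_value": [0.1] * 384,  # placeholder 384-dim query vector
                    "space_type": "cosinesimil", # assumed space type
                },
            },
        }
    },
}
```

The trade-off is that exact scoring touches every matching document, so latency grows with index size; restricting the inner query (instead of `match_all`) is the usual way to keep it manageable.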

  1. We plan on using exact KNN, but given the size of our data (3.7M documents with 2 embeddings per document, 384 dimensions each), we are not sure this would scale. Does exact KNN also use native memory to load all vectors?
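For scale, a back-of-envelope estimate of the raw vector data alone (assuming 4-byte float32 components, and excluding any graph or index overhead an ANN engine would add on top):

```python
# Rough memory footprint of the raw vectors described in the question.
docs = 3_700_000
vectors_per_doc = 2
dims = 384
bytes_per_float = 4  # assuming float32 storage

total_bytes = docs * vectors_per_doc * dims * bytes_per_float
total_gib = total_bytes / 2**30  # roughly 10.6 GiB of raw vector data
```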