[BUG] Insufficient number of hits for nested knn queries with efficient filter #2347

corentin · December 20, 2024, 3:24pm

Hello

What is the bug?

We use an index to store text documents for semantic search. Because the text is long, we split it into paragraphs and embed each one with the all-MiniLM-L6-v2 model. Each chunk is stored in a nested field of the document.
Each document also has an accountId attribute that we filter on when querying (efficient filtering).

We then run approximate k-NN queries using the Lucene HNSW engine.

Based on the documentation, I expect a knn query on this nested field with an efficient filter to return at least n hits, where n is the minimum of k and the number of documents that match the efficient filter.

But for some specific input vectors or query_text values, we get fewer than n hits, and sometimes even 0. For the same filter with a different query, we get the correct n hits.

We have two other indices without a nested field (only one vector per document) that use the same efficient filter, and they work as expected.

This seems similar to [BUG] Filter on Parent Doc fields inside Nested knn query fails for many Query types · Issue #2222 · opensearch-project/k-NN · GitHub or [BUG] OpenSearch 2.17 K-NN efficient filtering with a Date Range Filter No Results · Issue #2339 · opensearch-project/k-NN · GitHub, except that here the efficient filter is just a simple term filter.

How can one reproduce the bug?

The error only happens for specific queries, so it is hard to reproduce.

Here is the mapping of the index:

{
  "knowledge-index": {
    "mappings": {
      "properties": {
        "accountId": {
          "type": "keyword"
        },
        "id": {
          "type": "keyword"
        },
        "metadata": {
          "type": "text"
        },
        "metadataEmbedding": {
          "type": "nested",
          "properties": {
            "knn": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            }
          }
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

Here is the query:

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "from": 0,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "metadataEmbedding",
      "query": {
        "neural": {
          "metadataEmbedding.knn": {
            "query_text": "<query_text>",
            "model_id": "9QxR8YsBSCN1wquQEH2b",
            "k": <k>,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}

For k = 38, I get 6 hits:

  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.50342417,

But for k = 1000 I get 32 hits, and for k = 10000 (the maximum value) I get 232 hits.

For another query_text value, the results are different: the number of hits always equals k (or 232, the number of documents that match the filter).

I get the same results when I first convert the text to a vector and use that vector directly in a knn query instead of the neural query:

POST /_plugins/_ml/_predict/text_embedding/9QxR8YsBSCN1wquQEH2b
{
  "text_docs": ["<query_text>"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 5,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
                 ....
             ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
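
To compare the number of hits against the number of documents that match the filter, it can help to build both request bodies from the same filter. The following is a minimal sketch in Python, not part of the original report: the function names are hypothetical, the 384-dimension dummy vector is only a placeholder, and actually sending the bodies (to `_search` and `_count` on the index) is left out.

```python
import json

INDEX = "knowledge-index"  # index name taken from the mapping above


def build_filter(account_id):
    """Term filter on accountId, as used in the queries above."""
    return {"term": {"accountId": account_id}}


def build_nested_knn_body(vector, k, account_id, size=None):
    """Body for GET /<index>/_search: nested knn query with an efficient filter."""
    knn_clause = {
        "vector": list(vector),
        "k": k,
        "filter": build_filter(account_id),
    }
    body = {
        "_source": {"excludes": ["metadataEmbedding"]},
        "query": {
            "nested": {
                "path": "metadataEmbedding",
                "query": {"knn": {"metadataEmbedding.knn": knn_clause}},
            }
        },
    }
    if size is not None:
        body["size"] = size
    return body


def build_count_body(account_id):
    """Body for GET /<index>/_count: counts documents matching the same filter."""
    return {"query": build_filter(account_id)}


if __name__ == "__main__":
    dummy_vector = [0.0] * 384  # placeholder only; use the real embedding
    search_body = build_nested_knn_body(dummy_vector, k=38, account_id="<account_id>", size=38)
    print(json.dumps(build_count_body("<account_id>"), indent=2))
    print("k =", search_body["query"]["nested"]["query"]["knn"]["metadataEmbedding.knn"]["k"])
```

Reusing one filter builder for both bodies keeps the count and the knn search aligned, so any gap between the two numbers cannot come from a mismatch in the filter itself.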

What is the expected behavior?

Getting n hits, where n is the minimum of k and the number of documents that match the efficient filter.
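
As a worked check of that expectation against the numbers reported above (232 documents match the filter; k = 38, 1000 and 10000 returned 6, 32 and 232 hits), here is a small Python sketch. The helper names are hypothetical and the figures are only the ones quoted in this post.

```python
def expected_hits(k, matching_docs):
    """Expected hit count: the smaller of k and the number of matching documents."""
    return min(k, matching_docs)


MATCHING_DOCS = 232  # documents matching the accountId filter, as reported
REPORTED = [(38, 6), (1000, 32), (10000, 232)]  # (k, observed hits), as reported


def shortfalls(observations, matching_docs):
    """Return (k, expected, observed, missing) for each observation."""
    rows = []
    for k, observed in observations:
        expected = expected_hits(k, matching_docs)
        rows.append((k, expected, observed, expected - observed))
    return rows


if __name__ == "__main__":
    for k, expected, observed, missing in shortfalls(REPORTED, MATCHING_DOCS):
        print(f"k={k}: expected {expected}, observed {observed}, missing {missing}")
```

With these figures, only the k = 10000 run reaches the expected count; the k = 38 and k = 1000 runs fall short by 32 and 200 hits respectively.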

What is your host/environment?

OpenSearch version: 2.17.1

Any idea what the issue could be here? Am I right to expect k hits for nested fields with an efficient filter?

Thanks for your help.

Navneet · December 20, 2024, 4:53pm

@corentin I can see that in some of your queries the k and size values are different. Did you run a test where k and size have exactly the same value?

corentin · December 20, 2024, 5:17pm

Thanks @Navneet! I just did, and I get the same issue. In my understanding, the number of hits should be k (whatever size is) and the number of returned documents should be size. But I have the same issue when k = size.
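
To keep the two numbers discussed here apart (the reported hit total versus the documents actually returned), a response can be summarized as in the Python sketch below. The helper names are hypothetical, and the sample response is a stand-in built from the totals quoted earlier with placeholder hit entries, not real output.

```python
def summarize_hits(response):
    """Extract the reported total and the number of returned documents."""
    hits = response.get("hits", {})
    total = hits.get("total", {})
    return {
        "total": total.get("value"),
        "relation": total.get("relation"),
        "returned": len(hits.get("hits", [])),
    }


def is_short(response, k, matching_docs):
    """True when the reported total is below min(k, matching_docs)."""
    return summarize_hits(response)["total"] < min(k, matching_docs)


# Hypothetical response: total of 6 as reported for k = 38, with 5 placeholder
# entries standing in for the documents returned when size = 5.
SAMPLE_RESPONSE = {
    "hits": {
        "total": {"value": 6, "relation": "eq"},
        "max_score": 0.50342417,
        "hits": [{"_id": f"placeholder-{i}"} for i in range(5)],
    }
}

if __name__ == "__main__":
    print(summarize_hits(SAMPLE_RESPONSE))
    print("short of expectation:", is_short(SAMPLE_RESPONSE, k=38, matching_docs=232))
```

Checking `total` rather than the length of the returned list avoids confusing a small size with a genuine shortage of hits.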

Thanks for your help

system · February 18, 2025, 5:17pm

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.
