[BUG] Insufficient number of hits for nested knn queries with efficient filter #2347

Hello :wave:

What is the bug?

We use an index to store text documents for semantic search. Because the text is long, we split it into paragraphs and embed each chunk with the all-MiniLM-L6-v2 model. Each chunk is stored in a nested field of the document.
Each document also has an accountId attribute that we use when querying (efficient filtering).

Then we run approximate k-NN queries with the Lucene HNSW engine.

Based on the documentation, I expect a knn query on this nested field with an efficient filter to return at least n hits, where n is the minimum of k and the number of documents that match the efficient filter.

But for some specific input vectors or query_text values, we get fewer than n hits, and sometimes even 0. For the same filter with a different query, we get the correct n hits.

We have two other indices without a nested field (only one vector per document) that use the same efficient filter, and they work as expected.

This seems similar to "[BUG] Filter on Parent Doc fields inside Nested knn query fails for many Query types" (Issue #2222, opensearch-project/k-NN) or "[BUG] OpenSearch 2.17 K-NN efficient filtering with a Date Range Filter No Results" (Issue #2339, opensearch-project/k-NN), except that here the efficient filter is as simple as a term filter.

How can one reproduce the bug?

The error happens only on specific queries, so it's hard to reproduce.

Here is the mapping of the index:

{
  "knowledge-index": {
    "mappings": {
      "properties": {
        "accountId": {
          "type": "keyword"
        },
        "id": {
          "type": "keyword"
        },
        "metadata": {
          "type": "text"
        },
        "metadataEmbedding": {
          "type": "nested",
          "properties": {
            "knn": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            }
          }
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

Here is the query:

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "from": 0,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "metadataEmbedding",
      "query": {
        "neural": {
          "metadataEmbedding.knn": {
            "query_text": "<query_text>",
            "model_id": "9QxR8YsBSCN1wquQEH2b",
            "k": <k>,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}

For k = 38, I get 6 hits:

  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.50342417,

But for k = 1000 I get 32 hits, and for k = 10000 (the maximum value) I get 232 hits.

For another query_text value, the results are different: the number of hits always equals k (or the number of documents that match the filter, which is 232).
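
To make the gap explicit, here is a minimal Python sketch that compares the hit counts reported above with the expectation n = min(k, number of documents matching the filter). The function names `expected_hits` and `shortfalls` are hypothetical helpers for illustration only; the numbers are the ones quoted in this report.

```python
# Hit counts reported above for the problematic query_text, keyed by k.
observed = {38: 6, 1000: 32, 10000: 232}

# Number of documents that match the accountId filter, as reported above.
matching_docs = 232


def expected_hits(k: int, matching: int) -> int:
    """Expected hit count: the smaller of k and the number of filter matches."""
    return min(k, matching)


def shortfalls(observed_hits: dict, matching: int) -> dict:
    """For each k, how many hits are missing relative to the expectation."""
    return {
        k: expected_hits(k, matching) - hits
        for k, hits in observed_hits.items()
    }


if __name__ == "__main__":
    for k, missing in shortfalls(observed, matching_docs).items():
        print(f"k={k}: expected {expected_hits(k, matching_docs)}, "
              f"got {observed[k]}, missing {missing}")
```

Under this expectation, only k = 10000 returns the full set of matching documents; the smaller values of k fall short.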

I get the same results when I first convert the text to a vector and then pass the vector directly, without the neural clause:

POST /_plugins/_ml/_predict/text_embedding/9QxR8YsBSCN1wquQEH2b
{
  "text_docs": ["<query_text>"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 5,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
              ....
            ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
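
For anyone who wants to sweep k while keeping everything else fixed, here is a small Python sketch that builds the same request body as the query above. `build_nested_knn_query` is a hypothetical helper, not part of any OpenSearch client; it only produces the JSON body, and by default it sets size equal to k (an assumption made here so that size never truncates the result). The zero vector is a placeholder and must be replaced with the real 384-dimensional embedding.

```python
import json


def build_nested_knn_query(vector, k, account_id, size=None):
    """Build the nested knn search body with a term filter on accountId.

    If size is not given, it defaults to k so that the number of
    returned documents is not capped below k.
    """
    return {
        "size": k if size is None else size,
        "_source": {"excludes": ["metadataEmbedding"]},
        "query": {
            "nested": {
                "path": "metadataEmbedding",
                "query": {
                    "knn": {
                        "metadataEmbedding.knn": {
                            "vector": list(vector),
                            "k": k,
                            "filter": {"term": {"accountId": account_id}},
                        }
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    placeholder_vector = [0.0] * 384  # placeholder, not a real embedding
    for k in (38, 1000, 10000):
        body = build_nested_knn_query(placeholder_vector, k, "<account_id>")
        # Print a compact summary; send json.dumps(body) to _search as needed.
        print(k, body["size"], len(json.dumps(body)))
```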

What is the expected behavior?

Getting n hits, where n is the minimum of k and the number of documents that match the efficient filter.

What is your host/environment?

  • OpenSearch version: 2.17.1

Any idea what the issue could be here? Am I right to expect k hits for nested fields with an efficient filter?

Thanks for your help.

@corentin I can see that in some of your queries the k and size values are different. Did you run a test where k and size have exactly the same value?

Thanks @Navneet! I just did and got the same issue. In my understanding, the number of hits should be k (whatever size is) and the number of returned documents should be size. But I have the same issue when k = size.

Thanks for your help :pray: