Hello
What is the bug?
We use an index to store text documents for semantic search. Because the text is long, we chunk it into paragraphs and embed each chunk with the all-MiniLM-L6-v2 model. Each chunk's embedding is stored in a nested field of the document.
Each document also has an accountId attribute that we filter on when querying (efficient filtering).
We then run approximate k-NN queries with Lucene HNSW.
Based on this documentation:
- k-NN search with nested fields
- k-NN search with filters
- and mostly Enhanced multi-vector support for OpenSearch k-NN search with nested fields
I expect a k-NN query on this nested field with an efficient filter to return at least n hits, where n is the minimum of k and the number of documents that match the filter.
But for some specific input vectors or query_text values, we get fewer than n hits, and sometimes even 0. With the same filter and a different query, we get the correct n hits.
We have two other indices without a nested field (only one vector per document) that use the same efficient filter, and they work as expected.
This seems similar to "[BUG] Filter on Parent Doc fields inside Nested knn query fails for many Query types · Issue #2222 · opensearch-project/k-NN · GitHub" or "[BUG] OpenSearch 2.17 K-NN efficient filtering with a Date Range Filter No Results · Issue #2339 · opensearch-project/k-NN · GitHub", except that here the efficient filter is just a simple term filter.
How can one reproduce the bug?
The error only happens on specific queries, so it's hard to reproduce.
Here is the mapping of the index:
{
  "knowledge-index": {
    "mappings": {
      "properties": {
        "accountId": {
          "type": "keyword"
        },
        "id": {
          "type": "keyword"
        },
        "metadata": {
          "type": "text"
        },
        "metadataEmbedding": {
          "type": "nested",
          "properties": {
            "knn": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            }
          }
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}
Here is the query:
GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "from": 0,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "metadataEmbedding",
      "query": {
        "neural": {
          "metadataEmbedding.knn": {
            "query_text": "<query_text>",
            "model_id": "9QxR8YsBSCN1wquQEH2b",
            "k": <k>,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
For k = 38, I get 6 hits:
"hits": {
  "total": {
    "value": 6,
    "relation": "eq"
  },
  "max_score": 0.50342417,
But for k = 1000 I get 32 hits, and for k = 10000 (the maximum value) I get 232 hits.
For another query_text value, the results are different: the number of hits always equals k (or 232, the total number of documents that match the filter).
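To make the gap concrete, here is a minimal Python sketch that compares the hit counts reported above with the n I expect (the minimum of k and the number of documents matching the filter). The function names `expected_hits` and `shortfall` are hypothetical helpers written for this post, not part of any OpenSearch client, and the expectation itself is my reading of the documentation rather than a confirmed guarantee.

```python
def expected_hits(k: int, matching_docs: int) -> int:
    """Expected hit count as stated in this post: n = min(k, docs matching the filter)."""
    return min(k, matching_docs)


def shortfall(k: int, matching_docs: int, observed_hits: int) -> int:
    """Number of hits missing relative to the expected n (0 means as expected)."""
    return max(0, expected_hits(k, matching_docs) - observed_hits)


# Figures reported in this post: 232 documents match the accountId filter.
MATCHING_DOCS = 232
observations = {38: 6, 1000: 32, 10000: 232}  # k -> observed hits.total.value

for k, observed in observations.items():
    print(
        f"k={k}: expected {expected_hits(k, MATCHING_DOCS)}, "
        f"observed {observed}, missing {shortfall(k, MATCHING_DOCS, observed)}"
    )
```

With the problematic query_text, only k = 10000 reaches the expected count.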
I get the same results when I first convert the text to a vector and pass the vector directly, without the neural clause:
POST /_plugins/_ml/_predict/text_embedding/9QxR8YsBSCN1wquQEH2b
{
  "text_docs": ["<query_text>"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 5,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
              ....
            ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
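For anyone who wants to sweep k while trying to reproduce this, the request body above can be generated programmatically. This is only a sketch: `build_nested_knn_query` is a hypothetical helper, the zero vector is a placeholder (not the embedding that triggers the problem), and sending the body to the cluster with an HTTP or OpenSearch client is left out.

```python
import json
from typing import Any


def build_nested_knn_query(
    vector: list[float], k: int, account_id: str, size: int = 5
) -> dict[str, Any]:
    """Build the same nested k-NN request body as above, with a term filter on accountId."""
    return {
        "size": size,
        "_source": {"excludes": ["metadataEmbedding"]},
        "query": {
            "nested": {
                "path": "metadataEmbedding",
                "query": {
                    "knn": {
                        "metadataEmbedding.knn": {
                            "vector": vector,
                            "k": k,
                            # Efficient filter: evaluated as part of the k-NN query.
                            "filter": {"term": {"accountId": account_id}},
                        }
                    }
                },
            }
        },
    }


# Placeholder 384-dimensional vector, matching the mapping's dimension.
placeholder_vector = [0.0] * 384

# One request body per k value tried in this post.
bodies = {
    k: build_nested_knn_query(placeholder_vector, k, "<account_id>")
    for k in (38, 1000, 10000)
}
print(json.dumps(bodies[38]["query"]["nested"]["query"]["knn"]["metadataEmbedding.knn"]["filter"]))
```

Only `k` changes between the generated bodies, so any difference in hit counts comes from k alone.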
What is the expected behavior?
Getting n hits, where n is the minimum of k and the number of documents that match the efficient filter.
What is your host/environment?
- OpenSearch version: 2.17.1
Any idea what could be the issue here? Am I right to expect k hits for nested fields with an efficient filter?
Thanks for your help.