Hello
What is the bug?
We use an index to store text documents for semantic search. Because the text is long, we split it into paragraphs and embed each chunk with the all-MiniLM-L6-v2 model. Each chunk embedding is stored in a nested field of the document (metadataEmbedding in the mapping below).
Each document also has an account_id attribute (accountId in the mapping) that we use for efficient filtering at query time.
We then run approximate k-NN queries with the Lucene HNSW engine.
Based on these documentation pages:
- k-NN search with nested fields
- k-NN search with filters
- and mostly Enhanced multi-vector support for OpenSearch k-NN search with nested fields
I expect a k-NN query on this nested field with an efficient filter to return at least n hits, where n is the minimum of k and the number of documents that match the filter.
But for some specific input vectors or query_text values, we get fewer than n hits, and sometimes even 0. For the same filter with a different query, we get the correct n hits.
We have two other indices without a nested field (only one vector per document) that use the same efficient filter, and they work as expected.
This seems similar to [BUG] Filter on Parent Doc fields inside Nested knn query fails for many Query types · Issue #2222 · opensearch-project/k-NN · GitHub or [BUG] OpenSearch 2.17 K-NN efficient filtering with a Date Range Filter No Results · Issue #2339 · opensearch-project/k-NN · GitHub, except that here the efficient filter is just a simple term filter.
How can one reproduce the bug?
The error only happens on specific queries, so it's hard to reproduce.
Here is the mapping of the index:
{
  "knowledge-index": {
    "mappings": {
      "properties": {
        "accountId": {
          "type": "keyword"
        },
        "id": {
          "type": "keyword"
        },
        "metadata": {
          "type": "text"
        },
        "metadataEmbedding": {
          "type": "nested",
          "properties": {
            "knn": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            }
          }
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}
Here is the query:
GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "from": 0,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "metadataEmbedding",
      "query": {
        "neural": {
          "metadataEmbedding.knn": {
            "query_text": "<query_text>",
            "model_id": "9QxR8YsBSCN1wquQEH2b",
            "k": <k>,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
For k = 38, I get 6 hits:
"hits": {
  "total": {
    "value": 6,
    "relation": "eq"
  },
  "max_score": 0.50342417,
But for k = 1000 I get 32 hits, and for k = 10000 (the maximum value) I get 232 hits.
For another query_text value, I get different results: the number of hits always equals k (or the number of documents that match the filter, which is 232).
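To make the gap concrete, here is a minimal Python sketch that compares the hit counts reported above with the count I expect, n = min(k, matching documents). The function names are made up for illustration; the numbers are the ones from my runs.

```python
# Number of documents matching the accountId filter for this account (from the runs above).
MATCHING_DOCS = 232

# k -> hits.total.value observed for the problematic query_text.
observed = {38: 6, 1000: 32, 10000: 232}


def expected_hits(k: int, matching_docs: int) -> int:
    """Expected hit count: the minimum of k and the documents matching the filter."""
    return min(k, matching_docs)


def shortfalls(observed_hits: dict, matching_docs: int) -> dict:
    """Map each k to a tuple (expected, observed, missing)."""
    result = {}
    for k, hits in observed_hits.items():
        expected = expected_hits(k, matching_docs)
        result[k] = (expected, hits, expected - hits)
    return result


for k, (expected, got, missing) in sorted(shortfalls(observed, MATCHING_DOCS).items()):
    print(f"k={k}: expected {expected}, got {got}, missing {missing}")
```

Only the largest k returns every document that matches the filter; the two smaller values fall short of the expected count.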
I get the same results when I first convert the text to a vector and then use that vector directly, without the neural query:
POST /_plugins/_ml/_predict/text_embedding/9QxR8YsBSCN1wquQEH2b
{
  "text_docs": ["<query_text>"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}
GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 5,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
              ....
            ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
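If it helps anyone trying to reproduce this, the request body above can be generated for several values of k with a small Python sketch like the one below. The helper name, the placeholder vector, and the account id are hypothetical; only the field names and the query structure come from the query above. It builds the body only and does not send anything to the cluster.

```python
import json


def build_nested_knn_query(vector, k, account_id, size=5):
    """Build the nested k-NN search body with a term filter on accountId."""
    return {
        "size": size,
        "_source": {"excludes": ["metadataEmbedding"]},
        "query": {
            "nested": {
                "path": "metadataEmbedding",
                "query": {
                    "knn": {
                        "metadataEmbedding.knn": {
                            "vector": list(vector),
                            "k": k,
                            "filter": {"term": {"accountId": account_id}},
                        }
                    }
                },
            }
        },
    }


# Placeholder 384-dimensional vector; a real run would use the embedding
# returned by the _predict call for the failing query_text.
placeholder_vector = [0.0] * 384

for k in (38, 1000, 10000):
    body = build_nested_knn_query(placeholder_vector, k, "example-account")
    knn = body["query"]["nested"]["query"]["knn"]["metadataEmbedding.knn"]
    print(f"k={knn['k']} size={body['size']} filter={json.dumps(knn['filter'])}")
```

Each body can then be sent to the _search endpoint shown above, and hits.total.value compared across the values of k.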
What is the expected behavior?
Getting n hits, where n is the minimum of k and the number of documents that match the efficient filter.
What is your host/environment?
- OpenSearch version: 2.17.1
Any idea what the issue could be here? Am I right to expect k hits for nested fields with an efficient filter?
Thanks for your help.