Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
2.19.1
Describe the issue:
I followed the available documentation and created a k-NN index in which every record has a text field that is processed during ingestion: its content is split into chunks, and each chunk is used to generate a vector embedding. As a result, each record has a multi-valued field named “nested_chunks_embeddings” that contains nested elements structured like this:
{
  "text" : "textual content of the chunk",
  "embedding" : "embedding generated from the textual content of the chunk"
}
I now want to run a hybrid search on the index to find the CHUNKS that best match the query text given as input. I can’t store chunks as separate records. What I want is a list of records that satisfy the query and, for each record, the list of chunks that actually matched.
To do this, I used a nested query on the nested field mentioned above and put a hybrid query inside it:
- semantic on “nested_chunks_embeddings.embedding”
- lexical on “nested_chunks_embeddings.text”
By adding the inner_hits option I get the chunks that matched, but the normalization processor does not seem to be applied. The score of each chunk appears to be the plain sum of the semantic and lexical scores: the lexical score is not normalized, and neither score gets the weight configured in the normalization processor. This makes the chunk scores unusable for ranking.
What I basically need is to find the chunks that best satisfy the search criteria both semantically and lexically. Is there any other way to achieve this?
Configuration:
Example index:
PUT testindex
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "semantic_field_to_use_for_chunking": {
        "type": "text"
      },
      "nested_chunks_embeddings": {
        "type": "nested",
        "properties": {
          "text": {
            "type": "text"
          },
          "embedding": {
            "dimension": 768,
            "type": "knn_vector"
          }
        }
      }
    }
  }
}
Query that I used:
GET /testindex/_search?search_pipeline=hybrid_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "nested_chunks_embeddings",
      "inner_hits": {"from": 0},
      "query": {
        "hybrid": {
          "queries": [
            {
              "neural": {
                "nested_chunks_embeddings.embedding": {
                  "query_text": "pipeline configuration in opensearch",
                  "model_id": "PZCY0pUB9e1VVreM-Wei",
                  "expand_nested_docs": true,
                  "filter": {
                    "match": {
                      "nested_chunks_embeddings.field": "passage_text"
                    }
                  }
                }
              }
            },
            {
              "query_string": {
                "query": "pipeline AND configuration AND opensearch",
                "fields": [
                  "nested_chunks_embeddings.chunk"
                ]
              }
            }
          ]
        }
      }
    }
  }
}
The search pipeline is standard:
PUT _search/pipeline/hybrid_search
{
  "description": "processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.5,
              0.5
            ]
          }
        }
      }
    }
  ]
}
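To make the expectation concrete, the sketch below shows the textbook calculation that min_max normalization followed by a weighted arithmetic_mean would give for per-chunk scores, next to the plain sum that the inner_hits scores seem to reflect. It is only an illustration in Python: the function names and the sample scores are made up, and it is not OpenSearch's actual implementation, which may handle edge cases such as equal scores differently.

```python
def min_max(scores):
    """Textbook min-max scaling of a list of scores into [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every one as a full match (an assumption).
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_mean(values, weights):
    """Weighted arithmetic mean of one chunk's normalized sub-query scores."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical raw scores for three chunks (not taken from a real response).
semantic = [0.82, 0.75, 0.40]   # k-NN scores, already in a small range
lexical = [12.4, 3.1, 7.9]      # BM25 scores, unbounded
weights = [0.5, 0.5]            # same weights as in the pipeline above

# Expected: normalize each sub-query separately, then combine with weights.
semantic_norm = min_max(semantic)
lexical_norm = min_max(lexical)
combined = [
    weighted_mean(pair, weights)
    for pair in zip(semantic_norm, lexical_norm)
]

# Observed (apparently): raw scores simply added, so BM25 dominates.
raw_sum = [s + l for s, l in zip(semantic, lexical)]

print("expected:", combined)
print("observed:", raw_sum)
```

In the expected variant every chunk score stays between 0 and 1 and both sub-queries contribute equally; in the raw sum the lexical score dominates the ranking.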
Another approach I tried is to use a hybrid query as the outer wrapper and make both the lexical and the semantic queries nested. This way the scores seem fine, but I lose inner_hits, which is essential. The query:
GET /testindex/_search?search_pipeline=hybrid_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "nested": {
            "score_mode": "max",
            "path": "nested_chunks_embeddings",
            "inner_hits": {},
            "query": {
              "neural": {
                "nested_chunks_embeddings.embedding": {
                  "query_text": "pipeline configuration in opensearch",
                  "model_id": "PZCY0pUB9e1VVreM-Wei",
                  "expand_nested_docs": true,
                  "filter": {
                    "match": {
                      "nested_chunks_embeddings.field": "passage_text"
                    }
                  }
                }
              }
            }
          }
        },
        {
          "nested": {
            "score_mode": "max",
            "path": "nested_chunks_embeddings",
            "inner_hits": {},
            "query": {
              "query_string": {
                "query": "pipeline AND configuration AND opensearch",
                "fields": [
                  "nested_chunks_embeddings.chunk"
                ]
              }
            }
          }
        }
      ]
    }
  }
}
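A possible client-side fallback, sketched under assumptions: run the neural and the lexical nested queries as two separate searches, each with inner_hits, and combine the per-chunk scores outside OpenSearch. The Python sketch below works on plain response dictionaries. It assumes the default inner_hits name (the nested path) and the `_nested.offset` value that nested inner hits carry; the helper names and the sample responses are hypothetical, and the normalization is the textbook min-max formula rather than the processor's exact logic.

```python
PATH = "nested_chunks_embeddings"


def chunk_scores(response, path=PATH):
    """Collect {(parent _id, chunk offset): score} from nested inner_hits."""
    scores = {}
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get(path, {})
        for chunk in inner.get("hits", {}).get("hits", []):
            key = (hit["_id"], chunk["_nested"]["offset"])
            scores[key] = chunk["_score"]
    return scores


def min_max(scores):
    """Textbook min-max scaling of a {key: score} dict into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (s - lo) / (hi - lo) for key, s in scores.items()}


def combine(semantic_resp, lexical_resp, weights=(0.5, 0.5)):
    """Rank chunks by the weighted mean of their normalized scores.

    A chunk missing from one response contributes 0.0 for that sub-query.
    """
    sem = min_max(chunk_scores(semantic_resp))
    lex = min_max(chunk_scores(lexical_resp))
    w_sem, w_lex = weights
    combined = {
        key: (w_sem * sem.get(key, 0.0) + w_lex * lex.get(key, 0.0))
        / (w_sem + w_lex)
        for key in set(sem) | set(lex)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


def fake_response(chunks):
    """Build a response-shaped dict from {(doc_id, offset): score} (sample data only)."""
    by_doc = {}
    for (doc_id, offset), score in chunks.items():
        by_doc.setdefault(doc_id, []).append(
            {"_nested": {"field": PATH, "offset": offset}, "_score": score}
        )
    return {
        "hits": {
            "hits": [
                {"_id": doc_id, "inner_hits": {PATH: {"hits": {"hits": inner}}}}
                for doc_id, inner in by_doc.items()
            ]
        }
    }


# Hypothetical inner-hit scores from the two separate searches.
semantic_resp = fake_response({("1", 0): 0.9, ("1", 1): 0.7, ("2", 0): 0.5})
lexical_resp = fake_response({("1", 1): 10.0, ("2", 0): 2.0})

ranked = combine(semantic_resp, lexical_resp)
for (doc_id, offset), score in ranked:
    print(doc_id, offset, round(score, 3))
```

Note that inner_hits returns only the top 3 nested hits per record by default, so its `size` would have to be raised for the normalization to see all matching chunks. This fallback also costs two round trips and normalizes only over the chunks actually returned.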
Relevant Logs or Screenshots: