Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.18
Describe the issue:
I am using a text chunking ingest processor to store not only the full text of a document but also its chunks. I am also using hybrid search, so that both neural search and BM25 retrieve relevant documents. This all works, but I only get a score for the whole document. I would also like a score per chunk, so that I can extract the most relevant chunk from each document. Is there a tutorial I can use to learn how to do this? It may only be a small configuration change, but I am still new to OpenSearch, so I am not sure how to approach it. You can find the whole configuration below.
Configuration:
I based my configuration mostly on the OpenSearch documentation.
For the ingest pipeline, I followed the text chunking example:
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "LMLPWY4BROvhdbtgETaI",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
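To sanity-check how the chunking behaves before indexing anything, the pipeline can be exercised with the ingest `_simulate` API (a sketch; the sample `passage_text` is made up, and the pipeline name `nlp-ingest-pipeline` comes from my index settings below):

```json
POST /_ingest/pipeline/nlp-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked. The simulate response should show the passage_chunk array and one embedding per chunk."
      }
    }
  ]
}
```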
For the index, I followed the same text chunking tutorial:
{
  "settings": {
    "index": {
      "knn": true,
      "default_pipeline": "nlp-ingest-pipeline"
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768
          }
        }
      }
    }
  }
}
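Documents are then indexed normally; the default pipeline chunks and embeds them on ingest. A sketch, assuming the index is called `testindex` (the index name and sample text are my own placeholders):

```json
PUT /testindex/_doc/1
{
  "passage_text": "Hi world. This is a longer passage that the ingest pipeline splits into overlapping ten-token chunks before embedding each chunk."
}
```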
For the search pipeline, I followed the hybrid search tutorial:
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [0.2, 0.8]
          }
        }
      }
    }
  ],
  "request_processors": [
    {
      "neural_query_enricher": {
        "default_model_id": "LMLPWY4BROvhdbtgETaI"
      }
    }
  ]
}
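I attach this search pipeline to the index as its default, roughly like this (the pipeline name `nlp-search-pipeline` and index name `testindex` are my own choices, not from the tutorial):

```json
PUT /testindex/_settings
{
  "index.search.default_pipeline": "nlp-search-pipeline"
}
```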
Finally, I use a hybrid query to perform the search:
{
  "_source": {
    "excludes": [
      "passage_chunk_embedding"
    ]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "passage_text": {
              "query": "Hi world"
            }
          }
        },
        {
          "nested": {
            "score_mode": "max",
            "path": "passage_chunk_embedding",
            "query": {
              "neural": {
                "passage_chunk_embedding.knn": {
                  "query_text": "document",
                  "model_id": "-tHZeI4BdQKclr136Wl7"
                }
              }
            }
          }
        }
      ]
    }
  }
}
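From what I have read, a nested query can take an `inner_hits` block, which returns the matching nested documents (here: chunks) with their own scores. I am not sure whether `inner_hits` is supported inside a `hybrid` query on my version, but this is the variant I would try (only the nested clause changes):

```json
{
  "nested": {
    "score_mode": "max",
    "path": "passage_chunk_embedding",
    "inner_hits": {},
    "query": {
      "neural": {
        "passage_chunk_embedding.knn": {
          "query_text": "document",
          "model_id": "-tHZeI4BdQKclr136Wl7"
        }
      }
    }
  }
}
```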