Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.18
Describe the issue:
I am using a text chunking ingest processor to store not only the full text of a document but also its chunks. I am also using hybrid search, so that both neural search and BM25 retrieve relevant documents. This all works, but I only get a score for the whole document. I would also like a score per chunk, so that I can extract the most relevant chunk from each document. Is there a tutorial I can use to learn how to do this? It may only be a small configuration change, but I am still new to OpenSearch, so I am not sure how to approach it. You can find the whole configuration below.
Configuration:
I based my configuration mostly on the OpenSearch documentation.
For the ingest pipeline, I followed the text chunking example:
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "LMLPWY4BROvhdbtgETaI",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
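To sanity-check how the chunking behaves before indexing anything, the pipeline can be exercised with the ingest `_simulate` API (a sketch; the sample `passage_text` is made up, and the pipeline name `nlp-ingest-pipeline` comes from my index settings below):

```json
POST /_ingest/pipeline/nlp-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked. The simulate response should show the passage_chunk array and one embedding per chunk."
      }
    }
  ]
}
```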
For the index, I followed the same text chunking tutorial:
{
  "settings": {
    "index": {
      "knn": true,
      "default_pipeline": "nlp-ingest-pipeline"
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768
          }
        }
      }
    }
  }
}
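Documents are then indexed normally; the default pipeline chunks and embeds them on ingest. A sketch, assuming the index is called `testindex` (the index name and sample text are my own placeholders):

```json
PUT /testindex/_doc/1
{
  "passage_text": "Hi world. This is a longer passage that the ingest pipeline splits into overlapping ten-token chunks before embedding each chunk."
}
```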
For the search pipeline, I followed the hybrid search tutorial:
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [0.2, 0.8]
          }
        }
      }
    }
  ],
  "request_processors": [
    {
      "neural_query_enricher": {
        "default_model_id": "LMLPWY4BROvhdbtgETaI"
      }
    }
  ]
}
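I attach this search pipeline to the index as its default, roughly like this (the pipeline name `nlp-search-pipeline` and index name `testindex` are my own choices, not from the tutorial):

```json
PUT /testindex/_settings
{
  "index.search.default_pipeline": "nlp-search-pipeline"
}
```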
Finally, I use a hybrid query to perform the search:
{
  "_source": {
    "excludes": [
      "passage_chunk_embedding"
    ]
  },
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "passage_text": {
              "query": "Hi world"
            }
          }
        },
        {
          "nested": {
            "score_mode": "max",
            "path": "passage_chunk_embedding",
            "query": {
              "neural": {
                "passage_chunk_embedding.knn": {
                  "query_text": "document",
                  "model_id": "-tHZeI4BdQKclr136Wl7"
                }
              }
            }
          }
        }
      ]
    }
  }
}
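From what I have read, a nested query can take an `inner_hits` block, which returns the matching nested documents (here: chunks) with their own scores. I am not sure whether `inner_hits` is supported inside a `hybrid` query on my version, but this is the variant I would try (only the nested clause changes):

```json
{
  "nested": {
    "score_mode": "max",
    "path": "passage_chunk_embedding",
    "inner_hits": {},
    "query": {
      "neural": {
        "passage_chunk_embedding.knn": {
          "query_text": "document",
          "model_id": "-tHZeI4BdQKclr136Wl7"
        }
      }
    }
  }
}
```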