Neural Search Plugin Chunking For Large Text

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.9

Describe the issue:
Does the Neural Search Plugin ingestion pipeline support chunking for large text? For example, if I have a complete page, or even the text of a full document, would it split the text into chunks and generate a pooled embedding from the chunks in the document?


Can you elaborate? I guess you mean the following steps:

  1. Split a big document into smaller chunks
  2. Calculate embedding for each chunk
  3. Calculate pooled embedding for all chunks

Is that correct? If so, neural search doesn't support this feature at the moment. Why generate a pooled embedding rather than save an embedding for each chunk?
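
For readers following along, the three steps above can be sketched outside of OpenSearch roughly as follows. This is only an illustration, not something the Neural Search plugin provides: `toy_embed` is a hypothetical stand-in for a real embedding model, the word-count chunking and the mean pooling are assumptions (other splitting and pooling strategies are possible), and all function names are made up for this example.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 100) -> List[str]:
    """Step 1: split a document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(chunk: str, dim: int = 4) -> List[float]:
    """Hypothetical placeholder for a real embedding model call.

    It only produces a deterministic vector of length dim so the sketch runs.
    """
    vec = [0.0] * dim
    for i, ch in enumerate(chunk):
        vec[i % dim] += ord(ch)
    n = max(len(chunk), 1)
    return [v / n for v in vec]


def mean_pool(vectors: List[List[float]]) -> List[float]:
    """Step 3: average the chunk embeddings element-wise."""
    if not vectors:
        raise ValueError("cannot pool an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def pooled_document_embedding(
    text: str,
    embed: Callable[[str], List[float]] = toy_embed,
    max_words: int = 100,
) -> List[float]:
    """Run steps 1 to 3 and return a single vector for the whole document."""
    chunks = split_into_chunks(text, max_words)
    chunk_vectors = [embed(c) for c in chunks]  # Step 2: one embedding per chunk
    return mean_pool(chunk_vectors)
```

The resulting single vector could then be indexed in one document alongside its metadata. Note that an empty input produces no chunks, so `mean_pool` raises a `ValueError` in that case.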

Yes, what you said is correct. Sometimes you don't want to store embeddings for all chunks, in order to save storage space. You only want the relevant text blocks in one embedding. Also, neural search currently does not support nested documents. In addition, storing each chunk means I have to repeat the document metadata (for a user manual, for instance) for each chunk, which is an unnecessary waste of storage.
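
To make the storage argument concrete, here is a rough back-of-the-envelope sketch. It assumes float32 vectors (4 bytes per value) and ignores index overhead and compression; the function names and the example sizes (50 chunks, 768 dimensions, 500 bytes of metadata) are hypothetical.

```python
def embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of num_vectors dense vectors, assuming float32 by default."""
    return num_vectors * dim * bytes_per_value


def per_chunk_storage(num_chunks: int, dim: int, metadata_bytes: int) -> int:
    """One indexed document per chunk: metadata is repeated for every chunk."""
    return embedding_bytes(num_chunks, dim) + num_chunks * metadata_bytes


def pooled_storage(dim: int, metadata_bytes: int) -> int:
    """One indexed document with a single pooled vector and one metadata copy."""
    return embedding_bytes(1, dim) + metadata_bytes


# Hypothetical example: a manual split into 50 chunks, 768-dimensional vectors,
# and about 500 bytes of metadata per indexed document.
print(per_chunk_storage(50, 768, 500))  # 178600
print(pooled_storage(768, 500))         # 3572
```

The trade-off is that a single pooled vector loses the ability to match a query against an individual chunk.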

Thanks for the explanation. Could you open a GitHub issue with a feature request for this in the opensearch-project/ml-commons repo (Issues · opensearch-project/ml-commons · GitHub)?
