Storing sentences separately consumes too much storage space

Versions (relevant - OpenSearch/Dashboard/Server OS/Browser):
OpenSearch 2.9

Describe the issue:
I have a Wikipedia dataset of 1.2 million articles that I want to store in OpenSearch. When I store each article as one document, the index takes 1.9GB. But if I store each sentence as an independent document, I end up with 4.4GB, about 2.3 times larger! The source data is exactly the same for both indices and it is not replicated in any way.
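To make the comparison concrete, here is the arithmetic as a tiny sketch. It uses only the two index sizes quoted above; the variable names are just illustrative.

```python
# Index sizes reported above, in GB.
per_article_index_gb = 1.9   # one document per article
per_sentence_index_gb = 4.4  # one document per sentence

# How many times larger the per-sentence index is.
ratio = per_sentence_index_gb / per_article_index_gb

# Extra space attributable to splitting into sentences.
extra_gb = per_sentence_index_gb - per_article_index_gb

print(f"ratio: {ratio:.2f}x, extra: {extra_gb:.1f} GB")
```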


  1. Why is it consuming so much space? What is the overhead associated with creating documents?

  2. Is there a way to bring that size down to something more reasonable? I expect the index size to be roughly the dataset size times a small factor, such as size * 1.2.

  3. I want the above to satisfy two scenarios:
    a. Store documents as pages or paragraphs so that search queries can highlight which page was matched (this feedback is important for jumping directly to the matching page rather than leaving the user to scroll).

    b. Store embeddings per sentence or paragraph to perform semantic search. Embeddings on their own consume a lot of space, so we want to eliminate any unneeded per-document storage overhead.

One straightforward way to do this would be to store the paragraphs as an array while letting the user retrieve the index of the paragraph that matched. But is that possible today in OpenSearch (even through plugins/extensions)?
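To illustrate what I mean, here is a minimal sketch of the closest built-in mechanism I am aware of: a `nested` field queried with `inner_hits`, where each inner hit carries a `_nested.offset` giving the position of the matching element in the array. The field names (`title`, `paragraphs`, `text`) and the helper names are hypothetical, and the code only builds the request bodies and parses a response; it does not talk to a cluster. Note that nested objects are indexed as separate hidden Lucene documents, so I am assuming, without having measured it, that this may not remove the per-document overhead.

```python
# Hypothetical mapping: each article keeps its paragraphs as a nested array.
INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "paragraphs": {
                "type": "nested",
                "properties": {"text": {"type": "text"}},
            },
        }
    }
}


def build_query(terms):
    """Build a nested query body that asks for inner hits on the paragraphs."""
    return {
        "_source": ["title"],
        "query": {
            "nested": {
                "path": "paragraphs",
                "query": {"match": {"paragraphs.text": terms}},
                # inner_hits reports which nested elements matched.
                "inner_hits": {
                    "highlight": {"fields": {"paragraphs.text": {}}}
                },
            }
        },
    }


def matched_paragraphs(response):
    """Return (document id, paragraph offset) pairs from a search response."""
    matches = []
    for hit in response["hits"]["hits"]:
        inner = hit.get("inner_hits", {}).get("paragraphs", {})
        for inner_hit in inner.get("hits", {}).get("hits", []):
            matches.append((hit["_id"], inner_hit["_nested"]["offset"]))
    return matches
```

The body returned by `build_query` would be sent to the index's `_search` endpoint, and `matched_paragraphs` would then give the zero-based paragraph positions to scroll to.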