Best way to store document chunks for vector search: what's the production standard?

Hi, working on a RAG setup and trying to land on a sensible production architecture for chunk storage and retrieval. Curious what others are running at scale.

Large documents get split into chunks at ingestion, and each chunk gets a vector embedding. The parent document has metadata that may change over time; the chunk text and vectors should stay the same after indexing.

We’ve looked at three approaches:

Flat chunks (each chunk is its own document with a parent_id field): the relationship between chunk and parent exists only on the application side, the engine has no awareness of it at all. So beyond the basic indexing, the application has to manage the full lifecycle: grouping search results by parent, picking the best scoring chunk, extracting the matched text, over-fetching to end up with enough results after deduplication, cleaning up orphan chunks on parent delete, and keeping parent metadata in sync on every chunk. On top of that, any parent field used as a search filter has to be copied onto every chunk document, so changing it means updating potentially hundreds of documents at once.
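
For example, with a parent-level department field used as a search filter (names illustrative), every parent metadata change turns into something like:

POST /chunks-index/_update_by_query
{
  "query": {"term": {"parent_id": "42"}},
  "script": {
    "source": "ctx._source.department = params.dept",
    "lang": "painless",
    "params": {"dept": "legal"}
  }
}

One parent edit = one write per chunk document.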

Nested (chunks as nested objects on the root document): the relationship is managed by the engine, which is the main appeal. Engine handles parent deduplication natively and returns the parent document directly from a chunk-level vector search, no grouping logic needed on our side. Parent-level filters also work without copying fields onto every chunk. What we’re less sure about is production behaviour: the docs mention a performance overhead for nested queries compared to flat, and updating any field on the parent rewrites the whole block including all nested chunks. For frequent metadata updates on large documents, is this a real problem in practice or not noticeable?
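
For reference, the nested shape we're considering would be roughly this (field names illustrative):

PUT /docs-index
{
  "settings": {"index.knn": true},
  "mappings": {
    "properties": {
      "title": {"type": "text"},
      "department": {"type": "keyword"},
      "chunks": {
        "type": "nested",
        "properties": {
          "text": {"type": "text"},
          "embedding": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {"name": "hnsw", "engine": "faiss", "space_type": "l2"}
          }
        }
      }
    }
  }
}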

Parent/Child join: we looked at this briefly and dropped it. The docs explicitly say has_child/has_parent queries add significant overhead, and there are threads here with 12+ second query times even on small datasets.

So the question is: for this kind of chunk storage setup, is nested the standard approach now? The documentation all seems to push in that direction. Or is the nested query overhead actually noticeable in production, so teams prefer to deal with the additional logic on the application side?

Hi @grunggy, good question, and you’ve already correctly rejected Option 3 (parent/child join). I’ve tested this on a live cluster; see my findings below. If anyone else would like to weigh in, that would be great.

Short answer: use nested documents. On OpenSearch 3.1+, the semantic field type implements this for you automatically.

Your concern about “parent field rewrites” is understandable, but it’s ultimately not a reason to choose flat. The flat approach comes with caveats of its own, see below:

Problem 1: Duplicate parent hits
A document with multiple chunks can dominate your results. Running the same query against a flat index vs a nested index on identical data:

Flat index result for query “filtering production vector search”
Doc 4_chunk_1 | parent=4 | score=0.1330 # same parent
Doc 3_chunk_0 | parent=3 | score=0.1283
Doc 4_chunk_0 | parent=4 | score=0.1230 # same parent AGAIN
Doc 1_chunk_0 | parent=1 | score=0.1218
Doc 2_chunk_0 | parent=2 | score=0.1189

Nested index result, same query, same data
Doc 4 | Complete Guide to Vector Search | score=0.1330 # once, best chunk wins
Doc 3 | Comparison of ANN Algorithms | score=0.1283
Doc 1 | Introduction to Vector Databases | score=0.1218
Doc 2 | HNSW Algorithm Deep Dive | score=0.1189

At scale, a long document split into 20 chunks can consume your entire top-10. Every application that uses flat chunks has to deduplicate results before passing context to the LLM, and that logic is not trivial (do you pick the highest-scoring chunk per parent? merge chunks? re-rank after dedup?).

Problem 2: Orphaned chunks on delete

When you delete a document from a nested index, all its chunks are gone atomically, in one operation. With flat chunks:

Delete parent doc from flat index, chunks are NOT deleted

DELETE /my-index/_doc/parent-42

These chunk docs still exist and are still being returned in searches:

GET /my-index/_search
{"query": {"term": {"parent_id": "42"}}}
# returns 8 chunk docs that no longer have a parent

You must remember to run this yourself, every time:

POST /my-index/_delete_by_query
{"query": {"term": {"parent_id": "42"}}}

In practice this means every delete in your application is two operations with no atomicity guarantee between them.
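
The nested equivalent is a single atomic call; there is nothing to orphan:

DELETE /my-rag-index/_doc/parent-42
# chunks are nested inside the document, so they are removed in the same operation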

Your concern about metadata update overhead is addressed more practically by skip_existing_embedding: true on the semantic field: it detects whether the source text changed and skips the ML inference call if it hasn’t. The write cost of the document itself is the same either way; the expensive part is the embedding model call, and that’s handled (see the update example under Step 4 below).

Example:

Step 1: Register and deploy your embedding model

POST /_plugins/_ml/model_groups/_register
{"name": "rag-models", "description": "Models for RAG pipeline"}

POST /_plugins/_ml/models/_register
{
  "name": "my-embedding-model",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT",
  "function_name": "TEXT_EMBEDDING",
  "model_group_id": "<model_group_id>",
  "model_content_hash_value": "<sha256-of-zip>",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 768,
    "framework_type": "sentence_transformers",
    "additional_config": {
      "space_type": "l2"
    }
  },
  "url": "<model-zip-url>"
}

POST /_plugins/_ml/models/<model_id>/_deploy
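
Optional sanity check: _deploy returns a task_id you can poll, and the model itself should report state DEPLOYED before you index anything:

GET /_plugins/_ml/tasks/<task_id>
GET /_plugins/_ml/models/<model_id>
# look for "model_state": "DEPLOYED" in the model response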

Step 2: Create the index

PUT /my-rag-index
{
  "settings": {"index.knn": true},
  "mappings": {
    "properties": {
      "title": {"type": "text"},
      "body": {
        "type": "semantic",
        "model_id": "<model_id>",
        "skip_existing_embedding": true,
        "chunking": [
          {
            "algorithm": "fixed_token_length",
            "parameters": {
              "token_limit": 300,
              "overlap_rate": 0.1,
              "tokenizer": "standard"
            }
          }
        ]
      }
    }
  }
}

Step 3: Verify the auto-generated mapping

GET /my-rag-index/_mapping

You will see that body expanded into body_semantic_info with this structure; this is what the engine itself considers the correct production layout:

"body_semantic_info": {
  "properties": {
    "chunks": {
      "type": "nested",
      "properties": {
        "text":      {"type": "text"},
        "embedding": {"type": "knn_vector", "dimension": 768, ...}
      }
    },
    "model": {
      "properties": {
        "id":   {"type": "text", "index": false},
        "name": {"type": "text", "index": false},
        "type": {"type": "text", "index": false}
      }
    }
  }
}

Step 4: Index documents (no pipeline setup needed)

PUT /my-rag-index/_doc/1
{
  "title": "My Document Title",
  "body": "Your full document text here. The semantic field handles chunking and embedding automatically during ingest."
}
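
Tying back to your metadata-update concern: resend the document with changed metadata but identical body text, and skip_existing_embedding should skip the inference call entirely (the block rewrite still happens, but that’s the cheap part):

PUT /my-rag-index/_doc/1
{
  "title": "My Document Title (revised)",
  "body": "Your full document text here. The semantic field handles chunking and embedding automatically during ingest."
}
# body unchanged → embeddings reused, no ML inference call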

Step 5: Search

For RAG, hybrid search (BM25 + neural) consistently outperforms either alone. First create a normalization pipeline:

PUT /_search/pipeline/hybrid-rag-pipeline
{
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {"technique": "min_max"},
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {"weights": [0.3, 0.7]}
        }
      }
    }
  ]
}

Then query. Note two things: you target the semantic field (body) directly, not the internal path, and the pipeline weights apply in sub-query order (0.3 to match, 0.7 to neural):

GET /my-rag-index/_search?search_pipeline=hybrid-rag-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "body": {"query": "your query text here"}
            # works because the semantic field defaults raw_field_type to "text"
          }
        },
        {
          "neural": {
            "body": {
              "query_text": "your query text here",
              "model_id": "<model_id>",
              "k": 5
            }
          }
        }
      ]
    }
  },
  "_source": ["title", "body"]
}

Why hybrid outperforms pure neural for RAG: same query, three approaches, same 4-document dataset:

Query: “ACORN filtering production vector search”

The corpus had one document that explicitly covered ACORN and production filtering,
and one that covered general vector database concepts with no mention of filtering.

Pure BM25: Filtering doc ranked 1st, exact term match on “ACORN”, “filtering”, “production”
Pure neural: General vectors doc ranked 1st, small model failed to discriminate semantically
Hybrid: Filtering doc ranked 1st, BM25 term signal rescued the neural ranking failure,
with a much wider margin between 1st and 2nd (0.79 vs 0.70 after min-max normalisation)
vs pure neural where scores were nearly indistinguishable (0.1294 vs 0.1368)
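
For completeness, the pure BM25 and pure neural runs above would just be the two halves of the hybrid query executed on their own:

GET /my-rag-index/_search
{"query": {"match": {"body": {"query": "ACORN filtering production vector search"}}}}

GET /my-rag-index/_search
{"query": {"neural": {"body": {"query_text": "ACORN filtering production vector search", "model_id": "<model_id>", "k": 5}}}}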

*BM25 and neural complement each other’s failure modes: BM25 handles exact terminology, neural handles paraphrase and synonym matching. The scores above are illustrative; with a real production model the neural component would carry more weight.